Skip to content

Repository including the code for some projects that have to be done to receive the certificate on freecodecamp.org

Notifications You must be signed in to change notification settings

LaiqianDS/fcc-DataAnalysisWithPython

Repository files navigation

Data Analysis With Python

Repository including the code for some projects that have to be done to receive the certificate on freecodecamp.org

Replit repositories: https://replit.com/@LaiqianJi

Mean-Variance-Standard Deviation Calculator

Create a function named calculate() in mean_var_std.py that uses Numpy to output the mean, variance, standard deviation, max, min, and sum of the rows, columns, and elements in a 3 x 3 matrix. The input of the function should be a list containing 9 digits. The function should convert the list into a 3 x 3 Numpy array, and then return a dictionary containing the mean, variance, standard deviation, max, min, and sum along both axes and for the flattened matrix. The returned dictionary should follow this format:

{ 'mean': [axis1, axis2, flattened], 'variance': [axis1, axis2, flattened], 'standard deviation': [axis1, axis2, flattened], 'max': [axis1, axis2, flattened], 'min': [axis1, axis2, flattened], 'sum': [axis1, axis2, flattened] }

If a list containing less than 9 elements is passed into the function, it should raise a ValueError exception with the message: "List must contain nine numbers." The values in the returned dictionary should be lists and not Numpy arrays.

For example, calculate([0,1,2,3,4,5,6,7,8]) should return:

{ 'mean': [[3.0, 4.0, 5.0], [1.0, 4.0, 7.0], 4.0], 'variance': [[6.0, 6.0, 6.0], [0.6666666666666666, 0.6666666666666666, 0.6666666666666666], 6.666666666666667], 'standard deviation': [[2.449489742783178, 2.449489742783178, 2.449489742783178], [0.816496580927726, 0.816496580927726, 0.816496580927726], 2.581988897471611], 'max': [[6, 7, 8], [2, 5, 8], 8], 'min': [[0, 1, 2], [0, 3, 6], 0], 'sum': [[9, 12, 15], [3, 12, 21], 36] }

Demographic Data Analyzer

In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database. You must use Pandas to answer the following questions:

  • How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)
  • What is the average age of men?
  • What is the percentage of people who have a Bachelor's degree?
  • What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
  • What percentage of people without advanced education make more than 50K?
  • What is the minimum number of hours a person works per week?
  • What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
  • What country has the highest percentage of people that earn >50K and what is that percentage?
  • Identify the most popular occupation for those who earn >50K in India.

Medical Data Visualizer

The rows in the dataset represent patients and the columns represent information like body measurements, results from various blood tests, and lifestyle choices. You will use the dataset to explore the relationship between cardiac disease, body measurements, blood markers, and lifestyle choices.

File name: medical_examination.csv
Tasks

Create a chart similar to examples/Figure_1.png, where we show the counts of good and bad outcomes for the cholesterol, gluc, alco, active, and smoke variables for patients with cardio=1 and cardio=0 in different panels.

Use the data to complete the following tasks in medical_data_visualizer.py:

  • Add an overweight column to the data. To determine if a person is overweight, first calculate their BMI by dividing their weight in kilograms by the square of their height in meters. If that value is > 25 then the person is overweight. Use the value 0 for NOT overweight and the value 1 for overweight.
  • Normalize the data by making 0 always good and 1 always bad. If the value of cholesterol or gluc is 1, make the value 0. If the value is more than 1, make the value 1.
  • Convert the data into long format and create a chart that shows the value counts of the categorical features using seaborn's catplot(). The dataset should be split by 'Cardio' so there is one chart for each cardio value. The chart should look like examples/Figure_1.png.
  • Clean the data. Filter out the following patient segments that represent incorrect data:
    diastolic pressure is higher than systolic (Keep the correct data with (df['ap_lo'] <= df['ap_hi']))
    height is less than the 2.5th percentile (Keep the correct data with (df['height'] >= df['height'].quantile(0.025)))
    height is more than the 97.5th percentile
    weight is less than the 2.5th percentile
    weight is more than the 97.5th percentile
  • Create a correlation matrix using the dataset. Plot the correlation matrix using seaborn's heatmap(). Mask the upper triangle. The chart should look like examples/Figure_2.png.
Feature Variable Type Variable Value Type
Age Objective Feature age int (days)
Height Objective Feature height int (cm)
Weight Objective Feature weight float (kg)
Gender Objective Feature gender categorical code
Systolic blood pressure Examination Feature ap_hi int
Diastolic blood pressure Examination Feature ap_lo int
Cholesterol Examination Feature cholesterol 1: normal, 2: above normal, 3: well above normal
Glucose Examination Feature gluc 1: normal, 2: above normal, 3: well above normal
Smoking Subjective Feature smoke binary
Alcohol intake Subjective Feature alco binary
Physical activity Subjective Feature active binary
Presence or absence of cardiovascular disease Target Variable cardio binary

Page View Time Series Visualizer

For this project you will visualize time series data using a line chart, bar chart, and box plots. You will use Pandas, Matplotlib, and Seaborn to visualize a dataset containing the number of page views each day on the freeCodeCamp.org forum from 2016-05-09 to 2019-12-03. The data visualizations will help you understand the patterns in visits and identify yearly and monthly growth.

Use the data to complete the following tasks:

  • Use Pandas to import the data from "fcc-forum-pageviews.csv". Set the index to the date column.
  • Clean the data by filtering out days when the page views were in the top 2.5% of the dataset or bottom 2.5% of the dataset.
  • Create a draw_line_plot function that uses Matplotlib to draw a line chart similar to "examples/Figure_1.png". The title should be Daily freeCodeCamp Forum Page Views 5/2016-12/2019. The label on the x axis should be Date and the label on the y axis should be Page Views.
  • Create a draw_bar_plot function that draws a bar chart similar to "examples/Figure_2.png". It should show average daily page views for each month grouped by year. The legend should show month labels and have a title of Months. On the chart, the label on the x axis should be Years and the label on the y axis should be Average Page Views.
  • Create a draw_box_plot function that uses Seaborn to draw two adjacent box plots similar to "examples/Figure_3.png". These box plots should show how the values are distributed within a given year or month and how it compares over time. The title of the first chart should be Year-wise Box Plot (Trend) and the title of the second chart should be Month-wise Box Plot (Seasonality). Make sure the month labels on bottom start at Jan and the x and y axis are labeled correctly. The boilerplate includes commands to prepare the data.

Sea level predictor

You will analyze a dataset of the global average sea level change since 1880. You will use the data to predict the sea level change through year 2050.

Use the data to complete the following tasks:

  • Use Pandas to import the data from epa-sea-level.csv.
  • Use matplotlib to create a scatter plot using the Year column as the x-axis and the CSIRO Adjusted Sea Level column as the y-axis.
  • Use the linregress function from scipy.stats to get the slope and y-intercept of the line of best fit. Plot the line of best fit over the top of the scatter plot. Make the line go through the year 2050 to predict the sea level rise in 2050.
  • Plot a new line of best fit just using the data from year 2000 through the most recent year in the dataset. Make the line also go through the year 2050 to predict the sea level rise in 2050 if the rate of rise continues as it has since the year 2000. The x label should be Year, the y label should be Sea Level (inches), and the title should be Rise in Sea Level.

About

Repository including the code for some projects that have to be done to receive the certificate on freecodecamp.org

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages