/Step_by_Step_Data_Analysis_Auto_Dataset

A step-by-step approach to data wrangling, descriptive statistical analysis, predictive analysis, model development, model evaluation, and decision making.

Hands-on Practice Learning Lab for Data Science

Overview


This repository presents a step-by-step approach to data wrangling, descriptive statistical analysis, predictive analysis, model development, model evaluation, and decision making. The project is part of the Data Analysis with Python course offered on Coursera.org. The dataset contains automobile data provided in the course and can be downloaded from IBM Cloud.

✓ Link to the dataset: Link
✓ Link to the description of each column of the dataset: Link
✓ Link to the notebook: Link

In this study:

  1. Missing values in the dataset are handled by:
  • replacing the missing-value marker ("?" in this dataset) with np.nan
  • finding the columns that contain missing values and counting the missing entries in each
  • replacing the missing values with the average of the values in the column
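
The three missing-value steps above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the tiny table and its column names (horsepower, price, make) are assumptions standing in for the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the auto data; names and values are illustrative only.
df = pd.DataFrame({
    "horsepower": ["111", "?", "154", "102"],
    "price": ["13495", "16500", "?", "13950"],
    "make": ["alfa-romero", "audi", "audi", "bmw"],
})

# Step 1: replace the "?" marker with np.nan so pandas treats it as missing.
df = df.replace("?", np.nan)

# Step 2: count missing entries per column and list the affected columns.
missing_counts = df.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Step 3: convert the affected columns to float and fill gaps with the column mean.
# This assumes every affected column is numeric, which holds for this toy table.
for col in columns_with_missing:
    df[col] = df[col].astype(float)
    df[col] = df[col].fillna(df[col].mean())
```

Mean imputation only makes sense for numerical columns; a categorical column with missing entries would need a different rule, such as the most frequent value.
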
  2. The data are prepared for analysis by:
  • normalizing the values with the z-score formula, (value - average(column)) / standard_deviation(column)
  • binning the values of a column into categorical groups (e.g., low, medium, and high)
  • turning categorical variables into quantitative indicator variables (e.g., 0 and 1)
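
A short sketch of the three preparation steps, assuming pandas and a made-up table. The column names, the choice of three equal-width bins, and the "fuel" prefix are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; names and values are illustrative only.
df = pd.DataFrame({
    "length": [168.8, 171.2, 176.6, 192.7],
    "horsepower": [48.0, 102.0, 154.0, 262.0],
    "fuel-type": ["gas", "diesel", "gas", "gas"],
})

# Normalization: (value - average(column)) / standard_deviation(column).
# pandas' std() uses the sample standard deviation (ddof=1).
df["length_z"] = (df["length"] - df["length"].mean()) / df["length"].std()

# Binning: three equal-width bins between the column minimum and maximum.
bin_edges = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)
df["horsepower_binned"] = pd.cut(
    df["horsepower"],
    bins=bin_edges,
    labels=["low", "medium", "high"],
    include_lowest=True,  # keep the minimum value inside the first bin
)

# Indicator variables: one 0/1 column per category.
dummies = pd.get_dummies(df["fuel-type"], prefix="fuel", dtype=int)
df = pd.concat([df, dummies], axis=1)
```
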
  3. Descriptive statistical analyses are performed using:
  • Chi-square and analysis of variance (ANOVA) methods for columns with the object data type
  • Pearson correlation (pearsonr) for columns with a numerical data type
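
The tests named above are available in scipy.stats. The sketch below uses a made-up table, so the column names and numbers are assumptions; it shows one plausible pairing of test and data type: Pearson correlation for two numerical columns, ANOVA for a numerical column grouped by a categorical one, and chi-square for two categorical columns.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample rows; names and values are illustrative only.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "fuel-type": ["gas", "gas", "diesel", "gas", "diesel", "gas", "gas", "gas", "diesel"],
    "engine-size": [90, 97, 108, 130, 152, 164, 108, 110, 120],
    "price": [6000, 7000, 8500, 15000, 18000, 21000, 9000, 9500, 11000],
})

# Pearson correlation between two numerical columns.
pearson_coef, pearson_p = stats.pearsonr(df["engine-size"], df["price"])

# One-way ANOVA: does the mean price differ between drive-wheel groups?
price_groups = [group["price"].to_numpy() for _, group in df.groupby("drive-wheels")]
f_stat, anova_p = stats.f_oneway(*price_groups)

# Chi-square test of association between two categorical columns.
contingency = pd.crosstab(df["drive-wheels"], df["fuel-type"])
chi2, chi2_p, dof, expected = stats.chi2_contingency(contingency)
```

A small p-value suggests the relationship is unlikely to be due to chance alone; with a table this small the chi-square result in particular should not be read as meaningful.
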
  4. The data are split for training and testing by:
  • splitting the data into training and test sets, with the test set holding 30% of the overall data
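
A 70/30 split like the one described can be done with scikit-learn's train_test_split. The synthetic data and the random_state value below are assumptions made so the sketch runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and the target; values are illustrative only.
rng = np.random.default_rng(0)
X = rng.uniform(60, 300, size=(100, 2))
y = 5000 + 80 * X[:, 0] + rng.normal(0, 500, size=100)

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```
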
  5. Model development is performed using:
  • Simple linear regression model
  • Multiple linear regression model
  • Single-variable polynomial regression model
  • Multivariate polynomial regression model
  • Ridge regression model
  • Grid search to find the Ridge regularization parameter (alpha) that yields the highest R-squared
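
The models listed above could be fitted along these lines with scikit-learn and NumPy. This is a sketch under stated assumptions: the data are synthetic, and the polynomial degree, the alpha grid, and the number of folds are arbitrary choices, not values taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with two predictors and a curved relationship (illustrative only).
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 1, size=(n, 2))
y = 10 + 20 * X[:, 0] + 15 * X[:, 0] ** 2 - 5 * X[:, 1] + rng.normal(0, 1, size=n)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Simple linear regression: one predictor.
simple = LinearRegression().fit(x_train[:, [0]], y_train)

# Multiple linear regression: all predictors.
multiple = LinearRegression().fit(x_train, y_train)

# Single-variable polynomial regression of degree 2 via NumPy.
poly_1d = np.poly1d(np.polyfit(x_train[:, 0], y_train, deg=2))

# Multivariate polynomial regression: expand features, then fit a linear model.
poly_multi = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_multi.fit(x_train, y_train)

# Ridge regression with a grid search over alpha, scored by R-squared.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), param_grid, scoring="r2", cv=4)
grid.fit(x_train, y_train)
best_alpha = grid.best_params_["alpha"]
best_ridge = grid.best_estimator_
```

GridSearchCV refits the best model on the full training set by default, so best_ridge can be used directly for prediction on the test set.
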
  6. Model evaluation and decision making are carried out using the following statistical methods:
  • Mean squared error (MSE)
  • R-squared
  • Cross validation
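
The three evaluation measures above map onto scikit-learn helpers as sketched below. The synthetic data, the linear model, and the choice of four folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the predictors and the target (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(60, 300, size=(100, 2))
y = 5000 + 80 * X[:, 0] + rng.normal(0, 500, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# Mean squared error: average of the squared prediction errors on the test set.
mse = mean_squared_error(y_test, y_pred)

# R-squared: share of the target's variance explained on the test set.
r2 = r2_score(y_test, y_pred)

# Cross validation: R-squared on each of 4 folds of the full data.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=4)
cv_mean, cv_std = cv_scores.mean(), cv_scores.std()
```

Comparing candidate models on the same test split, and on the mean and spread of the cross-validation scores, gives a basis for choosing between them: a lower MSE and a higher R-squared indicate a better fit.
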

About The Author

  • Farhad Davaripour is a finite element specialist and data science enthusiast with nearly three years of experience in research and development roles. He has a knack for problem-solving and a passion for data science (he holds the IBM Data Science Professional Certificate).
  • Connect with Farhad on LinkedIn.