This repository presents a step-by-step approach to data wrangling, descriptive statistical analysis, predictive analysis, model development, model evaluation, and decision making. The project is part of the Data Analysis with Python course offered on Coursera.org. The dataset contains automobile information provided in the course and can be downloaded from IBM Cloud.
✓ Link to the dataset: Link
✓ Link to the description of each column of the dataset: Link
✓ Link to the notebook: Link
In this study:
- missing values in the dataset are handled by:
- replacing the missing-value marker (? in this dataset) with np.nan
- finding the columns that contain missing values and counting the missing entries in each
- replacing the missing values with the average of the values in the column
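The missing-value steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's code: the small DataFrame and its column names (horsepower, price) are made-up stand-ins for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the auto dataset.
# Column names and values are illustrative assumptions only.
df = pd.DataFrame({
    "horsepower": ["111", "?", "154", "102"],
    "price": ["13495", "16500", "?", "13950"],
})

# 1. Replace the "?" marker with np.nan so pandas treats it as missing.
df = df.replace("?", np.nan)

# 2. Find the columns that contain missing values and count the gaps.
missing_counts = df.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)

# 3. Convert each affected column to numbers, then fill gaps with the column mean.
for col in columns_with_missing.index:
    df[col] = pd.to_numeric(df[col])
    df[col] = df[col].fillna(df[col].mean())

print(df)
```

Mean imputation keeps every row, at the cost of slightly shrinking the variance of the imputed columns.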
- the data is prepared for analysis by:
- normalizing the values with the z-score, (value - average(column)) / standard_deviation(column)
- binning numerical columns into categorical groups (e.g., low, medium, and high)
- turning categorical variables into quantitative indicator variables (e.g., 0 and 1)
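The three preparation steps above might look like the sketch below. The data, the column names (horsepower, fuel-type), and the choice of three equal-width bins are assumptions made for illustration; the notebook may use different columns or bin edges.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "horsepower": [48.0, 100.0, 150.0, 262.0],
    "fuel-type": ["gas", "diesel", "gas", "gas"],
})

# Z-score normalization: (value - mean) / standard deviation.
# Note that pandas' std() uses the sample standard deviation (ddof=1) by default.
hp = df["horsepower"]
df["horsepower-z"] = (hp - hp.mean()) / hp.std()

# Binning: three equal-width bins between the column minimum and maximum.
bin_edges = np.linspace(hp.min(), hp.max(), 4)
df["horsepower-binned"] = pd.cut(
    hp, bins=bin_edges, labels=["low", "medium", "high"], include_lowest=True
)

# Indicator (dummy) variables: one 0/1 column per category.
dummies = pd.get_dummies(df["fuel-type"], prefix="fuel", dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df)
```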
- descriptive statistical analyses are performed using:
- Chi-square and analysis of variance (ANOVA) tests for columns with object data type
- Pearson correlation (pearsonr) for columns with numerical data type
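As a rough sketch of the tests listed above, the snippet below uses scipy.stats on a small made-up table. The column names (drive-wheels, fuel-type, engine-size, price) and all values are hypothetical and only serve to show which function fits which pair of column types.

```python
import pandas as pd
from scipy import stats

# Toy data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "fuel-type": ["gas", "gas", "diesel", "gas", "diesel", "diesel", "gas", "gas", "diesel"],
    "engine-size": [90, 98, 110, 150, 170, 190, 108, 110, 120],
    "price": [7000, 7500, 8500, 16000, 18500, 21000, 8000, 9000, 10500],
})

# Pearson correlation between two numerical columns.
pearson_coef, pearson_p = stats.pearsonr(df["engine-size"], df["price"])

# One-way ANOVA: does the mean price differ across the categories of a column?
groups = [g["price"].to_numpy() for _, g in df.groupby("drive-wheels")]
f_stat, anova_p = stats.f_oneway(*groups)

# Chi-square test of independence between two categorical columns.
contingency = pd.crosstab(df["drive-wheels"], df["fuel-type"])
chi2, chi2_p, dof, expected = stats.chi2_contingency(contingency)

print("Pearson:", pearson_coef, pearson_p)
print("ANOVA:", f_stat, anova_p)
print("Chi-square:", chi2, chi2_p, dof)
```

With so few rows the chi-square expected counts are very small, so the p-value here is not meaningful; the call pattern is the point.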
- in-sample testing:
- splitting the data into training and test sets, with the test set containing 30% of the data
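A 70/30 split like the one described above can be done with scikit-learn's train_test_split. The synthetic arrays and the random_state value below are assumptions for the sake of a reproducible example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and target; shapes and coefficients are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hold out 30% of the rows for testing; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(x_train.shape, x_test.shape)
```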
- model development is performed using:
- Simple linear regression model
- Multiple linear regression model
- Single-variable polynomial regression model
- Multi-variable polynomial regression model
- Ridge regression model
- Grid search to find the Ridge regularization parameter (alpha) that yields the highest R-squared
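The model families listed above could be fitted along these lines with NumPy and scikit-learn. This is a sketch on synthetic data: the polynomial degree, the alpha grid, and the 4-fold setting are assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a mild quadratic effect; purely illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(120, 2))
y = (1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 - X[:, 1]
     + rng.normal(scale=0.2, size=120))

# Simple linear regression: a single predictor.
simple_lr = LinearRegression().fit(X[:, [0]], y)

# Multiple linear regression: all predictors.
multi_lr = LinearRegression().fit(X, y)

# Single-variable polynomial regression (degree 2) with NumPy.
poly_1d = np.poly1d(np.polyfit(X[:, 0], y, deg=2))

# Multi-variable polynomial regression: expand features, then fit linearly.
poly_nd = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
).fit(X, y)

# Ridge regression with a fixed regularization strength.
ridge = Ridge(alpha=1.0).fit(X, y)

# Grid search over alpha, scored by cross-validated R-squared.
param_grid = {"alpha": [0.001, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(Ridge(), param_grid=param_grid, scoring="r2", cv=4)
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]

print("best alpha:", best_alpha, "mean CV R^2:", grid.best_score_)
```

GridSearchCV selects alpha by cross-validated score rather than by the training fit, which avoids always favoring the smallest penalty.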
- model evaluation and decision making are carried out using the following statistical methods:
- Mean Squared Error (MSE)
- R-squared
- Cross validation
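The evaluation metrics above can be computed with scikit-learn as in this minimal sketch. The data is synthetic, and the linear model and 4-fold cross-validation are assumed choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data; illustrative only.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# Mean Squared Error: average squared difference between actual and predicted values.
mse = mean_squared_error(y_test, y_pred)

# R-squared: share of the target variance explained by the model.
r2 = r2_score(y_test, y_pred)

# Cross-validation: R-squared on each of 4 folds over the full data.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=4, scoring="r2")

print("MSE:", mse)
print("R^2:", r2)
print("CV R^2 mean and std:", cv_scores.mean(), cv_scores.std())
```

A lower MSE and a higher R-squared on held-out data both point to a better model; comparing the cross-validation scores across folds shows how stable that result is.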
- Farhad Davaripour is a finite element specialist and data science enthusiast with nearly three years of experience in research and development roles. He has a knack for problem-solving and a passion for data science, and he holds the IBM Data Science Professional Certificate.
- Connect with Farhad on LinkedIn.