OVERVIEW
In this application, you will explore a dataset from kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.
The data task is to identify the key features or variables that most significantly influence used car prices. This involves conducting exploratory data analysis (EDA) to understand the relationships between these features and prices, and then applying statistical modeling techniques such as regression analysis to quantify the impact of these features on prices. The goal is to develop a predictive model that can accurately estimate the price of a used car based on its characteristics.
After considering the business understanding, we want to get familiar with our data. Write down some steps that you would take to get to know the dataset and identify any quality issues within. Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮ │ Data Summary Data Types Categories │ │ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ dataframe ┃ Values ┃ ┃ Column Type ┃ Count ┃ ┃ Categorical Variables ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩ │ │ │ Number of rows │ 426880 │ │ category │ 13 │ │ region │ │ │ │ Number of columns │ 18 │ │ int64 │ 2 │ │ manufacturer │ │ │ └───────────────────┴────────┘ │ float64 │ 2 │ │ model │ │ │ │ string │ 1 │ │ condition │ │ │ └─────────────┴───────┘ │ cylinders │ │ │ │ fuel │ │ │ │ title_status │ │ │ │ transmission │ │ │ │ drive │ │ │ │ size │ │ │ │ type │ │ │ │ paint_color │ │ │ │ state │ │ │ └───────────────────────┘ │ │ number │ │ ┏━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓ │ │ ┃ column_n ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ │ │ ┃ ame ┃ NA ┃ NA % ┃ mean ┃ sd ┃ p0 ┃ p25 ┃ p50 ┃ p75 ┃ p100 ┃ hist ┃ │ │ ┡━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩ │ │ │ id │ 0 │ 0 │ 73000000 │ 4500000 │ 72000000 │ 73000000 │ 7300000 │ 73000000 │ 7300000 │ ▇ │ │ │ │ │ │ │ 00 │ │ 00 │ 00 │ 000 │ 00 │ 000 │ │ │ │ │ price │ 0 │ 0 │ 75000 │ 12000000 │ 0 │ 5900 │ 14000 │ 26000 │ 3700000 │ ▇ │ │ │ │ │ │ │ │ │ │ │ │ │ 000 │ │ │ │ │ year │ 1205 │ 0.28 │ 2000 │ 9.5 │ 1900 │ 2000 │ 2000 │ 2000 │ 2000 │ ▇ │ │ │ │ odometer │ 4400 │ 1.03 │ 98000 │ 210000 │ 0 │ 38000 │ 86000 │ 130000 │ 1000000 │ ▇ │ │ │ │ │ │ │ │ │ │ │ │ │ 0 │ │ │ │ └──────────┴──────┴──────┴──────────┴──────────┴──────────┴──────────┴─────────┴──────────┴─────────┴────────┘ │ │ category │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ │ │ ┃ column_name ┃ NA ┃ NA % ┃ ordered ┃ unique ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ │ │ region │ 0 │ 0 │ False │ 404 │ │ │ │ manufacturer │ 17646 │ 4.13 │ False │ 43 │ │ │ │ model │ 5277 │ 1.24 │ False │ 29650 │ │ │ │ condition │ 174104 │ 40.79 │ False │ 7 │ │ │ │ cylinders │ 177678 │ 41.62 │ False │ 9 │ │ │ │ fuel │ 3013 │ 0.71 │ False │ 6 │ │ │ │ title_status │ 8242 │ 1.93 │ False │ 7 │ │ │ │ transmission │ 2556 │ 0.6 │ False │ 4 │ │ │ │ drive │ 130567 │ 30.59 │ False │ 4 │ │ │ │ size │ 306361 │ 71.77 │ False │ 5 │ │ │ │ type │ 92858 │ 21.75 │ False │ 14 │ │ │ │ paint_color │ 130203 │ 30.5 │ False │ 13 │ │ │ │ state │ 0 │ 0 │ False │ 51 │ │ │ └────────────────────────────────┴──────────────────┴────────────────┴─────────────────────┴─────────────────┘ │ │ string │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ column_name ┃ NA ┃ NA % ┃ words per row ┃ total words ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │ │ VIN │ 0 │ 0 │ 1 │ 426880 │ │ │ └───────────────────────────┴─────────┴────────────┴──────────────────────────────┴──────────────────────────┘ │ ╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
There are 13 categorical columns, and three numerical. The skewness and kurtosis factor of the price is pretty high. We can observe that the distribution of prices shows a high positive skewness to the left (skew > 1). A kurtosis value of 69205 is extremely high, meaning that there is a profusion of outliers in the dataset. We need to do something. Skewness: 254.406932 Kurtosis: 69205.088590
The data was cleaned based on the columns that are highest correlated to the price. The NAs in these columns were dropped and for the remaining columns, they were imputed.
Post cleaning, and NA imputation, from the 400K data points, we were left with ~91K data points of which 70% was used for training purposes.
Prices tend to be higher as cars are more recent, and viceversa. Outliers on Price is removed based on 1.5 times the IQR values.
Ridge regression was performed to extract coeffiecients. The best model was identified to obtain coefficients by using Gird Search on hyper parameters. Once the coefficients were established, the data was run through sequential feature selection for Lasso Regression.
From the results, it looks like the year, number of cylinders, fuel type and transmission type are the most important features for predicting car price. As as car dealer, it would be useful if you advertise year car was manufactured, the number of cylinders, transmission type and fuel.