Practical-App-11

OVERVIEW

In this application, you will explore a dataset from kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

Business Understanding

The data task is to identify the key features or variables that most significantly influence used car prices. This involves conducting exploratory data analysis (EDA) to understand the relationships between these features and prices, and then applying statistical modeling techniques such as regression analysis to quantify the impact of these features on prices. The goal is to develop a predictive model that can accurately estimate the price of a used car based on its characteristics.

Data Understanding

After considering the business understanding, we want to get familiar with our data. Write down some steps that you would take to get to know the dataset and identify any quality issues within. Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types               Categories                                        │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓                                │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃ ┃ Categorical Variables ┃                                │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩                                │
│ │ Number of rows    │ 426880 │ │ category    │ 13    │ │ region                │                                │
│ │ Number of columns │ 18     │ │ int64       │ 2     │ │ manufacturer          │                                │
│ └───────────────────┴────────┘ │ float64     │ 2     │ │ model                 │                                │
│                                │ string      │ 1     │ │ condition             │                                │
│                                └─────────────┴───────┘ │ cylinders             │                                │
│                                                        │ fuel                  │                                │
│                                                        │ title_status          │                                │
│                                                        │ transmission          │                                │
│                                                        │ drive                 │                                │
│                                                        │ size                  │                                │
│                                                        │ type                  │                                │
│                                                        │ paint_color           │                                │
│                                                        │ state                 │                                │
│                                                        └───────────────────────┘                                │
│                                                     number                                                      │
│ ┏━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_n                                                                                         ┃  │
│ ┃ ame       NA    NA %  mean      sd        p0        p25       p50      p75       p100     hist   ┃  │
│ ┡━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ id         0   073000000 450000072000000730000007300000730000007300000 │  │
│ │          │      │      │       00 │          │       00      00    000      00    000 │        │  │
│ │ price      0   0   7500012000000       0    5900  14000   260003700000 │  │
│ │          │      │      │          │          │          │          │         │          │     000 │        │  │
│ │ year    12050.28    2000     9.5    1900    2000   2000    2000   2000 │  │
│ │ odometer44001.03   98000  210000       0   38000  86000  1300001000000 │  │
│ │          │      │      │          │          │          │          │         │          │       0 │        │  │
│ └──────────┴──────┴──────┴──────────┴──────────┴──────────┴──────────┴─────────┴──────────┴─────────┴────────┘  │
│                                                    category                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                     NA                NA %            ordered              unique          ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩  │
│ │ region                                       0             0False                          404 │  │
│ │ manufacturer                             17646          4.13False                           43 │  │
│ │ model                                     5277          1.24False                        29650 │  │
│ │ condition                               174104         40.79False                            7 │  │
│ │ cylinders                               177678         41.62False                            9 │  │
│ │ fuel                                      3013          0.71False                            6 │  │
│ │ title_status                              8242          1.93False                            7 │  │
│ │ transmission                              2556           0.6False                            4 │  │
│ │ drive                                   130567         30.59False                            4 │  │
│ │ size                                    306361         71.77False                            5 │  │
│ │ type                                     92858         21.75False                           14 │  │
│ │ paint_color                             130203          30.5False                           13 │  │
│ │ state                                        0             0False                           51 │  │
│ └────────────────────────────────┴──────────────────┴────────────────┴─────────────────────┴─────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                NA       NA %        words per row                 total words              ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ VIN                            0         0                           1                  426880 │  │
│ └───────────────────────────┴─────────┴────────────┴──────────────────────────────┴──────────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

There are 13 categorical columns, and three numerical. The skewness and kurtosis factor of the price is pretty high. We can observe that the distribution of prices shows a high positive skewness to the left (skew > 1). A kurtosis value of 69205 is extremely high, meaning that there is a profusion of outliers in the dataset. We need to do something. Skewness: 254.406932 Kurtosis: 69205.088590

The data was cleaned based on the columns that are highest correlated to the price. The NAs in these columns were dropped and for the remaining columns, they were imputed.

Post cleaning, and NA imputation, from the 400K data points, we were left with ~91K data points of which 70% was used for training purposes.

Data Cleaning Critical Steps

Prices tend to be higher as cars are more recent, and viceversa. Outliers on Price is removed based on 1.5 times the IQR values.

Modelling:

Ridge regression was performed to extract coeffiecients. The best model was identified to obtain coefficients by using Gird Search on hyper parameters. Once the coefficients were established, the data was run through sequential feature selection for Lasso Regression.

Findings

From the results, it looks like the year, number of cylinders, fuel type and transmission type are the most important features for predicting car price. As as car dealer, it would be useful if you advertise year car was manufactured, the number of cylinders, transmission type and fuel.