- There are 426880 unique records of cars.
- Descriptor variables such as VIN, drive, type, size, paint color, condition and cylinders appear to have data in 30% or less of the records. These variables also do not change much for a given car, so they have been excluded from the analysis.
- Amongst the numeric columns (odometer, price, year), no features are highly correlated with price. Hence, no features are removed.
- Removed the "ID", "VIN" and "model" variables, which have high cardinality.
- There are 2 unique numeric columns in the dataset.
- Eight categorical variables - 'region', 'manufacturer', 'fuel', 'title_status', 'transmission', 'type', 'state'
- 'Year' is an ordinal feature.
- I created a column transformer to a) scale the numeric values to the range 0 to 1 and b) one-hot encode the categorical columns. This left a transformed dataframe with 532 features.
- Since the 532 columns produced by one-hot encoding took a long time to process, I dropped the columns for states where the price is less than the mean price of all cars, resulting in a dataframe with 132 features/columns.
- Prices appear to follow an increasing trend after the year 2000.
- States: California, followed by Oregon and Delaware, has the highest prices.
- Manufacturers: Toyota, Chevrolet and Mercedes-Benz have the highest sale prices.
- Lasso Regressor: Fine-tuned the Lasso model using grid search cross-validation to find the optimal parameters: alpha & max_iter (max iterations).
Train mean squared error score is 1.330. Test mean squared error score is 0.0026. Mean test score of the best estimator is -1109.4.
Given that, we can say that the model fits and predicts the data well with low variance. Cross-validation picked the best model with 'copy_X' = True & 'fit_intercept' = False.
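
The preprocessing and tuning steps described above could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the column names, the parameter grid values and the `neg_mean_squared_error` scoring are assumptions (the last one would be consistent with the negative mean test score reported above, since scikit-learn negates error metrics so that higher is better).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the car listings (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 400
cars = pd.DataFrame({
    "odometer": rng.uniform(0, 200_000, n),
    "year": rng.integers(1995, 2022, n),
    "fuel": rng.choice(["gas", "diesel", "electric"], n),
    "transmission": rng.choice(["automatic", "manual", "other"], n),
    "state": rng.choice(["ca", "or", "de", "tx"], n),
})
price = (
    30_000
    - 0.08 * cars["odometer"]
    + 400 * (cars["year"] - 1995)
    + 2_000 * (cars["transmission"] == "automatic")
    + rng.normal(0, 1_000, n)
)

numeric_cols = ["odometer", "year"]
categorical_cols = ["fuel", "transmission", "state"]

# a) scale numeric columns to [0, 1]; b) one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipeline = Pipeline([("prep", preprocess), ("lasso", Lasso())])

# Grid over alpha and max_iter (the grid values here are illustrative assumptions).
param_grid = {
    "lasso__alpha": [0.1, 1.0, 10.0],
    "lasso__max_iter": [1_000, 10_000],
}
search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error", cv=5)

X_train, X_test, y_train, y_test = train_test_split(
    cars, price, test_size=0.25, random_state=0
)
search.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, search.predict(X_train))
test_mse = mean_squared_error(y_test, search.predict(X_test))
n_features = search.best_estimator_.named_steps["prep"].transform(X_train).shape[1]

print("best params:", search.best_params_)
print("mean CV score of best estimator (negated MSE):", search.best_score_)
print("train MSE:", train_mse, "test MSE:", test_mse)
print("features after encoding:", n_features)
```

Wrapping the transformer and the model in one pipeline means the scaler and encoder are refit on each cross-validation fold, which avoids leaking information from the validation fold into the preprocessing.
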
- Lasso Regression Model 2 Result Summary:
Train mean absolute score is 1.3333. Test mean absolute score is 0.00005087. Given that, we can say that the Lasso model overfits the test data with low variance. However, cross-validation did not help here in arriving at a better model.
Applied permutation importance to score how strongly each feature predicts price.
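
As a rough illustration of permutation importance, the sketch below uses scikit-learn's `permutation_importance` on a small synthetic dataset. The feature names, the Lasso settings and the scoring choice are assumptions made for the example and are not taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): price depends on year and odometer, not on noise.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "odometer": rng.uniform(0, 1, n),
    "year": rng.uniform(0, 1, n),
    "noise": rng.uniform(0, 1, n),
})
y = 10 * X["year"] - 5 * X["odometer"] + rng.normal(0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Lasso(alpha=0.01).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure how much the score drops.
result = permutation_importance(
    model, X_test, y_test,
    n_repeats=10, random_state=0, scoring="neg_mean_squared_error",
)

importances = pd.Series(result.importances_mean, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A larger value means that shuffling the feature hurts the model's predictions more, so the feature contributes more to predicting price. The score says nothing about the direction of the effect.
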
- Customers are highly likely to buy high-priced cars in California.
- Automatic cars sell for high prices. Cars with manual or other transmissions also yield high sales, but about 4 times lower than automatic transmission cars.
- Further, amongst the fuel types, gas-fuelled cars predict high sale prices.
- Another state that predicts high sale prices is Oregon.
- Customers are highly likely to pay high prices for sedan cars.