- Introduction
- Project Overview
- Data Preparation
- Feature Engineering
- Model Building
- Model Evaluation
- Additional Models Tested
- Model Saving & Deployment
- Fundamentals
- Advanced Concepts
- Results and Conclusions
- Next Steps
- Getting Started
- References
This project aims to predict car prices using machine learning techniques, focusing primarily on the Random Forest Regressor model. We also explore Multiple Linear Regression and Linear Regression models for comparison. Feature engineering was employed to create new variables that could enhance model performance and provide more accurate predictions.
The project consists of the following key sections:
- Data Preparation: Cleaning and preprocessing the dataset.
- Feature Engineering: Creating new features to improve model accuracy.
- Model Building: Constructing and tuning the Random Forest Regressor, with comparison to other models.
- Model Evaluation: Assessing model performance using metrics such as R² Score, RMSE, and Cross-Validation.
- Model Saving & Deployment: Saving and deploying the trained model for future use.
The dataset used includes various features that impact car pricing. The key steps taken during data preparation include:
Columns:
- `car_brand`
- `car_model`
- `car_variant`
- `car_year`
- `car_engine`
- `car_transmission`
- `milage`
- `accident`
- `flood`
- `color`
- `purchase_date`
- `sales_date`
- `days_on_market` (Engineered feature)
- `car_age_at_sale` (Engineered feature)
- `price` (Target variable)

Key Steps (a minimal code sketch follows this list):
- Handling missing values and outliers.
- Encoding categorical variables.
- Splitting the data into training and testing sets.
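The sketch below illustrates these steps, assuming the data lives in a CSV file; the file name, outlier rule, and split ratio are illustrative assumptions, not the project's exact choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative sketch: the file name, outlier rule, and split ratio are assumptions.
df = pd.read_csv('car_sales.csv')                   # hypothetical file name
df = df.dropna()                                    # drop rows with missing values
df = df[df['price'] < df['price'].quantile(0.99)]   # trim extreme price outliers

# Separate the features from the target variable
X = df.drop(columns=['price'])
y = df['price']

# Categorical columns are encoded later, inside the pipeline (OneHotEncoder),
# so here the data is only split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```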
To boost the model's predictive capability, two new features were engineered:
- `days_on_market`: Represents the number of days a car was listed for sale.
- `car_age_at_sale`: Represents the car's age at the time of sale.
These features are designed to capture additional dimensions that might affect the car's selling price.
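Continuing the preparation sketch above, a short example of how these two features could be derived with pandas, assuming `days_on_market` is the gap between `purchase_date` and `sales_date` and that `car_year` holds the model year:

```python
import pandas as pd

# Parse the date columns, then derive the two engineered features.
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['sales_date'] = pd.to_datetime(df['sales_date'])

# Days the car was listed before it sold
df['days_on_market'] = (df['sales_date'] - df['purchase_date']).dt.days

# Age of the car (in years) at the time of sale
df['car_age_at_sale'] = df['sales_date'].dt.year - df['car_year']
```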
The Random Forest Regressor is the primary model used, integrated into a pipeline for streamlined processing.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# numerical_features and categorical_features are the lists of column names
# identified during data preparation.

# Define the transformer for feature processing
transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='drop'
)

# Initialize and build the pipeline
rf_model = Pipeline([
    ('ColumnTransformer', transformer),
    ('Model', RandomForestRegressor())
])
```
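As a usage sketch (assuming the `X_train`/`X_test` split from the data preparation step), fitting the pipeline runs the preprocessing and trains the forest in a single call:

```python
# Fit the preprocessing steps and the Random Forest together, then predict.
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
```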
Models were evaluated using the following metrics:
- R² Score: Indicates how well the model explains the variance in the target variable.
- RMSE (Root Mean Squared Error): Measures the average magnitude of prediction errors.
- Cross-Validation: Assesses model performance on unseen data.
Final Evaluation Metrics for Random Forest Regressor:
- Test R²: 0.8551
- Test RMSE: 11,448.49
- Cross-Validated RMSE: 12,661.31
- Cross-Validated R²: 0.8081
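The snippet below is a minimal sketch of how these metrics can be computed with scikit-learn, assuming the train/test split and the fitted `rf_model` from the previous sections; the fold count is an assumption.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

# Hold-out metrics on the test set
y_pred = rf_model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# 5-fold cross-validation on the training data (fold count is an assumption)
cv_rmse = -cross_val_score(rf_model, X_train, y_train, cv=5,
                           scoring='neg_root_mean_squared_error').mean()
cv_r2 = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='r2').mean()
```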
For comparative analysis, the following models were also tested:
- Multiple Linear Regression: Used to understand linear relationships between features and the target variable.
- Linear Regression: Served as a baseline model for comparison with more complex models.
```python
from sklearn.linear_model import LinearRegression

# Define the pipeline for Multiple Linear Regression
mlr_model = Pipeline([
    ('ColumnTransformer', transformer),
    ('Model', LinearRegression())
])
```
The trained Random Forest Regressor model is saved using Python's `pickle` module, allowing for easy future use and deployment.
```python
import pickle

# Save the pipeline and model
with open('rf_model_pipeline.pkl', 'wb') as file:
    pickle.dump(rf_model, file)

# Load the saved model
with open('rf_model_pipeline.pkl', 'rb') as file:
    loaded_pipeline = pickle.load(file)
```
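As a usage sketch, the loaded pipeline can price a new listing directly. The row below is hypothetical and its values are purely illustrative; it assumes the pipeline was trained on the feature columns listed earlier, with the raw date columns replaced by the engineered features.

```python
import pandas as pd

# Hypothetical listing; values are illustrative only.
new_car = pd.DataFrame([{
    'car_brand': 'Toyota', 'car_model': 'Vios', 'car_variant': '1.5 G',
    'car_year': 2018, 'car_engine': 1.5, 'car_transmission': 'Automatic',
    'milage': 45000, 'accident': 'No', 'flood': 'No', 'color': 'White',
    'days_on_market': 30, 'car_age_at_sale': 5
}])

predicted_price = loaded_pipeline.predict(new_car)
```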
Regression models are used for predicting a continuous outcome variable based on one or more predictor variables. In this project, regression models like Random Forest and Linear Regression are utilized to predict car prices.
Feature engineering involves creating new input features from existing ones to improve model performance. In this project, engineered features like `days_on_market` and `car_age_at_sale` help capture additional information that may influence car prices.
- R² Score: A statistical measure that indicates how well the regression predictions approximate the real data points.
- RMSE: A metric that measures the average magnitude of the error, giving an idea of the prediction accuracy of the regression model.
A Random Forest Regressor is an ensemble learning method that builds multiple decision trees and merges their results to improve predictive accuracy and control overfitting. It's robust against noise and provides better performance compared to individual decision trees.
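As an illustration of how the ensemble can be configured, the sketch below shows common hyperparameters; these values are assumptions for demonstration, not the settings used in this project.

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative configuration only; values are not the project's tuned settings.
rf = RandomForestRegressor(
    n_estimators=200,     # number of trees averaged in the ensemble
    max_depth=None,       # let trees grow fully (scikit-learn default)
    min_samples_leaf=2,   # require a few samples per leaf to curb overfitting
    random_state=42,      # reproducible results
    n_jobs=-1             # build trees in parallel
)
```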
Pipelines are used in Scikit-Learn to automate the workflow of machine learning models, allowing for seamless data preprocessing and model training steps. In this project, the pipeline incorporates data transformation steps and the Random Forest model, ensuring streamlined and repeatable processes.
Pickle is a Python module used to serialize and deserialize Python objects. It allows saving the trained model, so it can be loaded and used for future predictions without retraining.
The Random Forest Regressor demonstrated strong performance (test R² of 0.8551 and RMSE of 11,448.49), making it an effective tool for predicting car prices. The comparative analysis with Multiple Linear Regression and Linear Regression models provided valuable insights into the relative performance of different approaches.
- Deployment: Explore deployment options using platforms such as AWS or Azure.
- Model Interpretability: Implement methods like SHAP or LIME to understand feature importance and make the model more interpretable.
To run this project locally:
- Clone this repository.
- Install the required dependencies using `pip install -r requirements.txt`.
- Execute the provided Jupyter notebook or Python scripts to run the analysis.
- Use the saved model for predictions or further analysis.
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Matplotlib Documentation](https://matplotlib.org/stable/)
- [SHAP Documentation](https://shap.readthedocs.io/en/latest/)