/Car_Price_Prediction_Project

This repository is dedicated to developing a model for predicting car prices in Malaysia, utilizing various data science techniques and machine learning algorithms.

Primary Language: Jupyter Notebook

🚗 Predicting Car Prices with Random Forest Regressor


📋 Table of Contents

  1. Introduction
  2. Project Overview
  3. Data Preparation
  4. Feature Engineering
  5. Model Building
  6. Model Evaluation
  7. Additional Models Tested
  8. Model Saving & Deployment
  9. Fundamentals
  10. Advanced Concepts
  11. Results and Conclusions
  12. Next Steps
  13. Getting Started
  14. References

1. Introduction 🚀

This project aims to predict car prices using machine learning techniques, focusing primarily on the Random Forest Regressor model. We also explore Multiple Linear Regression and Linear Regression models for comparison. Feature engineering was employed to create new variables that could enhance model performance and provide more accurate predictions.

2. Project Overview 🔍

The project consists of the following key sections:

  • 🧹 Data Preparation: Cleaning and preprocessing the dataset.
  • 🔧 Feature Engineering: Creating new features to improve model accuracy.
  • 🤖 Model Building: Constructing and tuning the Random Forest Regressor, with comparison to other models.
  • 📊 Model Evaluation: Assessing model performance using metrics such as R² Score, RMSE, and Cross-Validation.
  • 💾 Model Saving & Deployment: Saving and deploying the trained model for future use.

🔧 Technologies Used

Machine Learning with Python, Scikit-Learn, and Pickle


3. Data Preparation 🛠️

The dataset includes features that affect car pricing. Its columns and the key preparation steps are listed below:

  • Columns:

    • car_brand
    • car_model
    • car_variant
    • car_year
    • car_engine
    • car_transmission
    • milage
    • accident
    • flood
    • color
    • purchase_date
    • sales_date
    • days_on_market (Engineered feature)
    • car_age_at_sale (Engineered feature)
    • price (Target variable)
  • Key Steps:

    • Handling missing values and outliers.
    • Encoding categorical variables.
    • Splitting the data into training and testing sets.
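The preparation steps above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the median fill, the 1.5 × IQR outlier rule, and the 80/20 split are all assumptions chosen for the example. Categorical encoding is left to the `OneHotEncoder` inside the model pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows using column names from the list above; the values are made up.
df = pd.DataFrame({
    "car_brand": ["Perodua", "Proton", "Toyota", "Honda", "Proton", "Perodua"],
    "car_year": [2015, 2018, 2012, 2020, 2016, 2019],
    "milage": [80000, np.nan, 150000, 20000, 95000, 40000],
    "price": [25000, 38000, 30000, 90000, 28000, 900000],
})

# Assumption: fill missing numeric values with the column median.
df["milage"] = df["milage"].fillna(df["milage"].median())

# Assumption: drop rows whose price lies more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Hold out 20% of the remaining rows for testing.
X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```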

4. Feature Engineering 🛠️

To boost the model's predictive capability, two new features were engineered:

  • days_on_market: Represents the number of days a car was listed for sale.
  • car_age_at_sale: Represents the car's age at the time of sale.

These features are designed to capture additional dimensions that might affect the car's selling price.
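A hedged sketch of how these two features might be derived with pandas. The exact formulas are assumptions: here `days_on_market` is taken as the number of days between `purchase_date` and `sales_date`, and `car_age_at_sale` as the sale year minus `car_year`. The notebook may define them differently.

```python
import pandas as pd

# Two made-up rows using the column names from the dataset description.
df = pd.DataFrame({
    "car_year": [2015, 2018],
    "purchase_date": ["2023-01-10", "2023-03-01"],
    "sales_date": ["2023-02-09", "2023-03-15"],
})
df["purchase_date"] = pd.to_datetime(df["purchase_date"])
df["sales_date"] = pd.to_datetime(df["sales_date"])

# Assumption: time on market is the gap between purchase_date and sales_date.
df["days_on_market"] = (df["sales_date"] - df["purchase_date"]).dt.days

# Assumption: age at sale is the sale year minus the manufacturing year.
df["car_age_at_sale"] = df["sales_date"].dt.year - df["car_year"]
```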

5. Model Building 🧠

Random Forest Regressor 🌳

The Random Forest Regressor is the primary model used, integrated into a pipeline for streamlined processing.

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

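# numerical_features and categorical_features are lists of column names,
# assumed to be defined earlier in the notebook.
# Define the transformer for feature processing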
transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='drop'
)

# Initialize and build the pipeline
rf_model = Pipeline([
    ('ColumnTransformer', transformer),
    ('Model', RandomForestRegressor())
])
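For illustration, here is a self-contained sketch of fitting the same kind of pipeline on a few made-up rows and predicting a price for a new car. The feature lists, the toy values, and the `n_estimators` and `random_state` settings are hypothetical; the notebook's own choices may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature split; the notebook's actual lists may differ.
numerical_features = ["car_year", "milage"]
categorical_features = ["car_brand", "car_transmission"]

# Made-up training rows.
train = pd.DataFrame({
    "car_brand": ["Perodua", "Proton", "Toyota", "Honda", "Proton", "Perodua"],
    "car_transmission": ["auto", "manual", "auto", "auto", "auto", "manual"],
    "car_year": [2015, 2018, 2012, 2020, 2016, 2019],
    "milage": [80000, 45000, 150000, 20000, 95000, 40000],
    "color": ["white", "red", "silver", "black", "blue", "white"],
    "price": [25000, 38000, 30000, 90000, 28000, 33000],
})
X_train = train.drop(columns="price")
y_train = train["price"]

transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ],
    remainder="drop",  # "color" is not listed above, so it is discarded
)

rf_model = Pipeline([
    ("ColumnTransformer", transformer),
    ("Model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
rf_model.fit(X_train, y_train)

# A brand unseen during training is encoded as all zeros
# because of handle_unknown="ignore", instead of raising an error.
new_car = pd.DataFrame({
    "car_brand": ["Mazda"],
    "car_transmission": ["auto"],
    "car_year": [2017],
    "milage": [60000],
    "color": ["grey"],
})
predicted_price = rf_model.predict(new_car)
```

Because preprocessing lives inside the pipeline, the scaler and encoder are fitted only on the training data and reapplied unchanged at prediction time.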

6. Model Evaluation

Models were evaluated using the following metrics:

  • R² Score: Indicates how much of the variance in the target variable the model explains.
  • RMSE (Root Mean Squared Error): Measures the typical magnitude of prediction errors, in the same units as the price.
  • Cross-Validation: Estimates how well the model generalizes by repeatedly training on part of the data and scoring on the held-out remainder.

Final Evaluation Metrics for Random Forest Regressor:

  • Test R²: 0.8551
  • Test RMSE: 11,448.49
  • Cross-Validated RMSE: 12,661.31
  • Cross-Validated R²: 0.8081
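Metrics of this kind could be computed along the following lines. This is only a sketch on synthetic data from `make_regression`, so its numbers will not match the figures reported above; the 5-fold setting and the model hyperparameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the car dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Hold-out metrics on the test split.
test_r2 = r2_score(y_test, y_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Cross-validated metrics on the training split (assumed 5 folds).
# scikit-learn reports RMSE as a negated score, so flip the sign back.
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
cv_rmse = -cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)

print(f"Test R2: {test_r2:.4f}, Test RMSE: {test_rmse:.2f}")
print(f"CV R2: {cv_r2.mean():.4f}, CV RMSE: {cv_rmse.mean():.2f}")
```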

7. Additional Models Tested 🧪

For comparative analysis, the following models were also tested:

  • Multiple Linear Regression: Used to understand linear relationships between features and the target variable.
  • Linear Regression: Served as a baseline model for comparison with more complex models.

from sklearn.linear_model import LinearRegression

# Define the pipeline for Multiple Linear Regression
mlr_model = Pipeline([
    ('ColumnTransformer', transformer),
    ('Model', LinearRegression())
])

8. Model Saving & Deployment 💾

The trained Random Forest Regressor model is saved using Python's pickle module, allowing for easy future use and deployment.

import pickle

# Save the pipeline and model
with open('rf_model_pipeline.pkl', 'wb') as file:
    pickle.dump(rf_model, file)

# Load the saved model
with open('rf_model_pipeline.pkl', 'rb') as file:
    loaded_pipeline = pickle.load(file)

9. Fundamentals 🧩

Understanding Regression Models:

Regression models are used for predicting a continuous outcome variable based on one or more predictor variables. In this project, regression models like Random Forest and Linear Regression are utilized to predict car prices.

Feature Engineering:

Feature engineering involves creating new input features from existing ones to improve model performance. In this project, engineered features like days_on_market and car_age_at_sale help capture additional information that may influence car prices.

Model Evaluation Metrics:

  • R² Score: The proportion of variance in the target that the model's predictions explain. A value of 1.0 is a perfect fit, and 0 is no better than always predicting the mean.
  • RMSE: The square root of the mean squared prediction error, expressed in the same units as the target (here, the car price).
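To make the two definitions concrete, here is a small worked example in plain Python. The function names and the four data points are made up for illustration; in practice the equivalent scikit-learn functions would normally be used.

```python
import math

def r2_score_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse_manual(y_true, y_pred):
    """RMSE = square root of the mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

r2 = r2_score_manual(y_true, y_pred)   # 1 - 2/20 = 0.9
rmse = rmse_manual(y_true, y_pred)     # sqrt(2/4), about 0.707
```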

10. Advanced Concepts 🚀

Random Forest Regressor:

A Random Forest Regressor is an ensemble learning method that trains multiple decision trees on bootstrapped samples of the data and averages their predictions to improve accuracy and reduce overfitting. It's robust against noise and typically generalizes better than a single decision tree.
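The averaging step can be checked directly in scikit-learn: the forest's prediction equals the mean of its individual trees' predictions. The sketch below uses synthetic data and an arbitrary forest size purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the car dataset.
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Predict with every fitted tree, then average across trees.
sample = X[:5]
tree_preds = np.array([tree.predict(sample) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

# The forest's own prediction should match the manual average.
forest_pred = forest.predict(sample)
print(np.allclose(manual_average, forest_pred))
```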

Pipeline in Scikit-Learn:

Pipelines are used in Scikit-Learn to automate the workflow of machine learning models, allowing for seamless data preprocessing and model training steps. In this project, the pipeline incorporates data transformation steps and the Random Forest model, ensuring streamlined and repeatable processes.

Model Deployment with Pickle:

Pickle is a Python module used to serialize and deserialize Python objects. It allows saving the trained model, so it can be loaded and used for future predictions without retraining.
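As a minimal round-trip sketch, the example below serializes a small fitted model to bytes and restores it, then confirms that both copies predict the same values. The tiny `LinearRegression` model and its data are stand-ins for the project's pipeline.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; the project pickles its full pipeline instead.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Serialize to bytes in memory, then deserialize into a new object.
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model reproduces the original predictions without retraining.
same_predictions = np.allclose(model.predict(X), restored.predict(X))
```

Two practical cautions: only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it where possible.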


11. Results and Conclusions 🏁

The Random Forest Regressor performed well, reaching a test R² of 0.8551 and a test RMSE of 11,448.49 (cross-validated: R² 0.8081, RMSE 12,661.31), which makes it an effective tool for predicting car prices. Comparing it with the Multiple Linear Regression and Linear Regression models showed how the different approaches perform relative to one another.

12. Next Steps ⏭️

  • Deployment: Explore deployment options using platforms such as AWS or Azure.
  • Model Interpretability: Implement methods like SHAP or LIME to understand feature importance and make the model more interpretable.

13. Getting Started 🛠️

To run this project locally:

  1. Clone this repository.
  2. Install the required dependencies using pip install -r requirements.txt.
  3. Execute the provided Jupyter notebook or Python scripts to run the analysis.
  4. Use the saved model for predictions or further analysis.

14. References 📚

ocs.io/en/latest/)