This school project predicts the price of second-hand cars using machine learning. It required skills in machine learning (linear regression, NLP), data cleaning, and feature engineering.

Files:
- Main files are at the root of the repo
- EDA is in the `notebook` folder
- `autopluspy` is a custom Python library made for this project

Installation:
- git clone the project
- Create a virtualenv: `virtualenv -p python3 venv`
- Install dependencies: `pip install -r requirements.txt`
- Put the initial dataset into the `/data` folder
- Run the Jupyter notebook Runbook (available at the root of the repo) to launch the whole system. Uncomment the last cell if you want to start the Streamlit app

Input:
- Initial dataset
- Optionally, a new dataset

Process:
- Spot and remove duplicated content (rows and columns)
- Spot and remove missing values
- Adapt data types (categorical, numerical, datetime, string)
- Provide insight into the unique values of each categorical variable
- Provide insight into each numerical variable (`.describe()`)
- Get dummies of the categorical variables (one-hot encoding) and update the Data Dictionary
- Compute the age of the car (Online - Model Year)
- Apply a count vectorizer to 'Options:'
- Scrape AutoPlus and fuzzy match
- Use Data Mapper
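
The cleaning steps above can be sketched with pandas. This is a minimal illustration, not the `autopluspy` implementation: the toy data and every column name except the idea of "Online - Model Year" are assumptions made for the example.

```python
import pandas as pd

# Toy listing data; all column names and values are illustrative assumptions.
raw = pd.DataFrame({
    "Make": ["Renault", "Renault", "Peugeot", "Audi"],
    "Fuel": ["Diesel", "Diesel", "Petrol", None],
    "Model_year": ["2010", "2010", "2015", "2018"],
    "Online": ["2021-03-01", "2021-03-01", "2021-04-15", "2021-05-20"],
    "Price": [5500.0, 5500.0, 9900.0, 21000.0],
})
raw["Price_copy"] = raw["Price"]  # a duplicated column to clean up


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Remove duplicated rows, then duplicated columns (compared by content).
    df = df.drop_duplicates()
    df = df.loc[:, ~df.T.duplicated()]
    # Remove rows with missing values.
    df = df.dropna()
    # Adapt data types: categorical, numerical, datetime.
    df = df.astype({"Make": "category", "Fuel": "category", "Model_year": "int64"})
    df["Online"] = pd.to_datetime(df["Online"])
    # Age of the car: year the ad went online minus the model year.
    df["Age"] = df["Online"].dt.year - df["Model_year"]
    # One-hot encode the categorical variables.
    return pd.get_dummies(df, columns=["Make", "Fuel"])


processed = preprocess(raw)
print(processed)
```

Dropping rows with missing values is the simplest option; on a real dataset it may discard too much data, in which case imputation is worth considering.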

Output:
- Processed dataset
- Data Dictionary
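
For the fuzzy-match step in the process list, here is a small sketch using the standard library's `difflib`. The reference names below are hypothetical stand-ins for whatever is scraped from AutoPlus, and the project may well use a different matching library or threshold.

```python
from difflib import get_close_matches

# Hypothetical reference names, standing in for names scraped from AutoPlus.
reference_models = ["Renault Clio", "Peugeot 208", "Volkswagen Golf", "Citroen C3"]


def fuzzy_match(name, choices, cutoff=0.6):
    """Return the closest reference name, or None if nothing is similar enough."""
    # Compare in lowercase, but return the original spelling of the choice.
    lowered = {choice.lower(): choice for choice in choices}
    hits = get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


for raw_name in ["renault cilo", "Peugot 208", "Tesla Model 3"]:
    print(raw_name, "->", fuzzy_match(raw_name, reference_models))
```

The `cutoff` value trades recall for precision: a higher value rejects more misspellings but also avoids wrong matches.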
Input:
- Dataset
- Data Dictionary

Process:
- Learn object:
  - Original dataset: `dataset.original`, returns a df
  - Train split: `dataset.train_set`, returns df `X_train`, `y_train`
  - Test split: `dataset.test_set`, returns df `X_test`, `y_test`
  - Data Dictionary: `dataset.data_dictionary`, returns a df
- Analyze target variable distribution
- Normalization of numerical features
- Analyze features variance
- Multicollinearity handling (see https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-multicollinear-py)
- Feature selection
- Cross-validation (CV)
- Grid search and select best score based on CV results
- Results on test set
- [ ] Train on the full learn_set
- SHAP/LIME/permutation_importance interpretation
- Performance metric: MAPE (mean absolute percentage error)
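
One possible shape for the Learn object listed at the top of this process is sketched below. Only the attribute names `original`, `train_set`, `test_set` and `data_dictionary` come from the list; the class name, the `target`, `test_size` and `seed` fields, and the splitting logic are assumptions made for illustration.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Learn:
    """Hypothetical container exposing the attributes listed in the process."""
    original: pd.DataFrame
    data_dictionary: pd.DataFrame
    target: str = "Price"      # assumed name of the target column
    test_size: float = 0.25    # assumed share of rows kept for testing
    seed: int = 0

    def _split(self):
        # Shuffle with a fixed seed so both properties see the same split.
        shuffled = self.original.sample(frac=1.0, random_state=self.seed)
        n_test = int(round(len(shuffled) * self.test_size))
        return shuffled.iloc[n_test:], shuffled.iloc[:n_test]

    @property
    def train_set(self):
        train, _ = self._split()
        return train.drop(columns=self.target), train[self.target]

    @property
    def test_set(self):
        _, test = self._split()
        return test.drop(columns=self.target), test[self.target]


df = pd.DataFrame({"Age": range(8), "Price": [20000 - 1500 * a for a in range(8)]})
dictionary = pd.DataFrame({"feature": ["Age"], "type": ["numerical"]})
dataset = Learn(original=df, data_dictionary=dictionary)
X_train, y_train = dataset.train_set
X_test, y_test = dataset.test_set
print(len(X_train), len(X_test))
```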
Output:
- Regression Model
- Std model
- Features needed for prediction with possible value
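
The normalization, cross-validation, grid-search and MAPE steps of the process can be combined in a single scikit-learn pipeline. The sketch below runs on synthetic data and uses `Ridge` as one example of a linear regression; the features, the parameter grid and the choice of estimator are assumptions, not the project's actual configuration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): price falls with age and mileage, plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(0, 15, size=300)
mileage = rng.uniform(5_000, 200_000, size=300)
price = 25_000 - 900 * age - 0.04 * mileage + rng.normal(0, 500, size=300)
X = np.column_stack([age, mileage])

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Scaling sits inside the pipeline so each CV fold is normalized
# using only its own training part.
pipeline = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_percentage_error",
    cv=5,
)
search.fit(X_train, y_train)  # refits the best model on the whole train split

# Final check on the held-out test set.
test_mape = mean_absolute_percentage_error(y_test, search.predict(X_test))
print(search.best_params_, f"test MAPE = {test_mape:.3f}")
```

Note that scikit-learn returns MAPE as a fraction (0.05 means 5%), and that the scorer is negated because `GridSearchCV` always maximizes its score.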
Input:
- Data Dictionary
- Features list needed for the prediction

Interaction:
- Form
- Display the prediction and the price tuning range
- Show how this car's price compares with other cars' prices (good deal or not)
## User input : text
Model_year = st.text_input('Model_year', '2010')
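
To illustrate the "price tuning range" and "good deal or not" items of the interaction list, here is a plain-Python sketch that can be called from the app once a prediction is available. Using the model's MAPE as the width of the range is an assumption, and the function names and labels are hypothetical.

```python
def price_range(predicted_price, mape):
    """Price band around the prediction, using MAPE as a relative error."""
    return predicted_price * (1 - mape), predicted_price * (1 + mape)


def rate_deal(listed_price, predicted_price, mape):
    """Compare a listed price with the predicted price band."""
    low, high = price_range(predicted_price, mape)
    if listed_price < low:
        return "good deal"
    if listed_price > high:
        return "overpriced"
    return "fair price"


low, high = price_range(10_000, 0.10)
print(f"expected range: {low:.0f} to {high:.0f}")
print(rate_deal(8_500, 10_000, 0.10))
```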