This school project is about predicting second-hand car prices using machine learning. It required skills in machine learning (linear regression, NLP), data cleaning, and feature engineering.
Files:
- Main files are at the root of the repo
- EDA is in the folder
- autopluspy is a custom Python library made for this project
- git clone the project
- create a virtualenv
virtualenv -p python3 venv
- Install dependencies
pip install -r requirements.txt
- Put the initial dataset into folder
- Run the Jupyter notebook Runbook (available at the root of the repo) to launch the whole system. Uncomment the last cell if you want to start the Streamlit app.
- Initial dataset
- Optionally, a new dataset
Spot and remove duplicated content (rows and columns)
Spot and remove missing values
Adapt data types (categorical, numerical, datetime, string)
Provide insight about the unique values of each categorical variable
Provide insight about each numerical variable (.describe())
One-hot encode the categorical variables (get dummies) and update the Data Dictionary
Compute the age of the car (Online - Model Year)
Apply a count vectorizer to 'Options:'
Scrape AutoPlus and fuzzy match
Use Data Mapper
- Processed dataset
- Data Dictionary
- Dataset
- Data Dictionary
- Learn object:
  - Original dataset as a df: dataset.original, returns df
  - Train split: dataset.train_set, returns df X_train, y_train
  - Test split: dataset.test_set, returns df X_test, y_test
  - Data Dictionary: dataset.data_dictionary, returns df
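
One way such a Learn object could be laid out is sketched below. The class name `LearnDataset`, its constructor arguments and the `Price` target are hypothetical; only the four attribute names come from the list above. In this sketch `y_train` and `y_test` are pandas Series rather than DataFrames.

```python
from dataclasses import dataclass

import pandas as pd
from sklearn.model_selection import train_test_split


@dataclass
class LearnDataset:
    """Hypothetical container exposing the attributes listed above."""
    original: pd.DataFrame
    data_dictionary: pd.DataFrame
    target: str = "Price"      # assumed name of the target column
    test_size: float = 0.2
    random_state: int = 0

    def __post_init__(self):
        # Split features and target once, at construction time.
        X = self.original.drop(columns=[self.target])
        y = self.original[self.target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state
        )
        self.train_set = (X_train, y_train)
        self.test_set = (X_test, y_test)


# Usage on a tiny made-up dataset.
df = pd.DataFrame({"Age": range(10), "Price": [20000 - 1500 * a for a in range(10)]})
dictionary = pd.DataFrame({"column": df.columns, "dtype": [str(t) for t in df.dtypes]})
dataset = LearnDataset(original=df, data_dictionary=dictionary)
X_train, y_train = dataset.train_set
X_test, y_test = dataset.test_set
```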
- Analyze target variable distribution
- Normalization of numerical features
- Analyze features variance
- Multicollinearity handling
- Feature selection
- CV
- Grid search and select best score based on CV results
- Results on test set
- [ ] Train on full learn_set
- SHAP/LIME/permutation_importance interpretation
- Performance metrics : MAPE
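
The modeling steps above could look roughly like the following scikit-learn sketch. It runs on synthetic data, not the project's dataset, and several choices are assumptions made for illustration: Ridge is swapped in for plain linear regression as one possible way to dampen multicollinearity, `SelectKBest` stands in for the feature selection step, and only permutation importance is shown for interpretation (no SHAP or LIME). MAPE is returned by scikit-learn as a fraction, not a percentage.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed dataset.
X, y = make_regression(
    n_samples=300, n_features=12, n_informative=5, noise=10.0, random_state=0
)
y = y + 2000.0  # keep the target away from zero so MAPE is well behaved

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipeline = Pipeline([
    ("variance", VarianceThreshold()),        # drop zero-variance features
    ("scale", StandardScaler()),              # normalize numerical features
    ("select", SelectKBest(f_regression)),    # univariate feature selection
    ("model", Ridge()),                       # regularized linear regression
])

# Grid search scored by (negated) MAPE under 5-fold cross-validation.
param_grid = {"select__k": [4, 8, 12], "model__alpha": [0.1, 1.0, 10.0]}
search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_mean_absolute_percentage_error"
)
search.fit(X_train, y_train)

# Results on the held-out test set.
test_mape = mean_absolute_percentage_error(y_test, search.predict(X_test))
print("best params:", search.best_params_)
print("test MAPE:", test_mape)

# Model-agnostic interpretation of the selected pipeline.
importances = permutation_importance(
    search.best_estimator_, X_test, y_test, n_repeats=5, random_state=0
)
print("mean importances:", importances.importances_mean.round(3))
```

`GridSearchCV` refits the best configuration on the whole training split by default, so `search.predict` already uses the selected model.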
- Regression Model
- Std model
- Features needed for prediction, with their possible values
- Data Dictionary
- Features list needed for the prediction
Interaction:
- form
- display the prediction and the price tuning range
- show how this car's price compares with other cars' prices (good deal or not)
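
The logic behind the price range and the good-deal indicator might be sketched as below, independently of the Streamlit front end. Everything here is an assumption made for illustration: the helper names, the idea that the Std model yields a standard deviation around the predicted price, and the rule that a listing below the range counts as a good deal.

```python
from dataclasses import dataclass


@dataclass
class PriceEstimate:
    predicted: float
    low: float
    high: float


def price_range(predicted, std, k=1.0):
    """Return the prediction with a band of +/- k standard deviations."""
    low = max(predicted - k * std, 0.0)   # a price cannot be negative
    return PriceEstimate(predicted, low, predicted + k * std)


def deal_label(listed_price, estimate):
    """Compare a listed price with the estimated range."""
    if listed_price < estimate.low:
        return "good deal"
    if listed_price > estimate.high:
        return "overpriced"
    return "fair price"


# Usage with made-up numbers.
estimate = price_range(predicted=10000.0, std=800.0)
for listed in (8900.0, 10000.0, 11500.0):
    print(listed, deal_label(listed, estimate))
```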
## User input: text
import streamlit as st
Model_year = st.text_input('Model_year', '2010')