- Python3
- Sklearn
- Matplotlib
- Pandas
- Catboost, xgboost, lightgbm
- tqdm
- model.py : Python script that trains the CatBoost model and generates the result CSV.
- finalsub.csv : The submission file.
- main.ipynb : Notebook version of model.py.
- download.py : Download the data
- zipfilee_FILES/ds_data : Directory where the train and test CSV files are stored.
- EDA.ipynb : Exploratory Data Analysis of the data.
- oversampling : Experiments on Oversampling.
- experimental_model_nb dir : Notebooks with experiments on the data.
- catboost_info : Model storage and training graphs (TensorBoard files).
- Images : Images from the EDA used in this README.
- Run `python3 download.py` on a terminal/bash. It downloads the data. Extract and move `test.csv` and `train.csv` into `zipfilee_FILES/ds_data`.
- Run `model.py` with `python3 model.py`. It will print the results. Uncomment the last lines to generate `result.csv` for the test data.
- Run the remaining .ipynb files the same way, using `jupyter notebook --allow-root`.
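A minimal sketch of how the data could be loaded once the files are in place. The paths match the directory layout above; the label column name `target` is a placeholder assumption, not taken from the actual train.csv.

```python
import pandas as pd

# Paths follow the directory layout described above.
train = pd.read_csv("zipfilee_FILES/ds_data/train.csv")
test = pd.read_csv("zipfilee_FILES/ds_data/test.csv")

# "target" is a placeholder label-column name; replace with the real one.
X_train = train.drop(columns=["target"])
y_train = train["target"]

print(X_train.shape, test.shape)
```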
- Feature importance
- Box plots to see feature variance
- Distribution of target values
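A rough sketch of the EDA plots listed above, assuming the `train` DataFrame from the loading sketch and the placeholder `target` column name:

```python
import matplotlib.pyplot as plt

# Distribution of target values (highlights the class imbalance).
train["target"].value_counts().plot(kind="bar")
plt.title("Target distribution")
plt.show()

# Box plots to eyeball feature variance (first 10 features as an example).
feature_cols = train.drop(columns=["target"]).columns[:10]
train[feature_cols].boxplot(rot=90)
plt.title("Feature variance (box plots)")
plt.show()

# Feature importance comes from a fitted model, e.g. the CatBoost model below:
# importances = model.get_feature_importance()
```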
B. 20000 iterations with `learning_rate=0.3` (best)
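A hedged sketch of the CatBoost setup implied by these settings (20000 iterations, `learning_rate=0.3`), continuing from the loading sketch above. The eval metric and verbosity are illustrative assumptions, not a copy of model.py.

```python
from catboost import CatBoostClassifier

# iterations and learning_rate come from the experiments above;
# eval_metric and verbose are illustrative assumptions.
model = CatBoostClassifier(
    iterations=20000,
    learning_rate=0.3,
    eval_metric="Accuracy",
    verbose=1000,
)

model.fit(X_train, y_train)
preds = model.predict(test)
```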
Briefly describe the conceptual approach you chose! What are the trade-offs?
- First, Exploratory Data Analysis: exploring both the test and train files, getting the most important features out of the 55-feature set and examining them thoroughly, and viewing the number of unique values to decide how to deal with the missing values.
- Training different models and using an ensemble of various algorithms; the experimental_model_nb directory has all the experimental results. The models I used are MLPClassifier, KNeighborsClassifier, SVC, GaussianProcessClassifier, DecisionTreeClassifier, AdaBoostClassifier and ExtraTreesClassifier.
- Every ensemble and every individual algorithm gave ~0.96 accuracy, which is misleading because of the class-imbalance issue: class 1 makes up roughly 0.3% of the whole train data.
- Then I tried oversampling, undersampling and ensemble sampling (see the sketch after this list), which removed the class imbalance, but the models, including the ensembles, then had poor accuracy.
- Next I tried CatBoost, LightGBM and XGBoost, of which CatBoost had the best accuracy.
- The main trade-off is computational time: roughly 10 minutes on a P100 GPU.
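For reference, a minimal random-oversampling sketch using sklearn's `resample`; this illustrates the idea rather than reproducing the code in the oversampling experiments, and it assumes binary 0/1 labels in the placeholder `target` column.

```python
import pandas as pd
from sklearn.utils import resample

# Split the training frame by class; class 1 is the rare one (~0.3% of rows).
majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Naive random oversampling of the minority class up to the majority size.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

train_balanced = pd.concat([majority, minority_upsampled])
print(train_balanced["target"].value_counts())
```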
What's the model performance? What is the complexity? Where are the bottlenecks?
- The model is CatBoost, a fast, scalable and high-performance gradient-boosting library on decision trees. Since it has GPU support, training was very fast as well.
- Metrics
- On a standard laptop, training for 20k iterations on CPU takes 10-15 minutes and reaches about 95% accuracy.
- On GPU, training for 20k iterations takes 4-6 minutes (see the timing sketch after this list). Prediction time is under ~0.1 s on the test set.
- Bottleneck: transferring the same setup to a small dataset; training a small dataset with the same parameters is slower than LightGBM and XGBoost.
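A small sketch of the GPU switch and a prediction-time measurement, continuing from the CatBoost sketch above. `task_type` and `devices` are real CatBoostClassifier parameters; the surrounding code is illustrative.

```python
import time

# To train on GPU, add task_type="GPU" (and optionally devices="0")
# to the CatBoostClassifier constructor shown earlier.

# Timing prediction on the test set with the already-fitted model.
start = time.perf_counter()
preds = model.predict(test)
print(f"Prediction took {time.perf_counter() - start:.3f} s")
```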
If you had more time, what improvements would you make, and in what order of priority?
- PCA visualization.
- Try a voting classifier of deep CatBoost, LightGBM and XGBoost, which would be computationally expensive (see the sketch after this list).
- Write unit testing functions for the data preprocessing part.
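A possible starting point for the voting-classifier idea above, using sklearn's `VotingClassifier` over the three boosting libraries; the hyperparameters are placeholders, and `X_train`, `y_train`, `test` come from the loading sketch earlier.

```python
from sklearn.ensemble import VotingClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Soft-voting ensemble over the three boosting libraries; all settings are illustrative.
ensemble = VotingClassifier(
    estimators=[
        ("cat", CatBoostClassifier(iterations=2000, learning_rate=0.3, verbose=0)),
        ("lgbm", LGBMClassifier(n_estimators=2000)),
        ("xgb", XGBClassifier(n_estimators=2000)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
preds = ensemble.predict(test)
```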