This is NCKU DSAI HW4. This README shows how to run our code and briefly explains our approach to the competition.
Note: the preprocessed data is not uploaded because of the file size limit. The class TA can get our full source code from the link below. Our best result in the Kaggle competition is not our final submitted version, because that model takes more than one hour to train.
pip install -r requirements.txt
cd code
python main.py
The competition we entered is Predict Future Sales on Kaggle. The goal is straightforward: predict future sales values. Submissions are evaluated and ranked by RMSE (root mean squared error).
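For readers unfamiliar with the metric, RMSE can be computed as in the minimal sketch below. The function name `rmse` and the sample values are made up for illustration; this is not the competition's scoring code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, for illustration only.
actual = [0, 1, 3, 2]
predicted = [0, 2, 1, 2]
score = rmse(actual, predicted)
print(round(score, 4))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.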
https://drive.google.com/file/d/1AEp-gv1t2wY_fIxtoClNG6nj-5zhr_0U/view?usp=sharing
This section briefly describes how our code is designed. It breaks down into the following parts.
- Data Cleansing
- Preprocessing
- Feature Engineering
- Model Fitting
- Result
In this step, we remove outliers and duplicate data. The specific mechanism can be found in our slides.
- Merge duplicate shops
- Remove shops that do not appear in the test set
- Remove outliers
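The three cleansing steps above could look roughly like the following pandas sketch. It is not taken from our `main.py`: the column names, the duplicate-shop mapping `DUPLICATE_SHOPS`, and the outlier thresholds are all assumptions chosen for illustration, and the data is a toy stand-in.

```python
import pandas as pd

# Toy stand-ins for the train and test tables (values are made up).
train = pd.DataFrame({
    "shop_id": [0, 57, 5, 5, 9],
    "item_id": [10, 10, 11, 12, 13],
    "item_price": [100.0, 100.0, 250.0, 999999.0, 50.0],
    "item_cnt_day": [1, 2, 1, 1, 3],
})
test = pd.DataFrame({"shop_id": [57, 5], "item_id": [10, 11]})

# Hypothetical mapping: duplicate shop id -> shop id that is kept.
DUPLICATE_SHOPS = {0: 57}
# Hypothetical outlier thresholds.
MAX_PRICE = 100_000
MAX_DAILY_COUNT = 1_000


def clean(train, test):
    cleaned = train.copy()
    # 1. Merge duplicate shops by rewriting their ids.
    cleaned["shop_id"] = cleaned["shop_id"].replace(DUPLICATE_SHOPS)
    # 2. Keep only shops that also appear in the test set.
    cleaned = cleaned[cleaned["shop_id"].isin(test["shop_id"].unique())]
    # 3. Drop rows with implausible prices or daily counts.
    plausible = (
        (cleaned["item_price"] > 0)
        & (cleaned["item_price"] < MAX_PRICE)
        & (cleaned["item_cnt_day"] < MAX_DAILY_COUNT)
    )
    return cleaned[plausible].reset_index(drop=True)


cleaned = clean(train, test)
print(cleaned)
```

Merging the duplicate shops first matters here: a shop whose old id is missing from the test set would otherwise be dropped in step 2 even though its merged id is present.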
We observed that the test set data frame is structured quite differently from the train data frame, so we reorganised the train data frame. The specific mechanism can be found in our slides.
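One common way to do this kind of reorganisation is to aggregate daily sales into monthly totals on a (month, shop, item) grid, so that each train row has the same shape as a test row. The sketch below illustrates that idea under assumed column names (`date_block_num`, `shop_id`, `item_id`, `item_cnt_day`) with toy data; it is not necessarily the exact mechanism described in our slides.

```python
from itertools import product

import pandas as pd

# Toy daily sales records (values are made up).
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [5, 5, 6, 5],
    "item_id": [10, 10, 11, 11],
    "item_cnt_day": [1, 2, 4, 3],
})

KEYS = ["date_block_num", "shop_id", "item_id"]


def build_monthly_grid(daily):
    # Sum daily counts into one monthly total per (month, shop, item).
    monthly = (
        daily.groupby(KEYS, as_index=False)["item_cnt_day"]
        .sum()
        .rename(columns={"item_cnt_day": "item_cnt_month"})
    )
    # For each month, pair every shop with every item seen in that month.
    rows = []
    for month, block in daily.groupby("date_block_num"):
        rows.extend(
            product([month], block["shop_id"].unique(), block["item_id"].unique())
        )
    grid = pd.DataFrame(rows, columns=KEYS)
    # Pairs with no recorded sales get an explicit zero.
    grid = grid.merge(monthly, on=KEYS, how="left")
    grid["item_cnt_month"] = grid["item_cnt_month"].fillna(0)
    return grid


grid = build_monthly_grid(daily)
print(grid)
```

The explicit zero rows let the model learn from shop and item pairs that sold nothing in a given month, which the raw transaction log never records.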
In this step, we try to find as many useful features as possible. We did some research on Kaggle and added some new features. The following list shows the features that we added to our data frame.
- Item name grouping
- Merge music artist / first word of item
- Item name length
- Time feature
- Price feature
- Item category
- Shop city
- Number of unique item features
- Percentage change in an aggregate feature
The specific mechanism can be found in our slides.
We tried the following models.
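To make a few of the listed features concrete, here is a small pandas sketch of the item name length, the first word of the item name, and a percentage change in an aggregate. The column names and toy data are assumptions, and the percentage change is computed from lagged values (one and two months back) so that it does not use the month being predicted; the exact definitions in our code may differ.

```python
import pandas as pd

# Toy item table (names are made up).
items = pd.DataFrame({
    "item_id": [10, 11],
    "item_name": ["Beatles Abbey Road CD", "Chess set"],
})
# Item name length and first word of the item name.
items["item_name_length"] = items["item_name"].str.len()
items["item_name_first_word"] = items["item_name"].str.split().str[0].str.lower()

# Toy monthly totals per item (values are made up).
monthly = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1, 2],
    "item_id": [10, 10, 10, 11, 11, 11],
    "item_cnt_month": [4.0, 6.0, 3.0, 2.0, 2.0, 5.0],
}).sort_values(["item_id", "date_block_num"])

# Sales of the same item one and two months earlier.
per_item = monthly.groupby("item_id")["item_cnt_month"]
monthly["cnt_lag_1"] = per_item.shift(1)
monthly["cnt_lag_2"] = per_item.shift(2)
# Percentage change between the two previous months.
monthly["cnt_pct_change"] = (
    monthly["cnt_lag_1"] - monthly["cnt_lag_2"]
) / monthly["cnt_lag_2"]

print(items)
print(monthly)
```

Rows without enough history get `NaN` in the lag columns, which tree-based models such as LightGBM can handle without imputation.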
- LightGBM - final model
- XGBoost - bad performance, time limit exceeded
- Deep neural network - GPU environment not permitted
We use LightGBM as our final model. The specific mechanism can be found in our slides.
https://docs.google.com/presentation/d/1um4vZ-wj9UeHzZp4CjvPU9EnJjlhM15c9_35MBN1Vqc/edit?usp=sharing
The following picture is our best score.