Data Mining G07 Final Project (Kaggle competition link: https://www.kaggle.com/c/yelp-recruiting)
- numpy == 1.14.3
- pandas == 0.23.4
- scikit-learn == 0.20.0 (imported as `sklearn`)
- keras == 2.1.6
- tensorflow == 1.9.0
Before running anything, download the dataset from Kaggle into the directory 'data/'. (https://www.kaggle.com/c/yelp-recruiting/data)
$ tree
.
├── data (need to download first)
│ ├── yelp_test_set
│ │ ├── yelp_test_set_business.json
│ │ ├── yelp_test_set_checkin.json
│ │ ├── yelp_test_set_review.json
│ │ └── yelp_test_set_user.json
│ ├── yelp_training_set
│ │ ├── yelp_training_set_business.json
│ │ ├── yelp_training_set_checkin.json
│ │ ├── yelp_training_set_review.json
│ │ └── yelp_training_set_user.json
│ └── sample_submission.csv
├── feature
│ ├── test_X.pkl
│ ├── X.pkl
│ └── Y.pkl
├── model
│ └── DNN.h5
├── submit
│ ├── DNN.csv
│ ├── ensemble.csv
│ ├── LR.csv
│ ├── RFR.csv
│ └── SVR.csv
├── data_preprocessing.py
├── DNN.py
├── ensemble.py
├── LR.py
├── README.md
├── Report.pdf
├── RFR.py
└── SVR.py
6 directories, 26 files
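Before running the preprocessing step, it can help to confirm that the download landed where the tree above expects it. The following is a minimal sketch and not part of the repository: the helper name `missing_data_files` is hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# File names copied from the directory tree above (relative to 'data/').
EXPECTED_FILES = [
    "yelp_test_set/yelp_test_set_business.json",
    "yelp_test_set/yelp_test_set_checkin.json",
    "yelp_test_set/yelp_test_set_review.json",
    "yelp_test_set/yelp_test_set_user.json",
    "yelp_training_set/yelp_training_set_business.json",
    "yelp_training_set/yelp_training_set_checkin.json",
    "yelp_training_set/yelp_training_set_review.json",
    "yelp_training_set/yelp_training_set_user.json",
    "sample_submission.csv",
]


def missing_data_files(data_dir="data"):
    """Return the expected dataset files that are absent from data_dir.

    Hypothetical helper for illustration; the project scripts do not define it.
    """
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files under data/:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected dataset files were found.")
```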
After downloading the dataset, run:
$ python3 data_preprocessing.py
Check the saved files 'test_X.pkl', 'X.pkl', and 'Y.pkl' under the directory 'feature/'.
To use Linear Regression (LR), run:
$ python3 LR.py
Check the saved file 'LR.csv' under the directory 'submit/'.
To use the Deep Neural Network (DNN), run:
$ python3 DNN.py
Check the saved file 'DNN.csv' under the directory 'submit/'.
To use the Support Vector Regressor (SVR), run:
$ python3 SVR.py
Check the saved file 'SVR.csv' under the directory 'submit/'.
To use the Random Forest Regressor (RFR), run:
$ python3 RFR.py
Check the saved file 'RFR.csv' under the directory 'submit/'.
Before running the ensemble, make sure that 'SVR.csv', 'DNN.csv', and 'RFR.csv' exist under the directory 'submit/'. Then run:
$ python3 ensemble.py
Check the saved file 'ensemble.csv' under the directory 'submit/'.
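This README does not say how `ensemble.py` combines the three submissions. As one possible illustration only, the sketch below takes the unweighted mean of the predictions, row by row. It is not necessarily what the project does. The function names are hypothetical, and the column names `id` and `votes` are assumptions that should be checked against 'data/sample_submission.csv'.

```python
import csv


def read_predictions(path, id_col="id", value_col="votes"):
    """Read one submission CSV into a dict mapping row id to prediction."""
    with open(path, newline="") as f:
        return {row[id_col]: float(row[value_col]) for row in csv.DictReader(f)}


def average_submissions(paths, out_path, id_col="id", value_col="votes"):
    """Write the per-row unweighted mean of several submission files.

    Assumption: every file has the same ids and the same two columns.
    """
    preds = [read_predictions(p, id_col, value_col) for p in paths]
    ids = list(preds[0])  # keep the row order of the first file
    for other in preds[1:]:
        if set(other) != set(ids):
            raise ValueError("submission files do not cover the same ids")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([id_col, value_col])
        for key in ids:
            mean = sum(p[key] for p in preds) / len(preds)
            writer.writerow([key, mean])


if __name__ == "__main__":
    # Writes to a separate, hypothetical file name so the project's
    # own 'ensemble.csv' is left untouched.
    average_submissions(
        ["submit/SVR.csv", "submit/DNN.csv", "submit/RFR.csv"],
        "submit/mean_ensemble_sketch.csv",
    )
```

A weighted mean, for example one that favours the stronger DNN predictions, is a small change to the `mean` line.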
Method | Private Score | Public Score | Ranking |
---|---|---|---|
LR | 0.50826 | 0.50889 | 94/350 |
DNN | 0.48545 | 0.48405 | 62/350 |
SVR | 0.50575 | 0.50356 | 87/350 |
RFR | 0.52554 | 0.52579 | 120/350 |
ensemble | 0.48340 | 0.48241 | 59/350 |
Evaluation Metric: Root Mean Squared Logarithmic Error ("RMSLE"). Lower scores are better.
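For local validation, RMSLE can be computed as the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. The sketch below uses only the standard library. The function name `rmsle` is hypothetical and is not taken from the project scripts.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between two equal-length sequences.

    Assumes every value is greater than -1 (vote counts are non-negative),
    since log(1 + x) is undefined otherwise.
    """
    y_true = list(y_true)
    y_pred = list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(actual)) ** 2
        for actual, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


if __name__ == "__main__":
    # Toy values for illustration only; not taken from the dataset.
    print(rmsle([0, 3, 10], [1, 2, 8]))
```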