ETHz CIL 2023 Collaborative Filtering
conda create --name cil python=3.9
conda activate cil
pip install -r requirements.txt
Create a directory /data, then put data_train.csv and sampleSubmission.csv inside it.
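As a sketch, the data setup can look like the following (the download location of the CSV files is an assumption and shown only as a comment):

```shell
# Create the data directory at the repository root
mkdir -p data
# After downloading the two CSV files, move them in, e.g.:
#   mv /path/to/data_train.csv data/
#   mv /path/to/sampleSubmission.csv data/
ls data
```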
python cross_validation.py config_submission_cv1.yaml
python cross_validation.py config_submission_cv2.yaml
python train.py config_submission_ensemble.yaml
The result will be written to the directory /output/submission under the name ensemble_GradientBoost_['bfm_op_rk32_iter1000_cv10', 'bfm_reg_rk16_iter1000_cv10']_10.csv.
As running cross validation can take a long time, we provide 10-fold full prediction results for BFM+reg+rank16+iters1000 and BFM+oprobit+rank32+iters1000. You can find them in this link. Download them and move all txt files to the directory /output/data_ensemble. Then the submission results can be generated with:
python train.py config_submission_ensemble.yaml
experiment_args/model_name: Model name.
experiment_args/generate_submissions: True: use the entire dataset to generate submissions; False: split the dataset for validation.
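A minimal illustrative YAML fragment with these keys (the key names come from this section; the model name value and overall file layout are assumptions):

```yaml
experiment_args:
  model_name: "bfm"           # name of the model to train (placeholder value)
  generate_submissions: True  # use the full dataset and write a submission file
```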
python train.py config.yaml
Please note that when training an NCF model, a crash might occur randomly due to either a zero-shaped tensor or a segmentation fault. If that happens, simply rerun the training.
experiment_args/model_name: Model name.
experiment_args/generate_submissions: False.
experiment_args/save_full_pred: False.
ensemble_args/fold_number: Fold number.
python cross_validation.py config_cv.yaml
experiment_args/model_name: Model name.
experiment_args/generate_submissions: False.
experiment_args/save_full_pred: False.
ensemble_args/fold_number: Fold number.
Modify the parameters in the function grid_search.
python grid_search.py config_cv.yaml
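The repository's actual grid_search function is not shown here; as an illustration only, a sweep over hypothetical hyperparameters (names and values are assumptions, not the real search space) could be structured like:

```python
from itertools import product

# Hypothetical search space; the real parameters live in grid_search in grid_search.py
param_grid = {
    "rank": [8, 16, 32],
    "iters": [300, 1000],
}

# Enumerate every combination, as a typical grid search would
combos = list(product(param_grid["rank"], param_grid["iters"]))
for rank, iters in combos:
    config = {"rank": rank, "iters": iters}
    # train and cross-validate the model with `config` here
    print(config)
```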
experiment_args/model_name: Model name.
experiment_args/model_instance_name: A prefix for the saved prediction filenames.
experiment_args/save_full_pred: True: the prediction values of fold x will be saved to the path ensemble_args/data_ensemble + experiment_args/model_instance_name + "_fold_{fold number}_train/test.txt". The train/test suffix in the filename refers to the predictions for the ids provided in data_train.csv and sampleSubmission respectively.
ensemble_args/fold_number: Fold number.
python cross_validation.py config_cv.yaml
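The fold-prediction naming scheme described above can be sketched as follows (the directory, instance name, and fold count are placeholder assumptions, not the real config values):

```python
# Illustration of the save_full_pred naming scheme; all concrete values are placeholders.
data_ensemble_dir = "output/data_ensemble/"          # ensemble_args/data_ensemble
model_instance_name = "bfm_reg_rk16_iter1000_cv10"   # experiment_args/model_instance_name

for fold in range(10):                 # ensemble_args/fold_number = 10
    for split in ("train", "test"):    # predictions for data_train.csv / sampleSubmission ids
        path = f"{data_ensemble_dir}{model_instance_name}_fold_{fold}_{split}.txt"
        print(path)
```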
experiment_args/model_name: "ensemble".
ensemble_args/fold_number: Fold number; it should be the same as the one used when generating the cross-validation results.
ensemble_args/regressor: "linear", "SGD", "BayesianRidge", or "GradientBoost". Regressor type for blending.
ensemble_args/models: List of model instances used for blending. The K-fold prediction results are saved in the format "[prefix]_fold_x_train/test.txt"; enter the prefix strings here.
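As a sketch of how such blending typically works (this is not the repository's implementation; the sklearn regressor, toy data, and array shapes are all assumptions), the K-fold base-model predictions can be stacked as features for a second-level regressor:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy stand-ins for per-model K-fold predictions on the training ids:
# one column per base model (e.g. the two BFM variants), one row per rating.
n_ratings, n_models = 200, 2
train_preds = rng.uniform(1, 5, size=(n_ratings, n_models))
true_ratings = train_preds.mean(axis=1) + rng.normal(0, 0.1, n_ratings)

# Fit the blending regressor on the base-model predictions
blender = GradientBoostingRegressor(random_state=0)
blender.fit(train_preds, true_ratings)

# Apply it to the base-model predictions on the submission ids
test_preds = rng.uniform(1, 5, size=(50, n_models))
blended = blender.predict(test_preds)
print(blended.shape)
```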
cv_args/weight_entries: "True".
cv_args/sample_proportion: Proportion of training data sampled for each fold.
ensemble_args/fold_number: (recommended) a large number of folds.
(recommended) If possible, configure the model you are running with a simple structure, e.g., low rank.
python train.py config.yaml
- MyFM library: https://github.com/tohtsky/myFM
- Surprise library: https://surpriselib.com/
- Microsoft Recommenders library: https://github.com/microsoft/recommenders/