This is the 3rd place solution source code to Kaggle's Riiid! Answer Correctness Prediction competition. For a brief write-up and comments please check the discussion topic on Kaggle.
The solution will be presented at the 35th AAAI Conference on Artificial Intelligence (2021).
- Threadripper 1950x + 256 GiB RAM + 6 x RTX 3090 GPUs computer
- Ryzen 9 3950x + 64 GiB RAM + 1 x RTX 3090 GPU computer
Clone this repo and create the input
directory:
git clone https://github.com/jamarju/riiid-acp-pub
mkdir input
Unzip the dataset into input
or just copy over the required files:
train.csv
lectures.csv
questions.csv
Alternatively, you can just create a symlink from your dataset path to input
:
ln -s /your/dataset/path input
Install conda env and run jupyter:
conda env create -f env/env.yaml
conda activate riiid-acp
jupyter notebook --ip 0.0.0.0 --no-browser --NotebookApp.iopub_msg_rate_limit=10000000000
Run 01_pre.ipynb
to preprocess data. A minimum 128 GiB RAM is required (a swapfile is required if your computer has less than 128 GiB of RAM). This will generate the following pkl files in input/
input/data_v210101b.pkl
input/meta_v210101b.pkl
Run 02_train.ipynb
to train the model. The default parameters will produce an AUROC score of 0.812 using 2.5% holdout validation users.
The script supports distributed training on multi-GPU setups. See the instructions at the beginning of the notebook for the exact steps.
Additionally more models can be trained and later ensembled changing the number of encoder/decoder layers, heads, transformer activation, dropout, T-Fixup initialization and optimizer without further changes to the code by simply changing main
's default parameters. Output:
models/best210105.pth
Run 03_pre_sub.ipynb
to prepare data for submission. This will cut down user's historic data to the last 500 interactions. Outputs:
input/data_500_last_interactions_v210101b.pkl
input/data_attempt_num_v210101b.npy
input/data_attempts_correct_v210101b.npy
Run 04_pre_validation_set.ipynb
to generate a validation split off of train.csv
in a format suitable for the inference script (similar to example_test.csv
). Outputs:
input/validation_x_0.025.csv
input/validation_y_0.025.csv
input/validation_submission_0.025.csv
Copy or hard-link the trained models and the following files into kaggle_dataset/root/resources
:
ln input/data_500_last_interactions_v210101b.pkl kaggle_dataset/root/resources
ln input/data_attempt_num_v210101b.npy kaggle_dataset/root/resources
ln input/data_attempts_correct_v210101b.npy kaggle_dataset/root/resources
ln input/meta_v210101b.pkl kaggle_dataset/root/resources
For convenience the following two pre-trained models are provided in kaggle/root/resources
:
210105_0.812154_gelu_e4d4_ep30.pth
210105_0.812534_relu_e3e3.pth
At this point, ls -l kaggle_dataset/root/resources
should look like this:
-rw-rw-r-- 1 javi javi 79921941 Jan 15 23:06 210105_0.812154_gelu_e4d4_ep30.pth
-rw-rw-r-- 1 javi javi 65210849 Jan 15 23:06 210105_0.812534_relu_e3e3.pth
-rw-rw-r-- 2 javi javi 6811424004 Jan 15 22:54 data_500_last_interactions_v210101b.pkl
-rw-rw-r-- 2 javi javi 6085350128 Jan 15 22:54 data_attempt_num_v210101b.npy
-rw-rw-r-- 2 javi javi 6085350128 Jan 15 22:54 data_attempts_correct_v210101b.npy
-rw-rw-r-- 2 javi javi 1953960 Jan 11 11:12 meta_v210101b.pkl
Build the resources dataset:
cd kaggle_dataset
make
Run 05_inference.ipynb
.
If you trained your own models, set the H1
and H2
dicts to the appropriate training hyperparams.
The default inference notebook will attempt to ensemble up to two models with dynamic fallback to single model inference in order to fulfill the allocated time budget (8.75h by default).
It should produce an AUROC=0.816 on the default 0.025 user holdout validation set. Sample output:
10000it [6:32:02, 2.35s/it, model 1=2482136, model 1+2=2482075, eta=6.581/8.726, auroc (pub)=0.816131, auroc (pvt)=0.816148]
The script will also produce a submission.csv
file with predictions in the same format as example_sample_submission.csv
.
Alternatively, you can convert the notebook into a raw .py script that can be launched from the command line instead of jupyter's ipython interpreter. The execution should run faster and use less RAM this way. See instructions in the notebook.
- Kaggle and Riiid for organizing this challenging competition
- @wuwenmin for his fast ROC AUC calculation routine