/recsys-challenge-2024-ekstrabladet

The complete code and notebooks used for the ACM Recommender Systems Challenge 2024 by our team FeatureSalad at Politecnico di Milano


ACM RecSys Challenge 2024 [FeatureSalad Team]


About the challenge

From the Challenge website (hosted by Ekstra Bladet):

The dataset for the ACM RecSys Challenge 2024, named EB-NeRD, is a large-scale Danish dataset created by Ekstra Bladet to support advancements and benchmarking in news recommendation research. EB-NeRD includes data from over 2.3 million users and more than 380 million impression logs collected from Ekstra Bladet. The dataset was compiled by recording behavior logs from active users during a six-week period from April 27 to June 8, 2023. This specific timeframe was chosen to avoid major events, such as holidays or elections, that could result in atypical user behavior on Ekstra Bladet. To protect user privacy, anonymization was implemented using one-time salt mapping. In addition to user interaction data, the dataset includes news articles published by Ekstra Bladet, enriched with textual context features such as titles, abstracts, bodies, and categories. Moreover, the dataset provides features generated by proprietary models, including topics, named entity recognition (NER), and article embeddings.

Participants were asked to predict which article a user will click on from the list of articles seen during a specific impression. In particular, the challenge's objective is to estimate the likelihood of a user clicking on each article by evaluating the compatibility between the article's content and the user's preferences. The articles are ranked by these likelihood scores, and the precision of the rankings is measured against the actual selections made by users.

Team members

We participated in the challenge as FeatureSalad, a team of 6 MSc students from Politecnico di Milano:

We worked under the supervision of:

Reproduce best submission

Download the dataset

The first step is to download the dataset's .parquet files and place them in the dataset/ folder. In the end, you should have a structure as follows:

├── Ekstra_Bladet_contrastive_vector
│   └── contrastive_vector.parquet
...
├── ebnerd_demo
│   ├── articles.parquet
│   ├── train
│   │   ├── behaviors.parquet
│   │   └── history.parquet
│   └── validation
│       ├── behaviors.parquet
│       └── history.parquet
...
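
To sanity-check the download, the .parquet files can be opened directly. Below is a minimal sketch using polars (pandas reads the same files); the ebnerd_demo split is used as an example, so adjust the paths to the split you downloaded:

import polars as pl

# Load the core EB-NeRD tables from the demo split
articles = pl.read_parquet("dataset/ebnerd_demo/articles.parquet")
behaviors = pl.read_parquet("dataset/ebnerd_demo/train/behaviors.parquet")
history = pl.read_parquet("dataset/ebnerd_demo/train/history.parquet")

# behaviors holds the impression logs, history the users' past reads,
# and articles the textual features of each news item
print(behaviors.shape, history.shape, articles.shape)
print(behaviors.head())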

Preprocessing

To run the full preprocessing:

sh ~/RecSysChallenge2024/src/polimi/scripts/run_all_preprocessing.sh

The complete preprocessed dataset is now available at ~/RecSysChallenge2024/preprocessing/train_ds.parquet, ready to be used.
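
As a quick sanity check, the file can be inspected directly (a sketch; the click-label column, assumed here to be named target, may differ depending on the preprocessing version):

from pathlib import Path
import polars as pl

train_ds = pl.read_parquet(Path("~/RecSysChallenge2024/preprocessing/train_ds.parquet").expanduser())
print(train_ds.shape)
print(train_ds["target"].value_counts())  # assumed binary click label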

Stacking

Our submission is based on stacking. To do so, we need to create a dataframe containing the "first-level" models' predictions. The procedure is:

  1. Train the level one models on the train set
  2. Run inference over the validation set with these models
  3. Train the level two models using the validation set augmented with the level one models' predictions

Then, to create the corresponding level two input for the test set:

  1. Train the level one models on the train + validation set
  2. Run inference over the test set with these models

Finally, we generate the test set predictions using the previously trained level two models and return as the final prediction the average of all level two models' predictions. The table below summarizes which models are used at each level; a minimal sketch of the whole procedure follows it.
Model         Type        Level 1   Level 2
Catboost      Classifier     *         *
Catboost      Ranker         *
LightGBM      Classifier     *         *
LightGBM      Ranker         *
MLP           Classifier     *
GANDALF       Classifier     *
DEEP & CROSS  Classifier     *
WIDE & DEEP   Classifier     *
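
For intuition, here is a minimal, self-contained sketch of the stacking scheme described above. It is illustrative only: the validation file name (validation_ds.parquet) and the column names (target, impression_id, user_id) are assumptions rather than the repo's actual layout, and only two of the level one models are shown.

import numpy as np
import polars as pl
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Level 1: train base models on the train split
train = pl.read_parquet("preprocessing/train_ds.parquet").to_pandas()
valid = pl.read_parquet("preprocessing/validation_ds.parquet").to_pandas()  # hypothetical path
features = [c for c in train.columns if c not in ("target", "impression_id", "user_id")]

level1 = {
    "catboost_cls": CatBoostClassifier(verbose=0),
    "lgbm_cls": LGBMClassifier(),
}
for name, model in level1.items():
    model.fit(train[features], train["target"])
    # augment the validation set with each level one model's prediction
    valid[f"{name}_pred"] = model.predict_proba(valid[features])[:, 1]

# Level 2: train meta-models on the augmented validation set
meta_features = features + [f"{name}_pred" for name in level1]
level2 = [CatBoostClassifier(verbose=0), LGBMClassifier()]
for model in level2:
    model.fit(valid[meta_features], valid["target"])

# Final prediction: average of the level two scores (in the real pipeline this is
# computed on the test set, after retraining level one on train + validation)
avg_scores = np.mean([m.predict_proba(valid[meta_features])[:, 1] for m in level2], axis=0)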

Training

The following table lists the configuration file with the hyperparameters used for each model. The neural models were not trained on the second version of the preprocessing due to time constraints.

Model         Type        Configuration Path
Catboost      Classifier  ~/RecSysChallenge2024/configuration_files/catboost_classifier_recsys_best.json
Catboost      Ranker      ~/RecSysChallenge2024/configuration_files/catboost_ranker_new_noK_95.json
LightGBM      Classifier  ~/RecSysChallenge2024/configuration_files/lightgbm_cls_recsys_trial_107.json
LightGBM      Ranker      ~/RecSysChallenge2024/configuration_files/lightgbm_ranker_recsys_trial_219.json
MLP           Classifier  ~/RecSysChallenge2024/configuration_files/mlp_tuning_new_trial_208_early_stopped_long_with_pre.json
GANDALF       Classifier  ~/RecSysChallenge2024/configuration_files/gandalf_tuning_new_trial_130_early_stopped_with_pre.json
DEEP & CROSS  Classifier  ~/RecSysChallenge2024/configuration_files/deep_cross_tuning_new_trial_67_early_stopped_with_pre.json
WIDE & DEEP   Classifier  ~/RecSysChallenge2024/configuration_files/wide_deep_new_trial_72_early_stopped_with_pre.json

Note that training each of these models requires the path of the desired preprocessing version, along with the corresponding configuration file path; pass them as command line arguments.

Moreover, all models except the rankers were trained on a subsample of the dataset. To create a subsample of the preprocessing, run the following script:

python ~/RecSysChallenge2024/src/polimi/preprocessing_pipelines/subsample_train.py \
     -output_dir ~/RecSysChallenge2024/experiments/ \
     -dataset_dir ~/RecSysChallenge2024/preprocessing/... \
     -original_path ~/RecSysChallenge2024/dataset/ebnerd_small/train/behaviors.parquet

where -dataset_dir is the path of the directory containing the train_ds.parquet preprocessing file.
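
For illustration, a subsample that keeps whole impressions together could be built as follows (a sketch only: the impression_id column name is an assumption, and the actual script may use a different strategy):

import polars as pl

train_ds = pl.read_parquet("preprocessing/train_ds.parquet")

# Sample half of the impressions, keeping every candidate row of a sampled
# impression so each slate stays complete
sampled_ids = train_ds.select("impression_id").unique().sample(fraction=0.5, seed=42)
subsample = train_ds.join(sampled_ids, on="impression_id", how="semi")
subsample.write_parquet("experiments/subsampled_train_ds.parquet")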

Catboost classifier

python ~/RecSysChallenge2024/src/polimi/scripts/catboost_training.py \
    -output_dir  ~/RecSysChallenge2024/models \
    -dataset_path ~/RecSysChallenge2024/preprocessing/... \
    -catboost_params_file ~/RecSysChallenge2024/configuration_files/catboost_classifier_recsys_best.json \
    -catboost_verbosity 20 \
    -model_name catboost_classifier

LGBM classifier

python ~/RecSysChallenge2024/src/polimi/scripts/lightgbm_training.py \
    -output_dir ~/RecSysChallenge2024/models \
    -dataset_path ~/RecSysChallenge2024/preprocessing/... \
    -lgbm_params_file ~/RecSysChallenge2024/configuration_files/... 

Neural Networks (MLP, GANDALF, DEEP & CROSS and WIDE & DEEP)

python ~/RecSysChallenge2024/src/polimi/scripts/nn_training.py \
    -output_dir ~/RecSysChallenge2024/models \
    -dataset_path ~/RecSysChallenge2024/preprocessing/... \
    -params_file ~/RecSysChallenge2024/configuration_files/... \
    -model_name ...

Ranker models

In our solution, the Catboost ranker was trained in batches due to memory limitations. An example of the procedure can be found in ~/RecSysChallenge2024/src/polimi/scripts/catboost_ranker_batch_training; inside that folder, the file _procedure_batch_training.txt explains the procedure step by step.
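
For intuition, this kind of incremental training can be expressed with CatBoost's boosting continuation (the init_model argument of fit). The sketch below assumes pre-split batch files and hypothetical column names (target, impression_id); refer to the txt file above for the actual procedure:

import polars as pl
from catboost import CatBoostRanker, Pool

N_BATCHES = 4  # hypothetical number of pre-split batch files
model = None
for i in range(N_BATCHES):
    batch = pl.read_parquet(f"preprocessing/batches/train_ds_{i}.parquet").to_pandas()
    features = [c for c in batch.columns if c not in ("target", "impression_id", "user_id")]
    pool = Pool(
        data=batch[features],
        label=batch["target"],
        group_id=batch["impression_id"],  # rows assumed sorted so each impression is contiguous
    )
    ranker = CatBoostRanker(iterations=1000, loss_function="YetiRank")
    ranker.fit(pool, init_model=model)  # continue boosting from the previous batch's model
    model = ranker

model.save_model("models/catboost_ranker_batched.cbm")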

If there are no memory constraints, you can train the LightGBM and Catboost rankers using the same scripts as the classifiers described above by passing the --ranker argument.

Inference

For Catboost/LightGBM, you can use the following script:

python ~/RecSysChallenge2024/src/polimi/scripts/inference.py \
   -output_dir ~/RecSysChallenge2024/inference \
   -dataset_path ~/RecSysChallenge2024/preprocessing/... \
   -model_path ~/RecSysChallenge2024/models/{model_name}/model.joblib \
   -behaviors_path ~/RecSysChallenge2024/dataset/ebnerd_testset/test/behaviors.parquet \
   -batch_size 1000000 \
   --submit

Otherwise, for NN models:

python ~/RecSysChallenge2024/src/polimi/scripts/nn_inference_batched.py \
    -output_dir ~/RecSysChallenge2024/inference \
    -dataset_path ~/RecSysChallenge2024/preprocessing/... \
    -model_path ~/RecSysChallenge2024/models/{model_name} \
    -params_file ~/RecSysChallenge2024/configuration_files/... \
    -batch_size 5096 \
    -behaviors_path ~/RecSysChallenge2024/dataset/ebnerd_testset/test/behaviors.parquet \
    --submit

For inference over the validation set, pass the --eval flag; otherwise, use the --submit flag.

Level 2 train dataset

python ~/RecSysChallenge2024/src/polimi/scripts/preprocessing_level_2.py \
    -features_dir ~/RecSysChallenge2024/experiments \
    -model_json ~/RecSysChallenge2024/configuration_files/... \
    -output_dir ~/RecSysChallenge2024/stacking \
    --train

where -features_dir is the path of the directory that contains the level 1 models' features (predictions). Remove the --train flag when building the level 2 dataset for the test set.

Regarding configuration files, these are the ones used for the final submission:

Model     Type        Configuration Path
CatBoost  Classifier  ~/RecSysChallenge2024/configuration_files/stacking_catboost_cls_features_double_iterations.json
LightGBM  Classifier  ~/RecSysChallenge2024/configuration_files/stacking_lgbm_cls_features.json

Last step

Finally, the last step is to average the predictions of the two level 2 models.

 python ~/RecSysChallenge2024/src/polimi/scripts/generate_hybrid_submission.py \
     -prediction_1 ~/RecSysChallenge2024/inference/Inference_stacking_Catboost/prediction_ds.parquet \
     -prediction_2 ~/RecSysChallenge2024/inference/Inference_stacking_LightGBM/prediction_ds.parquet \
     -original_path ~/RecSysChallenge2024/dataset/ebnerd_testset/test/behaviors.parquet \
     -output_dir ~/RecSysChallenge2024/inference/Inference_stacking_Hybrid
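
For clarity, the averaging performed by this last step boils down to something like the following sketch (the column names impression_id, article, and prediction are assumptions about the layout of the prediction files):

import polars as pl

p1 = pl.read_parquet("inference/Inference_stacking_Catboost/prediction_ds.parquet")
p2 = pl.read_parquet("inference/Inference_stacking_LightGBM/prediction_ds.parquet")

# Average the two level two models' scores per (impression, article) pair
hybrid = (
    p1.join(p2, on=["impression_id", "article"], suffix="_lgbm")
      .with_columns(((pl.col("prediction") + pl.col("prediction_lgbm")) / 2).alias("prediction"))
      .select("impression_id", "article", "prediction")
)
hybrid.write_parquet("inference/Inference_stacking_Hybrid/prediction_ds.parquet")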