Synerise at ACM Twitter RecSys Challenge 2021

Implementation of our 2nd place solution to the Twitter RecSys Challenge 2021. The goal of the competition was to predict user engagement with 1 billion tweets selected by Twitter. An additional challenge was the test phase: models were evaluated in a very constrained environment, with just 1 CPU core, no GPU, and a 24-hour time limit for all predictions, which leaves about 6 ms per single tweet prediction.

The challenge focuses on the real-world task of tweet engagement prediction in a dynamic environment. It involves predicting four engagement types: Like, Retweet, Quote, and Reply.

Approach

Getting Started

  1. Register and download the training and validation sets from the competition website

  2. Set up a configuration file config.yaml:

    • working_dir - path where all preprocessed files will be saved
    • recsys_data - path to the directory with uncompressed training data parts
    • validation_part - path to the uncompressed validation part
    • max_n_parts - maximum number of training parts used for training; limit it to speed up training
    • max_n_parts_in_memory - number of training parts loaded into memory at the same time; lowering it reduces RAM usage
    • authors_similarity_top_N - maximum number of users considered similar to the current one
    • authors_similarity_threshold - user similarity threshold
    • validation_percentage - percentage of the validation set used for testing; the remaining part is used for finetuning
    • num_validation_chunks - number of validation chunks
    • sketch_width - sketch width for the tweet sketch
    • sketch_depth - sketch depth for the tweet sketch
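
For illustration, a config.yaml following the keys above could look like the example below; all paths and values are placeholders rather than recommended settings:

    working_dir: /path/to/working_dir
    recsys_data: /path/to/recsys_data/train
    validation_part: /path/to/recsys_data/valid/part-00000
    max_n_parts: 50
    max_n_parts_in_memory: 4
    authors_similarity_top_N: 50
    authors_similarity_threshold: 0.5
    validation_percentage: 0.5
    num_validation_chunks: 10
    sketch_width: 128
    sketch_depth: 8
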
  3. Finetune DistilBERT and precompute token sketches

    python bert_finetuning.py

BERT checkpoints will be saved periodically, and you can run the sketch computation on any chosen checkpoint.

Prepare token sketches from a checkpoint of the model trained with the above script. Change ./distilbert_checkpoints/checkpoint-1000 to the path of your most recent checkpoint:

    python prepare_token_embeddings.py --checkpoint-path ./distilbert_checkpoints/checkpoint-1000
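
To point --checkpoint-path at the newest checkpoint automatically, a small helper along these lines can be used (a hypothetical snippet, not part of the repository; it assumes the checkpoint-<step> naming shown above):

    # Pick the checkpoint directory with the highest step number.
    from pathlib import Path

    checkpoints = sorted(
        Path("./distilbert_checkpoints").glob("checkpoint-*"),
        key=lambda p: int(p.name.split("-")[-1]),
    )
    print(checkpoints[-1])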

Apply EMDE to compute the sketches:

    python emde.py
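
For intuition only, the snippet below illustrates the general idea behind EMDE sketches: each embedding is assigned to one of sketch_width buckets in each of sketch_depth independent space partitionings, and the resulting one-hot rows are summed over tokens. It is a simplified stand-in using random-hyperplane hashing; the actual emde.py may build its partitions and aggregate sketches differently, and the function names here are hypothetical:

    import numpy as np

    def make_partitions(dim, depth, width, seed=0):
        # One set of log2(width) random hyperplanes per sketch row;
        # each set splits the embedding space into `width` buckets.
        rng = np.random.default_rng(seed)
        n_planes = int(np.log2(width))
        return rng.normal(size=(depth, n_planes, dim))

    def encode_sketch(embedding, partitions, width):
        # Map a single embedding to a (depth, width) one-hot sketch.
        depth = partitions.shape[0]
        sketch = np.zeros((depth, width), dtype=np.float32)
        for d in range(depth):
            bits = (partitions[d] @ embedding > 0).astype(int)
            bucket = int("".join(map(str, bits)), 2)  # bucket index in [0, width)
            sketch[d, bucket] = 1.0
        return sketch

    # A tweet-level sketch can then be the sum of its token sketches.
    partitions = make_partitions(dim=768, depth=8, width=128)
    token_embeddings = np.random.randn(12, 768)  # stand-in for DistilBERT token vectors
    tweet_sketch = sum(encode_sketch(e, partitions, 128) for e in token_embeddings)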

Steps 2 and 3 can be run simultaneously.

  4. Preprocess the dataset and compute user interactions:

    python interactions_extraction.py

  5. Train the model:

    python train.py
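
Putting it all together, the full pipeline can be run sequentially as follows (the checkpoint path is the example one from above; substitute your own):

    python bert_finetuning.py
    python prepare_token_embeddings.py --checkpoint-path ./distilbert_checkpoints/checkpoint-1000
    python emde.py
    python interactions_extraction.py
    python train.py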