This challenge asks us to predict the 20 Stack Overflow users most likely to answer a question. In my understanding, the goal is to predict users for new questions, not to suggest users for questions that have already been asked.
In that sense, we split the dataset into train, validation, and test sets. The train set is used to fit models, while we run our user predictions on questions from the validation set. The test set is left untouched to evaluate the final model.
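As a rough illustration of that split (a minimal sketch; the actual loading code and split ratios in the notebooks may differ):

```python
from sklearn.model_selection import train_test_split

# `question_ids` is assumed to already be loaded from the raw JSON inputs;
# the 80/10/10 ratio below is illustrative, not the ratio used in the notebooks.
question_ids = list(range(1000))  # placeholder for the real question ids

train_ids, holdout_ids = train_test_split(question_ids, test_size=0.2, random_state=42)
val_ids, test_ids = train_test_split(holdout_ids, test_size=0.5, random_state=42)

print(len(train_ids), len(val_ids), len(test_ids))  # 800 100 100
```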
This setting prevents us from using collaborative filtering methods, because a new question has neither answers nor users associated with it. Such a question faces the cold start problem, like a new movie in the Netflix catalog.
Instead, we leverage content-based approaches by focusing on question titles and descriptions. The core idea is to use different embedding methods (BERT embeddings, and later simpler TF-IDF embeddings) to generate a candidate pool of nearest-neighbor questions for a new question. We then fetch the users who answered these questions and rank them using their answer ratings and user attributes.
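To make this pipeline concrete, here is a minimal sketch of the candidate generation and ranking step. It assumes we already have an embedding matrix for the training questions and a mapping from question ids to (user, answer score) pairs; all names are illustrative, not the repository's actual API:

```python
from collections import defaultdict

from sklearn.neighbors import NearestNeighbors


def recommend_users(new_question_embedding, train_embeddings, train_question_ids,
                    answers_by_question, n_neighbors=50, top_k=20):
    """Return the top_k candidate users for a new question.

    answers_by_question maps a training question id to a list of
    (user_id, answer_score) pairs.
    """
    # 1. Candidate pool: nearest training questions in embedding space.
    ann = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine").fit(train_embeddings)
    _, indices = ann.kneighbors(new_question_embedding.reshape(1, -1))

    # 2. Fetch the users who answered these neighbor questions and aggregate
    #    their answer scores as a simple ranking signal.
    user_scores = defaultdict(float)
    for idx in indices[0]:
        for user_id, answer_score in answers_by_question.get(train_question_ids[idx], []):
            user_scores[user_id] += answer_score

    # 3. Rank candidate users and keep the most promising ones.
    return sorted(user_scores, key=user_scores.get, reverse=True)[:top_k]
```

In the notebooks, the ranking step also incorporates user attributes rather than answer scores alone; this sketch only shows the overall shape of the approach.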
This repository contains notebooks for the different approaches to the challenge, along with an introductory exploratory data analysis (EDA). The notebooks are organised as follows:
- EDA
- Baseline 1: Question and Users Embeddings
  ⚠️ without using precomputed embeddings, the notebook takes long to execute (~1h30)
- Baseline 2: ANN & Candidate Users Ranking
  ⚠️ without using precomputed embeddings, the notebook takes long to execute (~1h)
- Baseline 3: TF-IDF Embeddings & Candidate Users Ranking 🟢 fast embeddings (see the sketch below)
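As a rough idea of what the TF-IDF baseline computes (a minimal scikit-learn sketch, not the exact notebook code), the questions are turned into sparse bag-of-words vectors, which is why no pretrained model or long embedding step is needed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: concatenated title + body of each training question.
train_texts = [
    "how to merge two dataframes in pandas ...",
    "python list comprehension syntax ...",
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
train_embeddings = vectorizer.fit_transform(train_texts)  # sparse (n_questions, n_terms)

# A new question is embedded with the same fitted vocabulary and can then be
# fed to the same nearest-neighbor candidate search as the BERT embeddings.
new_embedding = vectorizer.transform(["merge pandas dataframes on multiple keys"])
```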
The data folder contains the following subfolders:
- inputs: raw JSON inputs that you need to download and unzip from the original challenge repository.
- intermediary: precomputed embeddings that you can optionally download from my Google Drive for a significant compute speed-up.
- results: results of the different approaches.
Various utils that reduce the amount of code in the notebooks and offer consistent methods. It also contains the BERT embedder model used across the different approaches.
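For reference, a BERT embedder of this kind can be sketched as follows; the model name, pooling strategy, and class interface are assumptions for illustration, not necessarily what the utils implement:

```python
import torch
from transformers import AutoModel, AutoTokenizer


class BertEmbedder:
    """Minimal sketch of a sentence embedder built on a pretrained BERT model.

    The model name and mean-pooling strategy are assumptions, not the
    repository's exact choices.
    """

    def __init__(self, model_name="bert-base-uncased", device="cpu"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device).eval()
        self.device = device

    @torch.no_grad()
    def encode(self, texts, max_length=256):
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               max_length=max_length, return_tensors="pt").to(self.device)
        hidden = self.model(**batch).last_hidden_state   # (batch, seq, dim)
        mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding tokens
        return (hidden * mask).sum(1) / mask.sum(1)       # mean pooling -> (batch, dim)
```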
- Python 3.7 or higher
Clone this repository
git clone git@github.com:Vincent-Maladiere/Algolia-ML-challenge-solution.git
Create a Python virtual environment and activate it
python -m venv venv && source venv/bin/activate
Create data folders
mkdir -p data/inputs data/intermediary data/results
Download the input data and place the zip file in data/inputs, then unzip it. On macOS, run
cd data/inputs && unzip ml_challenge.zip && cd ../..
[Optional] Download the precomputed embeddings and place the two pickle files in data/intermediary
Install requirements
pip install -r requirements.txt
Finally, open a notebook by running:
jupyter notebook notebooks --log-level=CRITICAL &