/MVA21-ALTeGraD-Kaggle

[NLP and Graphs] Link prediction project on a network of research papers. Ranked #1/39 on Kaggle.

Primary language: Jupyter Notebook. License: MIT.

Link Prediction Challenge - Kaggle 2022

Overview

Link prediction is the task of predicting whether a link exists between two nodes in a network. Here we predict whether one research paper cites another. For that, we have access to a citation network that includes hundreds of thousands of research publications, as well as their abstracts and author lists.

The pipeline is that of a standard binary classification problem: we learn the parameters of a classifier from known edges, then use the classifier to predict whether two nodes are connected by an edge. Our goal in this project is to transform the different types of data (abstracts, authors and the citation graph) into a feature matrix that we can feed to the classifier. Model performance is evaluated with the log loss metric.
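As an illustration of the evaluation metric, here is a minimal sketch of binary log loss in plain Python. The function name `log_loss`, the clipping constant `eps` and the toy labels and probabilities are assumptions made for this example; they are not taken from the project notebooks.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so that log(0) never occurs.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Toy example (hypothetical data): 1 = the pair is linked by a citation, 0 = it is not.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.6, 0.4]
score = log_loss(labels, probs)
print(f"log loss: {score:.4f}")
```

Lower is better: confident correct predictions contribute almost nothing, while confident wrong predictions are penalized heavily.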

This model was created for the Kaggle competition of the 2021/2022 Advanced Learning for Text and Graph Data (ALTeGraD) course. It ranked first on both the public and private leaderboards.

Team

The team OverTen consists of Xavier Jiménez, Jean Quentin and Sacha Revol.

Submission

The best submission and the results on the validation dataset can be reproduced using the best_submission.ipynb file.

Preprocessing

File Preprocessing.ipynb handles preprocessing for abstracts, authors and graph data.

Feature matrix creation & evaluation

File ALTEGRAD_project_v2.ipynb handles the different steps of feature matrix creation and the evaluation of the classifiers (LR, RF, XGBoost, LGBM, CatBoost). File nn-classifier.ipynb implements the MLP classifier.
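To make the idea of a feature matrix for node pairs concrete, here is a minimal sketch that derives a few common graph features (degree sum, degree difference, common neighbors, Jaccard coefficient) for candidate pairs. This particular feature set, the helper names `build_adjacency` and `pair_features`, and the toy graph are assumptions for illustration; the features actually used are defined in the notebooks.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets built from a list of (u, v) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def pair_features(adj, u, v):
    """Feature vector for the candidate pair (u, v)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = len(nu & nv)
    union = len(nu | nv)
    jaccard = common / union if union else 0.0
    return [len(nu) + len(nv), abs(len(nu) - len(nv)), common, jaccard]

# Hypothetical toy citation graph and candidate pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = build_adjacency(edges)
candidate_pairs = [(1, 3), (0, 4)]

# One row per candidate pair; this matrix is what a classifier would consume.
X = [pair_features(adj, u, v) for u, v in candidate_pairs]
print(X)
```

When building training rows for pairs that are known edges, one would typically compute the features on a graph from which those edges have been removed, so that the label does not leak into the features.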

Author Graphs

Files weighted_co_authors_graph.py, utils.py and citation_graph.py handle the creation of the author graphs.
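As a rough sketch of what a weighted co-author graph can look like, the example below weights each author pair by the number of papers they wrote together and then aggregates those weights into a single score for a pair of papers. The weighting scheme, the function names `weighted_coauthor_graph` and `coauthor_strength`, and the toy data are assumptions; the scripts above may define the graph differently.

```python
from collections import Counter
from itertools import combinations

def weighted_coauthor_graph(paper_authors):
    """Map each unordered author pair to the number of papers they co-wrote."""
    weights = Counter()
    for authors in paper_authors.values():
        # Sort so that every pair is stored under one canonical key.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights

def coauthor_strength(weights, authors_u, authors_v):
    """Sum of co-author weights between the author lists of two papers."""
    total = 0
    for a in set(authors_u):
        for b in set(authors_v):
            if a != b:
                total += weights.get(tuple(sorted((a, b))), 0)
    return total

# Hypothetical toy data: paper id -> list of author names.
papers = {0: ["A", "B", "C"], 1: ["A", "B"], 2: ["C"]}
weights = weighted_coauthor_graph(papers)
strength = coauthor_strength(weights, papers[1], papers[2])
print(dict(weights), strength)
```

A score of this kind could serve as one column of the feature matrix, on the intuition that papers whose authors have collaborated before are more likely to cite each other.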

Embeddings

Files *_embedding.py/ipynb handle abstract and graph node embeddings.

Hyperparameter Optimization

Files *_optimization.ipynb find the best hyperparameters for a given model using the HyperOpt package.