
Science4Cast Competition

Mario Krenn, Michael Kopp, David Kreil, Rose Yu, Moritz Neun, Christian Eichenberger, Markus Spanring, Henry Martin, Dirk Geschke, Daniel Springer, Pedro Herruzo, Marvin McCutchan, Alina Mihai, Toma Furdui, Gabi Fratica, Miriam Vázquez, Aleksandra Gruca, Johannes Brandstetter, Sepp Hochreiter

An official competition within the 2021 IEEE BigData Cup Challenges.

  1. Introduction
  2. The Task
  3. Files and Datasets
  4. Prizes
  5. Submissions
  6. Competition Timeline
  7. Questions, Suggestions, Issues

1. Introduction

The corpus of scientific literature grows at an ever-increasing speed. Specifically, in the field of Artificial Intelligence (AI) and Machine Learning (ML), the number of papers published every month grows exponentially, with a doubling time of roughly 23 months.

[Figure: exponential growth of the number of AI papers]

Consequently, researchers have to specialize in narrow subdisciplines, making it challenging to uncover scientific connections beyond their own area of research. A tool that could predict and suggest meaningful, personalized research ideas that transcend personal focus bubbles would open new avenues of research that would otherwise remain untravelled.

Our competition directly addresses this challenge: We created an evolving semantic network characterizing the content and evolution of the scientific literature in AI since 1994. The network contains 64,000 nodes, each representing an AI concept. The competition's goal is to predict future states of the exponentially growing semantic network to create models capturing the evolution of scientific concepts in the field.

[Figure: growth rate of vertices and edges in the semantic network]

Moreover, the compiled dataset is unique and will be instrumental in pursuing a wide range of exciting questions in ML for the Science of Science -- a recently established research field at the intersection of computational sociology, network science, and big data science (see the review and book in the references below), sometimes called Metaknowledge research. These questions include end-to-end trained concept discovery, prediction of concept emergence, prediction of interdisciplinary interactions, and suggestion of personalized research ideas. Solutions to the current competition, together with this extensive dataset, will set us on the way to answering these vital questions.

Related tasks

Evolving knowledge networks like this one are a common research topic in the Science of Science, and related semantic networks have been built in other natural-science disciplines. Examples include biochemistry, where no machine learning has been applied so far, and quantum physics, where the semantic network is much smaller: our network has roughly 10 times more nodes and 50 times more edges, grows significantly faster, and thus provides a network an order of magnitude larger.

Useful References

  1. Santo Fortunato, Carl T. Bergstrom, Katy Börner, James A. Evans, Dirk Helbing, Staša Milojević, Alexander M. Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, Alessandro Vespignani, Ludo Waltman, Dashun Wang, Albert-László Barabási, Science of science, Science 359(6379), eaao0185 (2018).
  2. Dashun Wang, Albert-László Barabási, The Science of Science, Cambridge University Press (2021).
  3. James A. Evans, Jacob G. Foster, Metaknowledge, Science 331(6018), 721-725 (2011).
  4. Andrey Rzhetsky, Jacob G. Foster, Ian T. Foster, James A. Evans, Choosing experiments to accelerate collective discovery, PNAS 112(47), 14569-14574 (2015).
  5. Mario Krenn, Anton Zeilinger, Predicting research trends with semantic and neural networks with an application in quantum physics, PNAS 117(4), 1910-1916 (2020).

2. The Task

The main competition consists of predicting new links in the semantic network. We provide the semantic network from 1994 to 2017, discretized by days (each edge carries the publication date of the underlying paper).

Therefore, we provide approximately 8,400 snapshots of the growing semantic network -- one for each day from the beginning of 1994 to the end of 2017; participants are welcome to use coarser-grained snapshots. The evolution shows how links between the 64,000 nodes are drawn over time. The precise goal of the task is to predict which links that do not yet exist in 2017 will form between 2017 and 2020. Equivalently, the task asks which pairs of scientific concepts will be jointly investigated by scientists within the next three years.
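As a concrete starting point, here is a minimal loading sketch. It assumes the tuple layout used in the tutorial and the (vertex1, vertex2, day) edge format described there; check the tutorial for the exact conventions.

```python
import pickle
import numpy as np

# Assumption: the pickle unpacks into the tuple layout used in the
# tutorial, with the dynamic graph given as (v1, v2, day) triples;
# the day encodes the publication date of the underlying paper.
with open("CompetitionSet2017_3.pkl", "rb") as f:
    full_dynamic_graph_sparse, unconnected_vertex_pairs, \
        year_start, years_delta = pickle.load(f)

edges = np.asarray(full_dynamic_graph_sparse)

# Coarse-grained snapshots are allowed: keep only the edges created
# before a cutoff day to obtain the network state at that time.
def snapshot_before(edges, day_cutoff):
    """All edges created strictly before day_cutoff."""
    return edges[edges[:, 2] < day_cutoff]

print(f"Edges by the end of 2017: {len(edges)}")
```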

Technical Formulation of the Task

In the competition, you get:

  • full_dynamic_graph_sparse: a dynamic graph (list of edges and their creation date) until a time t1.
  • unconnected_vertex_pairs: a list of 1,000,000 vertex pairs that are unconnected by time t1.

Your task in the competition is to predict which edges in unconnected_vertex_pairs will have formed by a later time t2. Specifically, you sort the list of candidate edges from most likely to least likely. Submissions are scored by the area under the ROC curve (AUC); see the tutorial, and the ranking sketch below, for details.
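The sketch below, continuing from the loading snippet above, produces such a ranking. The degree-sum score is only an illustrative placeholder (a classic link-prediction heuristic), not the provided baseline model.

```python
import numpy as np

# Continuing from the loading sketch: score each candidate pair by
# the sum of its endpoints' degrees at t1 (placeholder heuristic).
edges = np.asarray(full_dynamic_graph_sparse)
pairs = np.asarray(unconnected_vertex_pairs)

num_vertices = int(max(edges[:, :2].max(), pairs.max())) + 1
degrees = np.bincount(edges[:, :2].ravel(), minlength=num_vertices)

scores = degrees[pairs[:, 0]] + degrees[pairs[:, 1]]

# The submission is the index order of unconnected_vertex_pairs,
# sorted from the most likely to the least likely new edge.
sorted_predictions = np.argsort(-scores)
```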

The Evaluation Metric

For the evaluation, we use a subset of the roughly 57,000 vertices that have nonzero degree at the end of 2017. We define the set K of vertex pairs that are not yet connected by an edge at the end of 2017 (in the extreme case, K would contain every such pair -- roughly 1.6 billion possible edges; in our case, K contains the 1,000,000 vertex pairs provided in unconnected_vertex_pairs). Every pair k in K will either be connected or remain unconnected by 2020, and the goal is to predict which.

For evaluating the model, we use the ROC curve, created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Our evaluation metric is the commonly used area under the ROC curve (AUC). One advantage of AUC over the mean squared error (MSE) is its insensitivity to the class distribution. In our case, where the two classes are highly imbalanced (only about 1-3% of the candidate pairs become newly connected) and the distribution changes over time, the AUC retains a meaningful, operational interpretation: it is the probability that a randomly chosen true element is ranked higher than a randomly chosen false one. Perfect predictions give AUC=1; random predictions give AUC=0.5.
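The snippet below illustrates the metric on made-up labels and scores, using scikit-learn's roc_auc_score alongside the pairwise interpretation just described.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example: labels[i] says whether candidate pair i actually forms
# an edge by 2020; scores[i] is the model's ranking score for it.
# (In the real data, positives are rare, around 1-3%.)
labels = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.1, 0.7, 0.3, 0.2, 0.1, 0.4, 0.0, 0.05])

print(roc_auc_score(labels, scores))  # 0.9375 here; 1.0 = perfect, 0.5 = random

# Equivalent pairwise reading: the fraction of (true, false) pairs in
# which the true element outranks the false one.
pos, neg = scores[labels == 1], scores[labels == 0]
print((pos[:, None] > neg[None, :]).mean())  # also 0.9375
```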


We provide a baseline that is trained on the mean squared error of the predictions and evaluated on AUC. Participants submit a sorted list of all elements of K, with the objective of maximizing the AUC. A simple solution sorts the elements by the model's predicted probability of edge formation. We note explicitly that other training schemes are allowed and appreciated, for instance, direct end-to-end training on the AUC metric, as sketched below.
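For the end-to-end route, one common surrogate (our suggestion, not part of the provided baseline) is a RankNet-style pairwise loss: since the AUC is the fraction of correctly ordered (positive, negative) score pairs, a smooth logistic relaxation of that count can be minimized directly.

```python
import numpy as np

def pairwise_auc_surrogate(scores_pos, scores_neg):
    """RankNet-style smooth stand-in for 1 - AUC.

    AUC counts the fraction of (positive, negative) pairs whose
    scores are correctly ordered; the logistic term below relaxes
    that 0/1 count into a differentiable loss.
    """
    margins = scores_pos[:, None] - scores_neg[None, :]  # all pairwise gaps
    return float(np.mean(np.logaddexp(0.0, -margins)))   # log(1 + e^{-margin})
```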

3. Files and Datasets

Source files: /Competition/

  • Evaluate_Model.py: Evaluating the models
  • SimpleModelFull.py: Baseline model

Detailed tutorial: /Tutorial/tutorial.ipynb

  • How to read and visualize data
  • How to run a baseline model
  • How to create predictions for validation and competition data

Data files at the IARAI website: Science4Cast_data.zip contains the following three files:

  • TrainSet2014_3.pkl: Semantic network until 2014, for predicting 2017
  • TrainSet2014_3_solution.pkl: which edges are connected in 2017
  • CompetitionSet2017_3.pkl: Semantic network until 2017, used for evaluation

Copy those data files directly into the directory containing the source files and the tutorial.
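A sketch of the intended workflow with these files: validate locally on the 2014 -> 2017 task before predicting 2017 -> 2020. We assume the training pickle shares the competition file's tuple layout and that the solution pickle holds one 0/1 label per candidate pair (both described in the tutorial).

```python
import pickle
import numpy as np
from sklearn.metrics import roc_auc_score

# Assumed tuple layout, per the tutorial.
with open("TrainSet2014_3.pkl", "rb") as f:
    train_graph, train_pairs, year_start, years_delta = pickle.load(f)
with open("TrainSet2014_3_solution.pkl", "rb") as f:
    train_labels = pickle.load(f)  # 1 if the pair is connected by 2017

my_scores = np.random.rand(len(train_pairs))  # stand-in for your model's scores
print(roc_auc_score(train_labels, my_scores))  # ~0.5 for random scores
```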

4. Prizes

The competition offers the following prizes for the top three participants/teams:

  • 1st Prize: 8,000 EUR
  • 2nd Prize: 6,000 EUR
  • 3rd Prize: 2,000 EUR

In addition, special prizes may be awarded to outstanding or creative solutions. We may also offer a fellowship position at the Institute of Advanced Research in Artificial Intelligence (IARAI), Vienna, Austria.

5. Submissions

Participants can upload their predictions for the test dataset (CompetitionSet2017_3.pkl) to the competition leaderboard at the IARAI website until the submission deadline. The file format is JSON; see the tutorial for details, and the sketch below for the general shape.
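A minimal sketch of writing the submission file; the exact JSON schema (field names, any required metadata) is defined in the tutorial, so the structure below is only a placeholder.

```python
import json

# sorted_predictions: indices into unconnected_vertex_pairs, most
# likely new edge first (see the ranking sketch in Section 2).
# Placeholder schema; follow the tutorial's exact format.
with open("submission.json", "w") as f:
    json.dump(sorted_predictions.tolist(), f)
```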

To be awarded a prize, participants must, besides their leaderboard submission (found at the IARAI website), submit working code, learned parameters, and a short scientific paper (4-6 pages) with a sufficiently detailed description of the approach, to be published in the IEEE BigData workshop. The scientific quality of the submitted paper will be verified by the competition committee.

After the competition, we plan to write a perspective/summary paper and invite all participants to contribute.

6. Competition Timeline

All times and dates are Anywhere on Earth (UTC -12).

  • Data Release: August 25, 2021
  • Competition ends (submission deadline): November 3, 2021
  • Abstract submission deadline: November 17, 2021
  • Announcement of the winners: December 2, 2021
  • IEEE BigData 2021: December 15-18, 2021

7. Questions, Suggestions, Issues

Please raise a GitHub issue if you have questions or problems, or send an email to Mario Krenn.