
ARQMath-eval

This repository contains code that you can use to evaluate your system runs from the ARQMath competitions.

Description

Tasks

This repository evaluates the performance of your information retrieval system on the following tasks:

  • task1 – Use this task to evaluate your ARQMath task 1 system, and
  • task2 – Use this task to evaluate your ARQMath task 2 system.

Subsets

Each task comes with three subsets:

  • train – The training set, which you can use for supervised training of your system.
  • validation – The validation set, which you can use to compare the performance of your system with different parameters. The validation set is used to compute the leaderboards in this repository.
  • test – The test set, which you currently should not use at all. It will be used at the end to compare the systems that performed best on the validation set.

The task1 and task2 tasks also come with the all subset, which contains all relevance judgements. Use it to evaluate a system that has not been trained on any of the task1 and task2 subsets.

The task1 and task2 tasks also come with a different subset split used by the MIRMU and MSM teams in the ARQMath-2 competition submissions. This split is also used in the pv211-utils library:

  • train-pv211-utils – The training set, which you can use for supervised training of your system.
  • validation-pv211-utils – The validation set, which you can use for hyperparameter optimization or model selection.

The training set is further split into the smaller-train-pv211-utils and smaller-validation subsets in case you need two validation sets, e.g. one for hyperparameter optimization and one for model selection. If you use neither hyperparameter optimization nor model selection, you can use the bigger-train-pv211-utils subset, which combines the train-pv211-utils and validation-pv211-utils subsets.

  • test-pv211-utils – The test set, which you currently should only use for the final performance estimation of your system.

Examples

Using the train subset to train your supervised system

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.22
$ python
>>> from arqmath_eval import get_topics, get_judged_documents, get_ndcg
>>>
>>> task = 'task1'
>>> subset = 'train'
>>> results = {}
>>> for topic in get_topics(task=task, subset=subset):
...     results[topic] = {}
...     for document in get_judged_documents(task=task, subset=subset, topic=topic):
...         similarity_score = compute_similarity_score(topic, document)  # your own scoring function
...         results[topic][document] = similarity_score
...
>>> get_ndcg(results, task=task, subset=subset, topn=1000)
0.5876
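The get_ndcg call above scores your results with the NDCG (normalized discounted cumulative gain) measure. As a rough illustration of what the measure does (a standard NDCG sketch in plain Python, not the library's actual implementation, which reports the NDCG' variant computed over judged documents only), the relevance of each retrieved document is discounted by the logarithm of its rank and normalized by an ideal ranking:

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance judgements."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ranking scores 1.0; any worse ordering scores less.
print(ndcg([3, 2, 1, 0]))        # 1.0
print(ndcg([0, 1, 2, 3]) < 1.0)  # True
```

Because of the normalization, NDCG always falls between 0 and 1, which makes scores comparable across topics with different numbers of relevant documents.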

Using the validation subset to compare various parameters of your system

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.22
$ python
>>> from arqmath_eval import get_topics, get_judged_documents
>>>
>>> task = 'task1'
>>> subset = 'validation'
>>> results = {}
>>> for topic in get_topics(task=task, subset=subset):
...     results[topic] = {}
...     for document in get_judged_documents(task=task, subset=subset, topic=topic):
...         similarity_score = compute_similarity_score(topic, document)
...         results[topic][document] = similarity_score
...
>>> user = 'xnovot32'
>>> description = 'parameter1=value_parameter2=value'
>>> filename = '{}/{}/{}.tsv'.format(task, user, description)
>>> with open(filename, 'wt') as f:
...     for topic, documents in results.items():
...         top_documents = sorted(documents.items(), key=lambda x: x[1], reverse=True)[:1000]
...         for rank, (document, score) in enumerate(top_documents):
...             line = '{}\txxx\t{}\t{}\t{}\txxx'.format(topic, document, rank + 1, score)
...             print(line, file=f)
$ git add task1/xnovot32/parameter1=value_parameter2=value.tsv  # track your new result with Git
$ python -m arqmath_eval.evaluate          # run the evaluation
$ git add -u                               # add the updated leaderboard to Git
$ git push                                 # publish your new result and the updated leaderboard
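Each line of the result file above has six tab-separated fields: topic identifier, a constant placeholder, document identifier, rank, score, and a run identifier (the example writes literal xxx placeholders for the two constant fields). As a minimal sketch of reading such a file back (the field names are my reading of the format above, not an official specification):

```python
def parse_result_line(line):
    """Split one tab-separated result line into a dict of named fields."""
    topic, _constant, document, rank, score, run_id = line.rstrip('\n').split('\t')
    return {
        'topic': topic,
        'document': document,
        'rank': int(rank),
        'score': float(score),
        'run_id': run_id,
    }

record = parse_result_line('A.1\txxx\tB.2\t1\t0.95\txxx')
print(record['document'], record['rank'])  # B.2 1
```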

Using the all subset to compute the NDCG' score of an ARQMath submission

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.22
$ python -m arqmath_eval.evaluate MIRMU-task1-Ensemble-auto-both-A.tsv all 2020
0.238, 95% CI: [0.198; 0.278]
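The reported interval summarizes the uncertainty of the mean score across topics. As an illustration of one common way to obtain such an interval (a percentile bootstrap over per-topic scores; I have not verified that the library computes its interval exactly this way), consider:

```python
import random

def bootstrap_ci(scores, num_samples=10000, confidence=0.95, seed=42):
    """Percentile bootstrap confidence interval for the mean of per-topic scores."""
    rng = random.Random(seed)
    # Resample topics with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(num_samples)
    )
    lower_index = int((1 - confidence) / 2 * num_samples)
    upper_index = int((1 + confidence) / 2 * num_samples) - 1
    return means[lower_index], means[upper_index]

# Hypothetical per-topic NDCG' scores, for illustration only.
per_topic_ndcg = [0.1, 0.3, 0.2, 0.4, 0.25, 0.15, 0.35, 0.3]
lower, upper = bootstrap_ci(per_topic_ndcg)
print(lower, upper)
```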

Citing ARQMath-eval

Text

NOVOTNÝ, Vít, Petr SOJKA, Michal ŠTEFÁNIK and Dávid LUPTÁK. Three is Better than One: Ensembling Math Information Retrieval Systems. CEUR Workshop Proceedings. Thessaloniki, Greece: M. Jeusfeld c/o Redaktion Sun SITE, Informatik V, RWTH Aachen., 2020, vol. 2020, No 2696, p. 1-30. ISSN 1613-0073.

BibTeX

@inproceedings{mir:mirmuARQMath2020,
  title = {{Three is Better than One}},
  author = {V\'{i}t Novotn\'{y} and Petr Sojka and Michal \v{S}tef\'{a}nik and D\'{a}vid Lupt\'{a}k},
  booktitle = {CEUR Workshop Proceedings: ARQMath task at CLEF conference},
  publisher = {CEUR-WS},
  address = {Thessaloniki, Greece},
  date = {22--25 September, 2020},
  year = 2020,
  volume = 2696,
  pages = {1--30},
  url = {http://ceur-ws.org/Vol-2696/paper_235.pdf},
}