This repo contains data and code to solve SemEval-2021 Task 11: NLP Contribution Graph.
For a detailed description of our method, please see the paper "UIUC_BioNLP at SemEval-2021 Task 11: A Cascade of Neural Models for Structuring Scholarly NLP Contributions".
This repo requires:

- `simpletransformers/` - the customized Simple Transformers package
  - Includes a customized model for subtask 1 that incorporates additional features
  - Extended from Simple Transformers version 0.51.10; compatible with common usage
  - First install the stock package with `pip install simpletransformers==0.51.10`, then find the installation directory and replace its `simpletransformers` folder with this folder
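The install-and-replace steps above can be sketched as a shell session; a minimal sketch assuming a standard pip environment (the site-packages path varies per machine):

```shell
# Install the stock package at the matching version
pip install simpletransformers==0.51.10

# Locate the installed package directory
SITE_DIR=$(python -c "import simpletransformers, os; print(os.path.dirname(simpletransformers.__file__))")

# Swap in the customized copy shipped in this repo
rm -rf "$SITE_DIR"
cp -r simpletransformers "$SITE_DIR"
```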
- `training_data/` - the training data merged with the trial data, with full annotation
  - `interim/` - intermediate data files converted from the training data
    - `all_sent.csv` - all the sentences, each with its section header, positional features, paper topic and index, BIO tags, etc.
    - `pos_sent.csv` - the subset of `all_sent.csv` consisting of all the positive sentences
    - `triples.csv` - each positive sentence with the predicates and terms in it, and the corresponding triples of the different types
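To illustrate how the two sentence files relate, here is a minimal sketch with toy rows; the column names below are assumptions and may differ from the actual CSV headers:

```python
# Toy rows mimicking the schema of interim/all_sent.csv; the real column
# names in the repo may differ (these are illustrative assumptions).
all_sent = [
    {"topic": "natural_language_inference", "paper_index": 0,
     "section": "abstract", "position": 0.0,
     "text": "We propose a new model ...", "label": 1},
    {"topic": "natural_language_inference", "paper_index": 0,
     "section": "related work", "position": 0.4,
     "text": "Prior studies focused on ...", "label": 0},
]

# pos_sent.csv is simply the positive (contribution-sentence) subset
pos_sent = [row for row in all_sent if row["label"] == 1]
print(len(pos_sent))  # 1
```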
- `test_data/` - the test data, with sentence and phrase annotation released
- `pre.py` - preprocesses the training data, reports potential errors, and produces `all_sent.csv` and `pos_sent.csv`
- `ext.py` - preprocesses the training data and produces `triples.csv`
- `train_sent/` - scripts for subtask 1 (all scripts in this folder require the customized Simple Transformers package)
  - A binary classifier is trained for subtask 1: contribution sentence classification
  - A multi-class classifier is trained to classify sentences into information units
  - A filename ending in `_ens` indicates that submodels are trained for ensembling
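As one way the `_ens` submodels could be combined at prediction time, here is a minimal majority-vote sketch (an assumption for illustration; the repo's ensembling scripts may instead average probabilities):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists by majority vote.

    predictions: list of label lists, one list per submodel,
    aligned so that index i refers to the same sentence in each.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Three submodels voting on three sentences
print(majority_vote([[1, 0, 1], [1, 1, 0], [1, 0, 0]]))  # [1, 0, 0]
```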
- `train_ner/` - scripts for subtask 2: key phrase extraction
  - In the 'specific_bio' scheme, we use type-specific BIO tags to indicate phrase types and train an NER model directly
  - In the 'simple_bio' scheme, we first identify the phrases and then classify them into predicates and terms. A script for ensembling the models is also provided
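The difference between the two tagging schemes can be sketched as follows; the exact tag names are assumptions for illustration:

```python
# 'specific_bio': the BIO tag also encodes the phrase type
# (illustrative tag names; the repo's actual labels may differ).
specific = ["B-predicate", "I-predicate", "O", "B-term", "I-term"]

def to_simple_bio(tags):
    """Collapse typed tags to the 'simple_bio' scheme: spans only, no types.

    Under this scheme a second classifier then labels each extracted
    span as a predicate or a term.
    """
    return [t.split("-")[0] for t in tags]

print(to_simple_bio(specific))  # ['B', 'I', 'O', 'B', 'I']
```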
- `train_rel/` - scripts for subtask 3: triple extraction
  - Four models are trained to extract triples of types A, B, C and D respectively
  - For type A triples, two schemes are implemented: pairwise classification and direct triple classification; only the latter was used in the evaluation phases
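A hypothetical sketch of how direct triple classification can enumerate its candidates: every ordered pair of terms, combined with each predicate in the sentence, forms one (subject, predicate, object) instance for a classifier to accept or reject. The example phrases below are invented for illustration:

```python
from itertools import permutations

# Invented example phrases from one positive sentence
terms = ["BERT encoder", "contribution sentences"]
predicates = ["classifies"]

# Each ordered (subject, object) pair of terms, combined with each
# predicate, yields one candidate triple for the classifier to score.
candidates = [(s, p, o) for p in predicates for s, o in permutations(terms, 2)]
print(candidates)
```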
- `predict1/` - scripts for Evaluation Phase 1 (end-to-end evaluation). Run the scripts in this order:
  1. `pre.py` - test data preprocessing
  2. `sent_binary.py` - contribution sentence classification
  3. `sent_multi.py` - information unit classification
  4. `ner.py` - phrase extraction; the 'specific_bio' scheme was used in this phase
  5. `predict_triples.py` - extraction of type A, B, C and D triples, using the different models
  6. `submit.ipynb` - output formatting for submission
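Assuming the scripts are invoked from inside `predict1/`, the pipeline order above looks like:

```shell
python pre.py              # test data preprocessing
python sent_binary.py      # contribution sentence classification
python sent_multi.py       # information unit classification
python ner.py              # phrase extraction ('specific_bio' scheme)
python predict_triples.py  # type A, B, C and D triple extraction
# finally, open submit.ipynb to format the output for submission
```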
- `predict2/` - scripts for Evaluation Phase 2 Part 1: given the contribution sentence labels, predict the rest
  - The naming of the scripts basically follows that in `predict1/`
  - A filename ending in `-ens` indicates that an ensemble of submodels is used for prediction
  - In this phase and later, we used the 'simple_bio' scheme for phrase extraction
- `predict3/` - scripts for Evaluation Phase 2 Part 2: given the labels of contribution sentences and phrases, predict the rest
  - We copied the result of information unit classification from `predict2/`; thus, after running `pre.py`, we started directly from phrase classification
See also:

- Task description paper
- Official website of the task
- Training data and trial data release