This repo contains data and code to solve SemEval-2021 Task 11: NLP Contribution Graph.
For a detailed description of our method, please see the paper "UIUC_BioNLP at SemEval-2021 Task 11: A Cascade of Neural Models for Structuring Scholarly NLP Contributions".
This repo requires:

- `simpletransformers/` - the customized Simple Transformers package
  - Includes a customized model for subtask 1 that incorporates additional features
  - Extended from Simple Transformers version 0.51.10; compatible with common usage
  - First install the stock package with `pip install simpletransformers==0.51.10`, then find the installation directory and replace its `simpletransformers` folder with this folder
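The install-and-replace steps above can be sketched as a shell session; a minimal sketch assuming a standard pip environment (the site-packages path varies per machine):

```shell
# Install the stock package at the matching version
pip install simpletransformers==0.51.10

# Locate the installed package directory
SITE_DIR=$(python -c "import simpletransformers, os; print(os.path.dirname(simpletransformers.__file__))")

# Swap in the customized copy shipped in this repo
rm -rf "$SITE_DIR"
cp -r simpletransformers "$SITE_DIR"
```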
- `training_data/` - the training data merged with the trial data, with full annotation
  - `interim/` - intermediate data files converted from the training data
    - `all_sent.csv` - all the sentences, each with its section header, positional features, paper topic and index, BIO tags, etc.
    - `pos_sent.csv` - the subset of `all_sent.csv` consisting of all the positive sentences
    - `triples.csv` - each positive sentence with the predicates and terms in it, and the corresponding triples of the different types
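To illustrate how the two sentence files relate, here is a minimal sketch with toy rows; the column names below are assumptions and may differ from the actual CSV headers:

```python
# Toy rows mimicking the schema of interim/all_sent.csv; the real column
# names in the repo may differ (these are illustrative assumptions).
all_sent = [
    {"topic": "natural_language_inference", "paper_index": 0,
     "section": "abstract", "position": 0.0,
     "text": "We propose a new model ...", "label": 1},
    {"topic": "natural_language_inference", "paper_index": 0,
     "section": "related work", "position": 0.4,
     "text": "Prior studies focused on ...", "label": 0},
]

# pos_sent.csv is simply the positive (contribution-sentence) subset
pos_sent = [row for row in all_sent if row["label"] == 1]
print(len(pos_sent))  # 1
```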
- `test_data/` - the test data, with sentence and phrase annotation released
- `pre.py` - preprocesses the training data, reports potential errors, and produces `all_sent.csv` and `pos_sent.csv`
- `ext.py` - preprocesses the training data and produces `triples.csv`
- `train_sent/` - scripts for subtask 1 (all scripts in this folder require the customized Simple Transformers package)
  - A binary classifier is trained for subtask 1: contribution sentence classification
  - A multi-class classifier is trained to classify sentences into information units
  - A filename ending in `_ens` indicates that submodels are trained for ensembling
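As one way the `_ens` submodels could be combined at prediction time, here is a minimal majority-vote sketch (an assumption for illustration; the repo's ensembling scripts may instead average probabilities):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists by majority vote.

    predictions: list of label lists, one list per submodel,
    aligned so that index i refers to the same sentence in each.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Three submodels voting on three sentences
print(majority_vote([[1, 0, 1], [1, 1, 0], [1, 0, 0]]))  # [1, 0, 0]
```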
- `train_ner/` - scripts for subtask 2: key phrase extraction
  - In the 'specific_bio' scheme, we use type-specific BIO tags to indicate phrase types and train an NER model directly
  - In the 'simple_bio' scheme, we first identify the phrases and then classify them into predicates and terms. A script for ensembling the models is also provided
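The difference between the two tagging schemes can be sketched as follows; the exact tag names are assumptions for illustration:

```python
# 'specific_bio': the BIO tag also encodes the phrase type
# (illustrative tag names; the repo's actual labels may differ).
specific = ["B-predicate", "I-predicate", "O", "B-term", "I-term"]

def to_simple_bio(tags):
    """Collapse typed tags to the 'simple_bio' scheme: spans only, no types.

    Under this scheme a second classifier then labels each extracted
    span as a predicate or a term.
    """
    return [t.split("-")[0] for t in tags]

print(to_simple_bio(specific))  # ['B', 'I', 'O', 'B', 'I']
```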
- `train_rel/` - scripts for subtask 3: triple extraction
  - Four models are trained to extract triples of types A, B, C and D respectively
  - For type A triples, two schemes are implemented: pairwise classification and direct triple classification; only the latter was used in the evaluation phases
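A hypothetical sketch of how direct triple classification can enumerate its candidates: every ordered pair of terms, combined with each predicate in the sentence, forms one (subject, predicate, object) instance for a classifier to accept or reject. The example phrases below are invented for illustration:

```python
from itertools import permutations

# Invented example phrases from one positive sentence
terms = ["BERT encoder", "contribution sentences"]
predicates = ["classifies"]

# Each ordered (subject, object) pair of terms, combined with each
# predicate, yields one candidate triple for the classifier to score.
candidates = [(s, p, o) for p in predicates for s, o in permutations(terms, 2)]
print(candidates)
```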
- `predict1/` - scripts for Evaluation Phase 1 (end-to-end evaluation). Run the scripts in this order:
  1. `pre.py` - test data preprocessing
  2. `sent_binary.py` - contribution sentence classification
  3. `sent_multi.py` - information unit classification
  4. `ner.py` - phrase extraction; the 'specific_bio' scheme was used in this phase
  5. `predict_triples.py` - extraction of type A, B, C and D triples, using the different models
  6. `submit.ipynb` - output formatting for submission
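Assuming the scripts are invoked from inside `predict1/`, the pipeline order above looks like:

```shell
python pre.py              # test data preprocessing
python sent_binary.py      # contribution sentence classification
python sent_multi.py       # information unit classification
python ner.py              # phrase extraction ('specific_bio' scheme)
python predict_triples.py  # type A, B, C and D triple extraction
# finally, open submit.ipynb to format the output for submission
```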
- `predict2/` - scripts for Evaluation Phase 2 Part 1: given the contribution sentence labels, predict the rest
  - The naming of the scripts basically follows that in `predict1/`
  - A filename ending in `-ens` indicates that an ensemble of submodels is used for prediction
  - In this phase and later, we used the 'simple_bio' scheme for phrase extraction
- `predict3/` - scripts for Evaluation Phase 2 Part 2: given the labels of contribution sentences and phrases, predict the rest
  - We copied the result of information unit classification from `predict2/`; thus, after running `pre.py`, we started directly from phrase classification
See also:

- Task description paper
- Official website of the task
- Training data and trial data release