By Kan Yip Keng, Lin Mei An, Yang Zi Yun, Yew Kai Zhe
The SemEval-2021 Shared Task NLP CONTRIBUTION GRAPH (a.k.a. ‘the NCG task’) tasks participants to develop automated systems that structure contributions from NLP scholarly articles in English.
- **Subtask 1.** Input: a research article in plaintext format. Output: a set of contributing sentences.
- **Subtask 2.** Input: a contributing sentence. Output: a set of scientific knowledge terms and predicate phrases.
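As a rough illustration of the two subtasks' input/output shapes (the sentence and phrases below are invented for clarity and are not taken from the dataset):

```python
# Purely illustrative example of the subtask input/output shapes (hypothetical data).
subtask1_input = "full plaintext of a research article"
subtask1_output = [
    "We propose a novel attention mechanism for relation extraction.",  # a contributing sentence
]

subtask2_input = subtask1_output[0]
subtask2_output = {
    "terms": ["attention mechanism", "relation extraction"],  # scientific knowledge terms
    "predicates": ["propose"],                                 # predicate phrases
}
```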
Additionally, install the other dependencies listed in `requirements.txt` using conda or pip.
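For example, from the repository root: `pip install -r requirements.txt`, or `conda install --file requirements.txt` if you manage packages with conda (assuming the listed packages are available from your configured index or channels).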
- Make sure you have installed all the dependencies mentioned above
- Clone or download this repository, then open a terminal and navigate to this folder
- Train, test & evaluate the model by running `python3 main.py {config}`, which will:
  - Select the `config`-th configuration from `config.py`. Enter `python3 main.py 0` to see all available configurations.
  - Load all data from the `data/` folder
  - Randomly split the dataset into training & testing sets
  - Train the model using the training set and store the model in the `model` file
  - Test the model against the testing set
The following optional flags can be combined in a single command line:
- Use `-d [data_dir/]` to specify which dataset folder to use; default: `data/`
- Use `-m [model_name]` to specify the model filename to generate; default: `model/`
- Use `-s [summary_name]` to specify the summary name; default will be auto-generated by `wandb`
- Use `--summary` to enable summary mode
- Use `--train` to train the model only
- Use `--test` to test the model only
Example commands:

```
python3 main.py 1
python3 main.py 1 -d data-small/ --summary -s scibert
python3 main.py 2 --train
```
- `main.py` - main runner file of the project
- `dataset.py` - loads, pre-processes data and implements the `NcgDataset` class
- `model.py` - loads, saves the model and implements the `NcgModel` class
- `config.py` - defines all hyperparameters
- `subtask1/` - implements the dataset, models and helper functions for subtask 1
- `subtask2/` - implements the dataset, models and helper functions for subtask 2
- `documentation/` - written reports
- `data/` - contains 38 task folders
- `data-small/` - a subset of `data/`, contains 5 task folders
- `data-one/` - a subset of `data/`, contains 1 task folder
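The sketch below is a rough illustration of how these files might fit together in a training run; the names and signatures (`configs`, `random_split`, `train`, `save`, `test`) are hypothetical and may not match the actual code.

```python
# Hypothetical pipeline sketch -- names and signatures are illustrative only.
from config import configs      # assumed: collection of hyperparameter configurations (config.py)
from dataset import NcgDataset  # dataset loading/pre-processing class (dataset.py)
from model import NcgModel      # model loading/saving class (model.py)

def run(config_idx: int, data_dir: str = "data/", model_name: str = "model"):
    config = configs[config_idx]                  # select the config_idx-th configuration
    dataset = NcgDataset(config, data_dir)        # load and pre-process all data in data_dir
    train_set, test_set = dataset.random_split()  # hypothetical train/test split helper
    model = NcgModel(config)
    model.train(train_set)                        # train on the training split
    model.save(model_name)                        # store the trained model
    model.test(test_set)                          # evaluate on the held-out split
```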
The data folders are organized as follows:
```
[task-name-folder]/ # natural_language_inference, paraphrase_generation, question_answering, relation_extraction, etc
├── [article-counter-folder]/ # ranges between 0 to 100 since we annotated varying numbers of articles per task
│ ├── [article-name].pdf # scholarly article pdf
│ ├── [article-name]-Grobid-out.txt # plaintext output from the [Grobid parser](https://github.com/kermitt2/grobid)
│ ├── [article-name]-Stanza-out.txt # plaintext preprocessed output from [Stanza](https://github.com/stanfordnlp/stanza)
│ ├── sentences.txt # annotated Contribution sentences in the file
│ ├── entities.txt # annotated entities in the Contribution sentences
│ └── info-units/ # the folder containing information units in JSON format
│ │ └── research-problem.json # `research problem` mandatory information unit in json format
│ │ └── model.json # `model` information unit in json format; in some articles it is called `approach`
│ │ └── ... # there are 12 information units in all and each article may be annotated by 3 or 6
│ └── triples/ # the folder containing information unit triples one per line
│ │ └── research-problem.txt # `research problem` triples (one research problem statement per line)
│ │ └── model.txt # `model` triples (one statement per line)
│ │ └── ... # there are 12 information units in all and each article may be annotated by 3 or 6
│ └── ... # there are K articles annotated for each task, so this repeats for the remaining K-1 annotated articles
└── ... # if there are N task folders overall, then this repeats N-1 more times
```
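A minimal sketch of walking this layout with the Python standard library (assuming the folder structure above; only `sentences.txt` is read, as plain text):

```python
import os

def iter_annotated_articles(data_dir="data/"):
    """Yield (task, article_id, sentences) for every annotated article folder."""
    for task in sorted(os.listdir(data_dir)):               # e.g. natural_language_inference
        task_path = os.path.join(data_dir, task)
        if not os.path.isdir(task_path):
            continue
        for article in sorted(os.listdir(task_path)):        # article-counter folders: 0, 1, 2, ...
            sent_file = os.path.join(task_path, article, "sentences.txt")
            if os.path.isfile(sent_file):
                with open(sent_file, encoding="utf-8") as f:
                    sentences = [line.rstrip("\n") for line in f]  # annotated Contribution sentences
                yield task, article, sentences

# Example: list the annotated articles in the small subset
if __name__ == "__main__":
    for task, article, sentences in iter_annotated_articles("data-small/"):
        print(task, article, len(sentences))
```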