Relation extraction (RE) identifies semantic relations between entity pairs in a text. The relation is defined between an entity pair consisting of a subject entity (e_subj) and an object entity (e_obj). For example, in the sentence 'Kierkegaard was born to an affluent family in Copenhagen', the subject entity is Kierkegaard and the object entity is Copenhagen. The goal is then to pick an appropriate relation between these two entities: place_of_birth. In order to evaluate whether a model correctly understands the relationships between entities, we include KLUE-RE in our benchmark. Since there is no large-scale RE benchmark publicly available in Korean, we collect and annotate our own dataset.
We formulate RE as a single-sentence classification task. A model picks one of the predefined relation types describing the relation between the two entities within a given sentence. In other words, an RE model predicts an appropriate relation r of an entity pair (e_subj, e_obj) in a sentence s, where e_subj is the subject entity and e_obj is the object entity. We refer to (e_subj, r, e_obj) as a relation triplet. The entities are marked as corresponding spans in each sentence s. There are 30 relation classes that consist of 18 person-related relations, 11 organization-related relations, and no_relation. We evaluate a model using micro-F1 score, computed after excluding no_relation, and area under the precision-recall curve (AUPRC) including all 30 classes.
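As an illustration of the span marking described above, here is a hypothetical sketch of wrapping the subject and object spans with marker tokens. The marker strings and exact input format are assumptions for illustration; the actual preprocessing in this repository may differ.

```python
def mark_entities(sentence, subj_span, obj_span):
    """Wrap non-overlapping subject/object character spans with marker tokens."""
    (s0, s1), (o0, o1) = subj_span, obj_span
    # Insert markers from right to left so earlier character offsets stay valid.
    spans = sorted(
        [(s0, s1, "[SUBJ]", "[/SUBJ]"), (o0, o1, "[OBJ]", "[/OBJ]")],
        key=lambda x: x[0],
        reverse=True,
    )
    for start, end, open_tok, close_tok in spans:
        sentence = (
            sentence[:start] + open_tok + sentence[start:end] + close_tok + sentence[end:]
        )
    return sentence

sent = "Kierkegaard was born to an affluent family in Copenhagen"
print(mark_entities(sent, (0, 11), (46, 56)))
# [SUBJ]Kierkegaard[/SUBJ] was born to an affluent family in [OBJ]Copenhagen[/OBJ]
```

The marked sentence is then fed to the classifier as a single sequence, so the model can locate both entities from the marker tokens alone.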
The evaluation metrics for KLUE-RE are 1) micro F1 score on cases where a relation exists, and 2) area under the precision-recall curve (AUPRC) on all classes.
Micro F1 score is the harmonic mean of micro-precision and micro-recall. It measures the F1 score of the aggregated contributions of all classes. It gives each sample the same importance, thus naturally weighting the majority class more. We exclude the dominant no_relation class for this metric so that the model is not rewarded merely for predicting the negative class well.
AUPRC is the averaged area under the precision-recall curves (recall on the x-axis, precision on the y-axis) of all relation classes. It is a useful metric for this imbalanced data setting, where rare positive examples are important.
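Assuming labels are encoded as integers with no_relation at index 0 (the actual mapping lives in dict_label_to_num.pkl and may differ), the two metrics can be sketched with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

def klue_re_micro_f1(y_true, y_pred, no_relation=0, n_classes=30):
    """Micro F1 over all classes except the dominant no_relation class."""
    labels = [c for c in range(n_classes) if c != no_relation]
    return f1_score(y_true, y_pred, average="micro", labels=labels)

def klue_re_auprc(y_true, probs, n_classes=30):
    """One-vs-rest AUPRC averaged over all classes, no_relation included.

    probs: array of shape (n_samples, n_classes) with per-class probabilities.
    """
    y_onehot = np.eye(n_classes)[np.asarray(y_true)]
    scores = [
        average_precision_score(y_onehot[:, c], probs[:, c])
        for c in range(n_classes)
    ]
    return float(np.mean(scores))
```

Restricting `labels` in `f1_score` drops no_relation from both the precision and recall aggregation, which is what keeps a trivial "always predict no_relation" model from scoring well.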
- Ubuntu 18.04
- python==3.8
- pandas==1.1.5
- scikit-learn~=0.24.1
- transformers==4.10.0
The following hardware was used to create the original solution.
- GPU (CUDA): NVIDIA V100
To reproduce my submission without retraining, follow these steps:
- Installation
- Dataset Preparation
- Prepare Datasets
- Download Baseline Codes
- Train models
- Inference & make submission
- Ensemble
- Wandb graphs
All requirements are listed in requirements.txt. Using Anaconda is strongly recommended.
$ pip install -r requirements.txt
All CSV files are already in data directory.
After downloading and converting the datasets and baseline codes, the directory is structured as:
├── code
│   ├── __pycache__
│   │   └── load_data.cpython-38.pyc
│   ├── wandb_imgaes
│   │   ├── eval.png
│   │   ├── eval2.png
│   │   ├── train.png
│   │   ├── train2.png
│   │   ├── system.png
│   │   ├── system2.png
│   │   └── system3.png
│   ├── best_model
│   ├── ensemble_csv
│   ├── dict_label_to_num.pkl
│   ├── dict_num_to_label.pkl
│   ├── inference.py
│   ├── load_data.py
│   ├── bertmodel.py
│   ├── logs
│   ├── prediction
│   │   └── sample_submission.csv
│   ├── requirements.txt
│   ├── results
│   └── train.py
└── dataset
    ├── test
    │   └── test_data.csv
    └── train
        └── train.csv
To download and extract the baseline codes, run the following commands. The baseline codes will be located in /opt/ml/code.
$ wget https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000075/data/code.tar.gz
$ tar -xzf code.tar.gz
To download and extract the dataset, run the following commands. The dataset will be located in /opt/ml/dataset.
$ wget https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000075/data/dataset.tar.gz
$ tar -xzf dataset.tar.gz
To train the models, run the following command.
$ python train.py
The expected training times are:
| Model | GPUs | Batch Size | Training Epochs | Training Time |
|---|---|---|---|---|
| KoELECTRA | V100 | 16 | 4 | 1h 51m 29s |
| XLM-RoBERTa-large | V100 | 27 | 4 | 2h 26m 52s |
| LSTM-RoBERTa-large | V100 | 32 | 5 | 2h 25m 14s |
| RoBERTa-large | V100 | 32 | 5 | 2h 5m 23s |
To run inference and create a submission file, run the following command.
$ python inference.py
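The inference step presumably uses dict_num_to_label.pkl to map predicted class indices back to relation names before writing the submission CSV. A minimal sketch of that conversion, with a toy two-entry dictionary standing in for the real pickled mapping (which has 30 entries):

```python
# Toy stand-in for the dictionary unpickled from dict_num_to_label.pkl.
# The real key/value contents are an assumption here.
num_to_label = {0: "no_relation", 1: "per:place_of_birth"}

def decode_predictions(pred_ids, mapping):
    """Convert numeric class predictions to relation-name strings."""
    return [mapping[int(i)] for i in pred_ids]

print(decode_predictions([1, 0], num_to_label))
# ['per:place_of_birth', 'no_relation']
```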
To ensemble the model predictions, run the following command.
$ python ensemble.py --path='./ensemble_csv'
- Eval Graphs
- Train Graphs
- System Graphs