/OpenNRE

An Open-Source Package for Neural Relation Extraction (NRE) implemented in TensorFlow

Primary LanguagePythonMIT LicenseMIT

OpenNRE

*** UPDATE ***

We are now developing a new version of OpenNRE including sentence-level, bag-level(distant supervision) and few-shot RE on branch nrekit. Refer to the nrekit branch if you are interested.

An open-source framework for neural relation extraction.

Contributed by Tianyu Gao, Xu Han, Shulian Cao, Lumin Tang, Yankai Lin, Zhiyuan Liu

If you want to learn more about neural relation extraction, visit another project of ours NREPapers.

BIG UPDATE: The project has been completely reconstructed and is faster, more extendable and the codes are easier to read and use now. If you need get to the old version, please refer to branch old_version.

New features:

  • JSON data support.
  • Multi GPU training.
  • Validating while training.

Overview

It is a TensorFlow-based framwork for easily building relation extraction (RE) models. We divide the pipeline of relation extraction into four parts, which are embedding, encoder, selector (for distant supervision) and classifier. For each part we have implemented several methods.

  • Embedding
    • Word embedding
    • Position embedding
  • Encoder
    • PCNN
    • CNN
    • RNN
    • Bidirection RNN
  • Selector
    • Attention
    • Maximum
    • Average
  • Classifier
    • Softmax Loss Function
    • Output

All those methods could be combined freely.

We also provide training and testing framework for sentence-level RE and bag-level RE. A plotting tool is also in the package.

This project is under MIT license.

Requirements

pip install -r requirements.txt

  • Python (>=2.7)
  • Numpy (>=1.13.3)
  • TensorFlow (>=1.4.1)
    • CUDA (>=8.0) if you are using gpu
  • Matplotlib (>=2.0.0)
  • scikit-learn (>=0.18)

Data Format

For training and testing, you should provide four JSON files including training data, testing data, word embedding data and relation-ID mapping data.

Training Data & Testing Data

Training data file and testing data file, containing sentences and their corresponding entity pairs and relations, should be in the following format

[
    {
        'sentence': 'Bill Gates is the founder of Microsoft .',
        'head': {'word': 'Bill Gates', 'id': 'm.03_3d', ...(other information)},
        'tail': {'word': 'Microsoft', 'id': 'm.07dfk', ...(other information)},
        'relation': 'founder'
    },
    ...
]

IMPORTANT: In the sentence part, words and punctuations should be separated by blank spaces.

Word Embedding Data

Word embedding data is used to initialize word embedding in the networks, and should be in the following format

[
    {'word': 'the', 'vec': [0.418, 0.24968, ...]},
    {'word': ',', 'vec': [0.013441, 0.23682, ...]},
    ...
]

Relation-ID Mapping Data

This file indicates corresponding IDs for relations to make sure during each training and testing period, the same ID means the same relation. Its format is as follows

{
    'NA': 0,
    'relation_1': 1,
    'relation_2': 2,
    ...
}

IMPORTANT: Make sure the ID of NA is always 0.

Provided Data

NYT10 Dataset

NYT10 is a distantly supervised dataset originally released by the paper "Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text.". Here is the download link for the original data.

We've provided a toolkit to convert the original NYT10 data into JSON format that OpenNRE could use. You could download the original data + toolkit from Google Drive or Tsinghua Cloud. Further instructions are included in the toolkit.

Installation and Quick Start

  1. Install all the required package.

  2. Clone the OpenNRE repository:

git clone https://github.com/thunlp/OpenNRE.git

Since there are too many history commits of this project and the .git folder is too large, you could use the following command to download only the latest commit:

git clone https://github.com/thunlp/OpenNRE.git --depth 1
  1. Make data folder in the following structure
OpenNRE
|-- ... 
|-- data
    |
    |-- {DATASET_NAME_1}
    |   |
    |   |-- train.json
    |   |-- test.json
    |   |-- word_vec.json
    |   |-- rel2id.json
    |
    |-- {DATASET_NAME_2}
    |   |
    |   |-- ...
    |
    |-- ...

You could use your own data or download datasets provided above.

  1. Run train_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME}. For example, if you want to train model with PCNN as the encoder and attention as the selector on the nyt dataset, run the following command
python train_demo.py nyt pcnn att

Currently {ENCODER_NAME} includes pcnn, cnn, rnn and birnn, and {SELECTOR_NAME} includes att (for attention), max (for maximum) and ave (for average). The model will be named as {DATASET_NAME}_{ENCODER_NAME}_{SELECTOR_NAME} automatically.

The checkpoint of the best epoch (each epoch will be validated while training) will be saved in ./checkpoint and results for plotting precision-recall curve will be saved in ./test_result by default.

  1. Use draw_plot.py to check auc, average precision, F1 score and precision-recall curve by the following command
python draw_plot.py {MODEL_NAME_1} {MODEL_NAME_2} ...

All the results of the models mentioned will be printed and precision-recall curves containing all the models will be saved in ./test_result/pr_curve.png.

  1. If you have the checkpoint of the model and want to evaluate it, run test_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME}. For example:
python test_demo.py nyt pcnn att

The prediction results will be stored in test_result/nyt_pcnn_att_pred.json.

Additional Modules

Reinforcement Learning (Feng et al. 2018)

We have implemented a reinforcement learning module following (Feng et al. 2018). There might be some slight differences in implementation details. The RL code is in nrekit/rl.py, and it can be added to any models by running:

python train_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME} rl

For example, by running python train_demo.py nyt pcnn att rl, you will get a pcnn_att model trained by RL.

For how the RL module helps alleviate false positive problem in DS data, please refer to the paper.

Test Results

NYT10 Dataset

AUC Results:

Model Attention Maximum Average
PCNN 0.3408 0.3247 0.3190
CNN 0.3277 0.3151 0.3044
RNN 0.3418 0.3473 0.3405
BiRNN 0.3352 0.3575 0.3244

Reference

  1. Neural Relation Extraction with Selective Attention over Instances. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, Maosong Sun. ACL2016. paper

  2. Adversarial Training for Relation Extraction. Yi Wu, David Bamman, Stuart Russell. EMNLP2017. paper

  3. A Soft-label Method for Noise-tolerant Distantly Supervised Relation Extraction. Tianyu Liu, Kexiang Wang, Baobao Chang, Zhifang Sui. EMNLP2017. paper

  4. Reinforcement Learning for Relation Classification from Noisy Data. Jun Feng, Minlie Huang, Li Zhao, Yang Yang, Xiaoyan Zhu. AAAI2018. paper