An open-source framework for neural relation extraction.
Contributed by Tianyu Gao, Xu Han, Shulian Cao, Lumin Tang, Yankai Lin, Zhiyuan Liu
If you want to learn more about neural relation extraction, visit another project of ours NREPapers.
BIG UPDATE: The project has been completely reconstructed and is faster, more extendable and the codes are easier to read and use now. If you need get to the old version, please refer to branch old_version.
New features:
- JSON data support.
- Multi GPU training.
- Validating while training.
It is a TensorFlow-based framwork for easily building relation extraction (RE) models. We divide the pipeline of relation extraction into four parts, which are embedding, encoder, selector (for distant supervision) and classifier. For each part we have implemented several methods.
- Embedding
- Word embedding
- Position embedding
- Encoder
- PCNN
- CNN
- RNN
- Bidirection RNN
- Selector
- Attention
- Maximum
- Average
- Classifier
- Softmax Loss Function
- Output
All those methods could be combined freely.
We also provide training and testing framework for sentence-level RE and bag-level RE. A plotting tool is also in the package.
This project is under MIT license.
- Python (>=2.7)
- Numpy (>=1.13.3)
- TensorFlow (>=1.4.1)
- CUDA (>=8.0) if you are using gpu
- Matplotlib (>=2.0.0)
- scikit-learn (>=0.18)
For training and testing, you should provide four JSON
files including training data, testing data, word embedding data and relation-ID mapping data.
Training data file and testing data file, containing sentences and their corresponding entity pairs and relations, should be in the following format
[
{
'sentence': 'Bill Gates is the founder of Microsoft .',
'head': {'word': 'Bill Gates', 'id': 'm.03_3d', ...(other information)},
'tail': {'word': 'Microsoft', 'id': 'm.07dfk', ...(other information)},
'relation': 'founder'
},
...
]
IMPORTANT: In the sentence part, words and punctuations should be separated by blank spaces.
Word embedding data is used to initialize word embedding in the networks, and should be in the following format
[
{'word': 'the', 'vec': [0.418, 0.24968, ...]},
{'word': ',', 'vec': [0.013441, 0.23682, ...]},
...
]
This file indicates corresponding IDs for relations to make sure during each training and testing period, the same ID means the same relation. Its format is as follows
{
'NA': 0,
'relation_1': 1,
'relation_2': 2,
...
}
IMPORTANT: Make sure the ID of NA
is always 0.
NYT10 is a distantly supervised dataset originally released by the paper "Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text.". Here is the download link for the original data.
We've provided a toolkit to convert the original NYT10 data into JSON format that OpenNRE
could use. You could download the original data + toolkit from Google Drive or Tsinghua Cloud. Further instructions are included in the toolkit.
-
Install all the required package.
-
Clone the OpenNRE repository:
git clone https://github.com/thunlp/OpenNRE.git
Since there are too many history commits of this project and the .git
folder is too large, you could use the following command to download only the latest commit:
git clone https://github.com/thunlp/OpenNRE.git --depth 1
- Make data folder in the following structure
OpenNRE
|-- ...
|-- data
|
|-- {DATASET_NAME_1}
| |
| |-- train.json
| |-- test.json
| |-- word_vec.json
| |-- rel2id.json
|
|-- {DATASET_NAME_2}
| |
| |-- ...
|
|-- ...
You could use your own data or download datasets provided above.
- Run
train_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME}
. For example, if you want to train model with PCNN as the encoder and attention as the selector on thenyt
dataset, run the following command
python train_demo.py nyt pcnn att
Currently {ENCODER_NAME}
includes pcnn
, cnn
, rnn
and birnn
, and {SELECTOR_NAME}
includes att
(for attention), max
(for maximum) and ave
(for average). The model will be named as {DATASET_NAME}_{ENCODER_NAME}_{SELECTOR_NAME}
automatically.
The checkpoint of the best epoch (each epoch will be validated while training) will be saved in ./checkpoint
and results for plotting precision-recall curve will be saved in ./test_result
by default.
- Use
draw_plot.py
to check auc, average precision, F1 score and precision-recall curve by the following command
python draw_plot.py {MODEL_NAME_1} {MODEL_NAME_2} ...
All the results of the models mentioned will be printed and precision-recall curves containing all the models will be saved in ./test_result/pr_curve.png
.
- If you have the checkpoint of the model and want to evaluate it, run
test_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME}
. For example:
python test_demo.py nyt pcnn att
The prediction results will be stored in test_result/nyt_pcnn_att_pred.json
.
We have implemented a reinforcement learning module following (Feng et al. 2018). There might be some slight differences in implementation details. The RL code is in nrekit/rl.py
, and it can be added to any models by running:
python train_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME} rl
For example, by running python train_demo.py nyt pcnn att rl
, you will get a pcnn_att
model trained by RL.
For how the RL module helps alleviate false positive problem in DS data, please refer to the paper.
AUC Results:
Model | Attention | Maximum | Average |
---|---|---|---|
PCNN | 0.3408 | 0.3247 | 0.3190 |
CNN | 0.3277 | 0.3151 | 0.3044 |
RNN | 0.3418 | 0.3473 | 0.3405 |
BiRNN | 0.3352 | 0.3575 | 0.3244 |
-
Neural Relation Extraction with Selective Attention over Instances. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, Maosong Sun. ACL2016. paper
-
Adversarial Training for Relation Extraction. Yi Wu, David Bamman, Stuart Russell. EMNLP2017. paper
-
A Soft-label Method for Noise-tolerant Distantly Supervised Relation Extraction. Tianyu Liu, Kexiang Wang, Baobao Chang, Zhifang Sui. EMNLP2017. paper
-
Reinforcement Learning for Relation Classification from Noisy Data. Jun Feng, Minlie Huang, Li Zhao, Yang Yang, Xiaoyan Zhu. AAAI2018. paper