RESIDE: Improving Distantly-Supervised Neural Relation Extraction using Side Information
Source code for the EMNLP 2018 paper: RESIDE: Improving Distantly-Supervised Neural Relation Extraction using Side Information. Also includes implementations of the PCNN, PCNN+ATT, CNN, CNN+ATT, and BGWA models.
Overview of RESIDE (proposed method): RESIDE first encodes each sentence in the bag by concatenating embeddings (denoted by ⊕) from Bi-GRU and Syntactic GCN for each token, followed by word attention. Then, the sentence embedding is concatenated with relation alias information, which comes from the Side Information Acquisition Section, before computing attention over sentences. Finally, the bag representation with entity type information is fed to a softmax classifier. Please refer to the paper for more details.
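The bag-level step described above (concatenate alias information, attend over sentences, append entity type information) can be illustrated with a minimal plain-Python sketch. This is not the TensorFlow code in reside.py: the function names, the dot-product scoring against a single query vector, and the toy vectors are all assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bag_representation(sent_embs, alias_embs, query, type_emb):
    """Hypothetical helper: attention over the sentences of one bag.

    sent_embs  : one embedding (list of floats) per sentence
    alias_embs : one relation alias embedding per sentence
    query      : attention query vector (learned in a real model; fixed here)
    type_emb   : entity type embedding appended to the bag representation
    """
    # Concatenate each sentence embedding with its relation alias embedding.
    sents = [s + a for s, a in zip(sent_embs, alias_embs)]
    # Score every sentence with a dot product against the query (an assumption).
    scores = [sum(x * q for x, q in zip(s, query)) for s in sents]
    weights = softmax(scores)
    # The bag vector is the attention-weighted sum of the sentence vectors.
    dim = len(sents[0])
    bag = [sum(w * s[i] for w, s in zip(weights, sents)) for i in range(dim)]
    # Append entity type information before the (omitted) softmax classifier.
    return bag + type_emb, weights

# Toy inputs, chosen only to make the shapes visible.
sent_embs = [[1.0, 0.0], [0.0, 1.0]]
alias_embs = [[0.5], [0.5]]
query = [1.0, 1.0, 0.0]
type_emb = [0.1, 0.2]

bag, weights = bag_representation(sent_embs, alias_embs, query, type_emb)
print(len(bag), weights)
```

In the sketch, the classifier input has dimension (sentence dim + alias dim + type dim); the real model's dimensions and scoring function are described in the paper.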
Dependencies
- Compatible with TensorFlow 1.x and Python 3.x.
- Dependencies can be installed using requirements.txt.
Dataset:
- We use the Riedel NYT and Google IISc Distant Supervision (GIDS) datasets for evaluation.
- Datasets in JSON list format with side information can be downloaded from here: RiedelNYT and GIDS.
- The processed version of the datasets can be downloaded from RiedelNYT and GIDS. The structure of the processed input data is as follows.
{
    "voc2id": {"w1": 0, "w2": 1, ...},
    "type2id": {"type1": 0, "type2": 1, ...},
    "rel2id": {"NA": 0, "/location/neighborhood/neighborhood_of": 1, ...},
    "max_pos": 123,
    "train": [
        {
            "X": [[s1_w1, s1_w2, ...], [s2_w1, s2_w2, ...], ...],
            "Y": [bag_label],
            "Pos1": [[s1_p1_1, s1_p1_2, ...], [s2_p1_1, s2_p1_2, ...], ...],
            "Pos2": [[s1_p2_1, s1_p2_2, ...], [s2_p2_1, s2_p2_2, ...], ...],
            "SubPos": [s1_sub, s2_sub, ...],
            "ObjPos": [s1_obj, s2_obj, ...],
            "SubType": [s1_subType, s2_subType, ...],
            "ObjType": [s1_objType, s2_objType, ...],
            "ProbY": [[s1_rel_alias1, s1_rel_alias2, ...], [s2_rel_alias1, ...], ...],
            "DepEdges": [[s1_dep_edges], [s2_dep_edges], ...]
        },
        {}, ...
    ],
    "test": { same as "train" },
    "valid": { same as "train" }
}
- voc2id is the mapping of word to its id.
- type2id is the mapping of entity type to its id.
- rel2id is the mapping of relation to its id.
- max_pos is the maximum position to consider for positional embeddings.
- Each entry of train, test and valid is a bag of sentences, where
  - X denotes the sentences in the bag as a list of lists of word indices.
  - Y is the relation expressed by the sentences in the bag.
  - Pos1 and Pos2 are the positions of each word in the sentences with respect to target entity 1 and entity 2.
  - SubPos and ObjPos contain the positions of target entity 1 and entity 2 in each sentence.
  - SubType and ObjType contain the type information of target entity 1 and entity 2, obtained from the KG.
  - ProbY is the relation alias side information (refer to the paper) for the bag.
  - DepEdges is the edge list of the dependency parse for each sentence (required for GCN).
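To make the layout above concrete, the following sketch builds a tiny synthetic dataset with the same keys, writes it to a temporary pickle, reloads it, and checks that the per-sentence fields of a bag line up. It assumes the processed file can be read with a plain pickle.load; the toy values, the check_bag helper, and the (head, dependent) pair format used for DepEdges are hypothetical and only mirror the schema shown here.

```python
import os
import pickle
import tempfile

# Synthetic stand-in for the processed data; all values are illustrative.
toy = {
    "voc2id": {"w1": 0, "w2": 1, "w3": 2},
    "type2id": {"type1": 0, "type2": 1},
    "rel2id": {"NA": 0, "/location/neighborhood/neighborhood_of": 1},
    "max_pos": 123,
    "train": [{
        "X": [[0, 1, 2], [2, 1]],           # two sentences as word ids
        "Y": [1],                           # bag label
        "Pos1": [[0, 1, 2], [1, 0]],        # placeholder position values
        "Pos2": [[2, 1, 0], [0, 1]],
        "SubPos": [0, 1],
        "ObjPos": [2, 0],
        "SubType": [0, 0],
        "ObjType": [1, 1],
        "ProbY": [[1], [0, 1]],
        "DepEdges": [[(0, 1), (1, 2)], [(0, 1)]],  # assumed edge format
    }],
    "test": [],
    "valid": [],
}

PER_SENTENCE_KEYS = ["Pos1", "Pos2", "SubPos", "ObjPos",
                     "SubType", "ObjType", "ProbY", "DepEdges"]

def check_bag(bag):
    """Return True if every per-sentence field has one entry per sentence
    and the position lists match the sentence lengths."""
    n = len(bag["X"])
    if any(len(bag[key]) != n for key in PER_SENTENCE_KEYS):
        return False
    return all(len(w) == len(p1) == len(p2)
               for w, p1, p2 in zip(bag["X"], bag["Pos1"], bag["Pos2"]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_processed.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    with open(path, "rb") as f:
        data = pickle.load(f)

id2rel = {idx: rel for rel, idx in data["rel2id"].items()}
first_bag = data["train"][0]
bag_ok = check_bag(first_bag)
bag_relation = id2rel[first_bag["Y"][0]]
print(bag_ok, bag_relation, len(first_bag["X"]))
```

The same check_bag call could be pointed at bags from a downloaded processed file, provided that file follows the structure documented above.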
Evaluate pretrained model:
- reside.py contains a TensorFlow (1.x) based implementation of RESIDE (proposed method).
- Download the pretrained model's parameters from RiedelNYT and GIDS (put the downloaded folders in the checkpoint directory).
- Execute evaluate.sh to compare the pretrained RESIDE model against the baselines (plots the Precision-Recall curve).
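For readers unfamiliar with how a Precision-Recall curve is obtained from ranked predictions, here is a minimal sketch. It is not the logic of evaluate.sh or plot_pr.py, which may differ; the function name and the toy scores and labels are assumptions for illustration.

```python
def precision_recall_points(scores, labels):
    """Return (precision, recall) after each prediction, ranked by score.

    scores : confidence of each predicted relation fact
    labels : 1 if the corresponding prediction is correct, else 0
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one correct prediction is required")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = []
    correct = 0
    for k, (_, label) in enumerate(ranked, start=1):
        correct += label
        points.append((correct / k, correct / total_pos))
    return points

# Toy predictions: four facts, two of them correct.
scores = [0.9, 0.8, 0.6, 0.3]
labels = [1, 0, 1, 0]
points = precision_recall_points(scores, labels)
for precision, recall in points:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Plotting recall on the x-axis against precision on the y-axis for these points gives the curve; a model whose curve lies above another's has higher precision at the same recall.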
Side Information:
- Entity Type information for both the datasets is provided in side_info/type_info.zip.
  - Entity type information can be used directly in the model.
- Relation Alias Information for both the datasets is provided in side_info/relation_alias.zip.
Training from scratch:
- Execute setup.sh for downloading GloVe embeddings.
- For training RESIDE run:
  python reside.py -data data/riedel_processed.pkl -name new_run
- The above model needs to be further trained with the SGD optimizer for a few epochs to match the performance reported in the paper. For that execute
  python reside.py -name new_run -restore -opt sgd -lr 0.001 -l2 0.0 -epoch 4
- Finally, run python plot_pr.py -name new_run to get the plot.
Baselines:
- The repository also includes code for the PCNN, PCNN+ATT, CNN, CNN+ATT, and BGWA models.
- For training PCNN+ATT:
  python pcnnatt.py -data data/riedel_processed.pkl -name new_run -attn # remove -attn for PCNN
- Similarly for training CNN+ATT:
  python cnnatt.py -data data/riedel_processed.pkl -name new_run -attn # remove -attn for CNN
- For training BGWA:
  python bgwa.py -data data/riedel_processed.pkl -name new_run
Preprocessing a new dataset:
- The preproc directory contains code for getting a new dataset in the required format (riedel_processed.pkl) for reside.py.
- Get the data in the same format as followed in riedel_raw or gids_raw for the Riedel NYT dataset.
- Finally, run the script preprocess.sh. make_bags.py is used for generating bags from sentences. generate_pickle.py is for converting the data into the required pickle format.
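As a rough illustration of what "generating bags from sentences" means in distant supervision, the sketch below groups sentences by their entity pair and collects the relations seen for that pair. It is not the code of make_bags.py: the field names (sub, obj, rel, sent), the grouping key, and the toy records are hypothetical.

```python
from collections import defaultdict

def make_bags(sentences):
    """Group sentence records into bags keyed by (subject, object).

    Each record is a dict with hypothetical keys:
    "sub", "obj" (entity identifiers), "rel" (relation label), "sent" (text).
    """
    bags = defaultdict(lambda: {"sents": [], "rels": set()})
    for record in sentences:
        bag = bags[(record["sub"], record["obj"])]
        bag["sents"].append(record["sent"])
        bag["rels"].add(record["rel"])
    return dict(bags)

# Toy records; entity names and sentences are placeholders.
sentences = [
    {"sub": "e1", "obj": "e2", "rel": "r1", "sent": "e1 is located in e2 ."},
    {"sub": "e1", "obj": "e2", "rel": "r1", "sent": "e2 includes e1 ."},
    {"sub": "e3", "obj": "e4", "rel": "NA", "sent": "e3 met e4 yesterday ."},
]

bags = make_bags(sentences)
for pair, bag in bags.items():
    print(pair, len(bag["sents"]), sorted(bag["rels"]))
```

Each resulting bag would then still need to be converted to word indices, positions, and side information before it matches the processed format expected by reside.py.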
Running pretrained model on new samples:
- The code for running a pretrained model on a sample is included in the online directory.
- A Flask-based server is also provided. Use python online/server.py to start the server.
  - riedel_test_bags.json and other required files can be downloaded from the provided links.
Citation:
Please cite the following paper if you use this code in your work.
@inproceedings{reside2018,
author = "Vashishth, Shikhar and
Joshi, Rishabh and
Prayaga, Sai Suman and
Bhattacharyya, Chiranjib and
Talukdar, Partha",
title = "{RESIDE}: Improving Distantly-Supervised Neural Relation Extraction using Side Information",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
month = oct # "-" # nov,
address = "Brussels, Belgium",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "1257--1266",
url = "http://aclweb.org/anthology/D18-1157"
}
For any clarification, comments, or suggestions please create an issue or contact Shikhar.