Source code and dataset for the ACL 2018 paper: Dating Documents using Graph Convolution Networks.
Overview of NeuralDater (proposed method). NeuralDater exploits the syntactic and temporal structure of a document to learn effective representations, which in turn are used to predict the document's time. NeuralDater uses a Bi-directional LSTM (Bi-LSTM) and two Graph Convolution Networks (GCNs), one over the dependency tree and the other over the document's temporal graph, along with a softmax classifier, all trained jointly end-to-end. Please refer to the paper for more details.
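To make the graph convolution step more concrete, here is a minimal sketch of one simplified GCN layer over a dependency graph. It is not the implementation in `neural_dater.py`: it assumes undirected, unlabeled edges and mean aggregation, and the function name, toy features, and edge list are all hypothetical. See the paper for the exact formulation.

```python
def gcn_layer(features, edges, weight):
    """One simplified graph-convolution step: ReLU(mean over neighbours, then project).

    features: one feature vector per token (stand-in for Bi-LSTM outputs).
    edges:    (head, dependent) index pairs; treated as undirected in this sketch.
    weight:   projection matrix with len(features[0]) rows.
    """
    num_nodes = len(features)
    in_dim = len(features[0])
    out_dim = len(weight[0])

    # Every node is its own neighbour (self-loop) so it keeps its own features.
    neighbours = [{i} for i in range(num_nodes)]
    for head, dependent in edges:
        neighbours[head].add(dependent)
        neighbours[dependent].add(head)

    updated = []
    for i in range(num_nodes):
        # Average the feature vectors of the node and its neighbours.
        mean = [
            sum(features[j][k] for j in neighbours[i]) / len(neighbours[i])
            for k in range(in_dim)
        ]
        # Linear projection followed by ReLU.
        updated.append([
            max(0.0, sum(mean[k] * weight[k][c] for k in range(in_dim)))
            for c in range(out_dim)
        ])
    return updated


# Toy example: three tokens, token 1 is the head of tokens 0 and 2.
token_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dependency_edges = [(1, 0), (1, 2)]
projection = [[1.0, -1.0], [0.0, 1.0]]
updated_states = gcn_layer(token_states, dependency_edges, projection)
print(updated_states)
```

The same layer could be applied to the temporal graph by passing its links as `edges`. Stacking several such layers lets information travel along longer paths in the graph.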
- Compatible with TensorFlow 1.x and Python 3.x.
- Dependencies can be installed using `requirements.txt`.
- Download the processed version (includes the dependency and temporal graphs of each document) of the NYT and APW datasets.
- Unzip the `.pkl` file in the `data` directory.
- Documents are originally taken from the NYT and APW sections of the Gigaword Corpus, 5th ed.
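After unzipping, it can help to check that the file loads and to see how it is organized. The sketch below only reports the top-level type and keys. It makes no assumption about the actual layout of the processed data, and `summarize_pickle` is a hypothetical helper, not part of this repository.

```python
import pickle


def summarize_pickle(path):
    """Load a pickle file and describe its top-level structure."""
    # Only unpickle files from a source you trust: loading can run arbitrary code.
    with open(path, "rb") as handle:
        data = pickle.load(handle)

    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        summary["keys"] = sorted(str(key) for key in data)
    elif isinstance(data, (list, tuple)):
        summary["length"] = len(data)
    return summary
```

For example, `summarize_pickle("data/nyt_processed_data.pkl")` would describe the NYT file once it is in place.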
To obtain the temporal graph of new documents, follow these steps:
- Set up CAEVO and CATENA as explained in their respective repositories.
- To extract the event and time mentions of a document, run:

      ./runcaevoraw.sh <path_of_document>

  The above command generates an `.xml` file. CATENA uses this file to extract the temporal graph. The file also contains the dependency parse information of the document, which can be extracted using the following command:

      python preprocess/read_caveo_out.py <caevo_out_path> <destination_path>
- To make the generated `.xml` file compatible with CATENA's input format, use the following script:

      python preprocess/make_catena_input.py <caevo_out_path> <destination_path>
- The `.xml` file generated above is given as input to CATENA to obtain the temporal graph of the document:

      java -Xmx6G -jar ./target/CATENA-1.0.3.jar -i <path_to_xml> \
          --tlinks ./data/TempEval3.TLINK.txt \
          --clinks ./data/Causal-TimeBank.CLINK.txt \
          -l ./models/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model \
          -g ./models/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model \
          -p ./models/CoNLL2009-ST-English-ALL.anna-3.3.parser.model \
          -x ./tools/TextPro2.0/ -d ./models/catena-event-dct.model \
          -t ./models/catena-event-timex.model \
          -e ./models/catena-event-event.model -c ./models/catena-causal-event-event.model > <destination_path>
The above command outputs the list of links in the temporal graph, which are given as input to NeuralDater. The output file can be read using the following command:

    python preprocess/read_catena_out.py <catena_out_path> <destination_path>
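The steps above can be collected in one place with a small helper that lists the commands in order. This is only a sketch: `temporal_graph_commands` and all file names in the example are hypothetical, the caller must supply the path where CAEVO writes its `.xml` output, and each command presumably has to be run from the root of the repository it belongs to (CAEVO, CATENA, or this one). The helper prints the commands and does not execute them.

```python
import shlex

# CATENA resource and model arguments, copied from the command shown above.
CATENA_ARGS = [
    "--tlinks", "./data/TempEval3.TLINK.txt",
    "--clinks", "./data/Causal-TimeBank.CLINK.txt",
    "-l", "./models/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model",
    "-g", "./models/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model",
    "-p", "./models/CoNLL2009-ST-English-ALL.anna-3.3.parser.model",
    "-x", "./tools/TextPro2.0/",
    "-d", "./models/catena-event-dct.model",
    "-t", "./models/catena-event-timex.model",
    "-e", "./models/catena-event-event.model",
    "-c", "./models/catena-causal-event-event.model",
]


def temporal_graph_commands(document, caevo_xml, parse_out,
                            catena_xml, catena_out, links_out):
    """Return (command, stdout_path) pairs in the order the steps are run.

    stdout_path is None unless the step redirects its output to a file.
    """
    return [
        (["./runcaevoraw.sh", document], None),
        (["python", "preprocess/read_caveo_out.py", caevo_xml, parse_out], None),
        (["python", "preprocess/make_catena_input.py", caevo_xml, catena_xml], None),
        (["java", "-Xmx6G", "-jar", "./target/CATENA-1.0.3.jar",
          "-i", catena_xml, *CATENA_ARGS], catena_out),
        (["python", "preprocess/read_catena_out.py", catena_out, links_out], None),
    ]


# Hypothetical file names, for illustration only.
steps = temporal_graph_commands(
    "doc.txt", "caevo_out.xml", "dep_parse.out",
    "catena_in.xml", "catena_out.txt", "temporal_links.out",
)
for command, stdout_path in steps:
    line = shlex.join(command)  # requires Python 3.8+
    if stdout_path is not None:
        line += " > " + shlex.quote(stdout_path)
    print(line)
```

Keeping the commands as argument lists means each step can later be passed to `subprocess.run` with a suitable `cwd`, without re-parsing shell strings.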
- After installing the Python dependencies from `requirements.txt`, execute `sh setup.sh` to download the GloVe embeddings.
- `neural_dater.py` contains the TensorFlow (1.x) based implementation of NeuralDater (proposed method).
- To start training:

      python neural_dater.py -data data/nyt_processed_data.pkl -class 10 -name test_run

  - `-class` denotes the number of classes in the dataset: `10` for NYT and `16` for APW.
  - `-name` is an arbitrary name for the run.
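Because the class count depends on the dataset, a small helper can keep the two in sync when launching runs. The sketch below uses only the flags and class counts documented above. `training_command` is a hypothetical helper, not part of this repository, and the data path is left as a parameter because only the NYT file name appears in this README.

```python
import shlex

# Class counts as documented above: 10 for NYT, 16 for APW.
NUM_CLASSES = {"nyt": 10, "apw": 16}


def training_command(dataset, data_path, run_name):
    """Build the neural_dater.py training command for the given dataset."""
    if dataset not in NUM_CLASSES:
        raise ValueError("unknown dataset: " + repr(dataset))
    return [
        "python", "neural_dater.py",
        "-data", data_path,
        "-class", str(NUM_CLASSES[dataset]),
        "-name", run_name,
    ]


command = training_command("nyt", "data/nyt_processed_data.pkl", "test_run")
print(shlex.join(command))  # requires Python 3.8+
# prints: python neural_dater.py -data data/nyt_processed_data.pkl -class 10 -name test_run
```

The returned list can be passed directly to `subprocess.run(command, check=True)` from the repository root to start training.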
@InProceedings{P18-1149,
author = "Vashishth, Shikhar and Dasgupta, Shib Sankar and Ray, Swayambhu Nath and Talukdar, Partha",
title = "Dating Documents using Graph Convolution Networks",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "1605--1615",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1149"
}