Neural-Wikipedian

This repository contains the code and the datasets for the work that has been submitted as a research paper to the Journal of Web Semantics. The work focuses on how an adaptation of the encoder-decoder framework can be used to generate textual summaries for Semantic Web triples.

For a detailed description of the work presented in this repository, please refer to the preprint version of the submitted paper at: https://arxiv.org/abs/1711.00155.

Datasets

In order to train our proposed models, we built two datasets of aligned knowledge base triples with text.

  • D1: DBpedia triples aligned with Wikipedia biographies
  • D2: Wikidata triples aligned with Wikipedia biographies

In a Unix shell environment, execute sh download_datasets.sh to download and uncompress both datasets into their corresponding folders (i.e. D1 and D2). Each dataset folder consists of three sub-folders:

  • data contains each aligned dataset in binary-encoded pickle files. Each file is a hash table. Each hash table is a Python dictionary of lists.
  • utils contains each dataset's supporting files, such as hash tables of the frequency with which surface forms in the Wikipedia summaries have been mapped to entity URIs. All the files are binary-encoded in pickle files.
  • processed contains the processed version of each aligned dataset after removal of potential outliers (e.g. instances of the datasets with extremely long Wikipedia summaries or very few triples). The files that are contained in the processed folders are the ones that are used for the training and testing of both our neural-network-based systems and the baselines.

Inspect-Dataset.ipynb is an IPython Notebook that allows easier inspection of the above aligned datasets. The notebook also provides detailed information regarding the structure of the intermediate parts in D1/data/ and D2/data/ and the functionality of the supporting files in D1/utils/ and D2/utils/.
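For a quick look outside the notebook, a minimal sketch along the following lines may help. It relies only on what is stated above, namely that each file is a pickled dictionary of lists. The helper name summarize_pickle and the encoding="latin1" fallback (useful only if the files were written by Python 2, which is an assumption) are illustrative and not taken from the repository.

```python
import pickle


def summarize_pickle(path, max_keys=5):
    """Load a pickled dictionary of lists and report the length of each list."""
    with open(path, "rb") as handle:
        try:
            table = pickle.load(handle)
        except UnicodeDecodeError:
            # Assumption: the file may have been written by Python 2.
            handle.seek(0)
            table = pickle.load(handle, encoding="latin1")
    if not isinstance(table, dict):
        raise TypeError("expected a dictionary, got %s" % type(table).__name__)
    summary = {}
    # Only the first max_keys keys are reported to keep the output short.
    for key in list(table)[:max_keys]:
        summary[key] = len(table[key])
    return summary
```

Passing the path of any pickle file under D1/data/ or D2/data/ should return a small dictionary that maps each of the first few keys to the length of its list.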

The tables below present the distribution of the 10 most common predicates and entities in our two datasets, D1 and D2 respectively.

| Predicates In Triples | % | Entities In Triples | % | Entities In Summaries | % |
|---|---|---|---|---|---|
| dbo:birthDate | 12.43 | dbr:United_States | 0.49 | dbr:United_States | 2.82 |
| dbo:birthPlace | 10.67 | dbr:England | 0.19 | dbr:Actor | 2.14 |
| dbo:careerStation | 5.47 | dbr:United_Kingdom | 0.14 | dbr:Association_football | 1.02 |
| dbo:deathDate | 5.11 | dbr:France | 0.14 | dbr:Politician | 0.97 |
| dbo:occupation | 5.06 | dbr:Canada | 0.12 | dbr:Singing | 0.90 |
| dbo:team | 4.18 | dbr:India | 0.11 | dbr:United_Kingdom | 0.59 |
| dbo:deathPlace | 3.51 | dbr:Actor | 0.10 | dbr:England | 0.58 |
| dbo:genre | 3.22 | dbr:Italy | 0.10 | dbr:Writer | 0.53 |
| dbo:associatedBand | 2.85 | dbr:London | 0.10 | dbr:Canada | 0.50 |
| dbp:associatedMusicalArtist | 2.85 | dbr:Japan | 0.09 | dbr:France | 0.49 |

| Predicates In Triples | % | Entities In Triples | % | Entities In Summaries | % |
|---|---|---|---|---|---|
| wikidata:P569 (date of birth) | 14.15 | wikidata:Q5 (human) | 3.96 | wikidata:Q30 (United States of America) | 3.20 |
| wikidata:P106 (occupation) | 11.63 | wikidata:Q6581097 (male) | 3.27 | wikidata:Q33999 (actor) | 1.56 |
| wikidata:P31 (instance of) | 8.29 | wikidata:Q30 (United States of America) | 1.13 | wikidata:Q82955 (politician) | 1.02 |
| wikidata:P21 (sex or gender) | 7.92 | wikidata:Q6581072 (female) | 0.70 | wikidata:Q21 (England) | 0.87 |
| wikidata:P570 (date of death) | 7.58 | wikidata:Q145 (United Kingdom) | 0.44 | wikidata:Q145 (United Kingdom) | 0.85 |
| wikidata:P27 (country of citizenship) | 6.75 | wikidata:Q82955 (politician) | 0.42 | wikidata:Q27939 (singing) | 0.79 |
| wikidata:P735 (given name) | 6.53 | wikidata:Q1860 (English) | 0.39 | wikidata:Q36180 (writer) | 0.71 |
| wikidata:P19 (place of birth) | 5.20 | wikidata:Q33999 (actor) | 0.36 | wikidata:Q2736 (association football) | 0.68 |
| wikidata:P54 (member of sports team) | 2.64 | wikidata:Q36180 (writer) | 0.24 | wikidata:Q183 (Germany) | 0.61 |
| wikidata:P69 (educated at) | 2.58 | wikidata:Q177220 (singer) | 0.20 | wikidata:Q16 (Canada) | 0.58 |
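
Percentages of this kind can be recomputed with a few lines of Python. The sketch below assumes that each percentage is an item's share of all occurrences in its column, and that triples are available as (subject, predicate, object) tuples. Both points are assumptions made for illustration, and the example triples and the helper name top_k_share are hypothetical.

```python
from collections import Counter


def top_k_share(items, k=10):
    """Return the k most common items with their share of all items, in percent."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(item, round(100.0 * n / total, 2)) for item, n in counts.most_common(k)]


# Hypothetical triples as (subject, predicate, object) tuples.
triples = [
    ("dbr:A", "dbo:birthDate", "1950-01-01"),
    ("dbr:A", "dbo:birthPlace", "dbr:England"),
    ("dbr:B", "dbo:birthDate", "1960-02-02"),
    ("dbr:B", "dbo:occupation", "dbr:Actor"),
]
predicate_share = top_k_share(predicate for _, predicate, _ in triples)
```

The same helper can be applied to the objects of the triples, or to the entities annotated in the summaries, to obtain the other two columns.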

Our Systems

The Systems directory contains all the code needed both to train our models and to generate summaries for the sets of triples in the validation and test sets of our datasets. It contains our two models in two separate sub-folders (i.e. Triples2GRU and Triples2LSTM). The neural network models are implemented using the Torch package. We conducted our experiments on a single Titan X (Pascal) GPU. Please make sure that Torch, the torch-hdf5 package, and the NVIDIA CUDA drivers are installed on your machine before executing any of the .lua files in these directories.

  • You can train your own Triples2LSTM or Triples2GRU models by executing th train.lua inside each system's directory. You need access to a GPU with at least 11 GB of memory in order to train the models with the same hyperparameters that we used in the paper. However, by lowering the params.batch_size and params.rnn_size variables, you can train on NVIDIA GPUs with less dedicated memory. By altering the dataset_path and checkpoint_path variables in each train.lua file, you can select the dataset (i.e. D1 or D2) on which you will be training your model, and whether you will be using the surface form tuples or the URIs setup. The checkpoint files of the trained models will be saved in the corresponding checkpoints directory.

  • You can use a checkpoint of a trained model to start generating summaries given input sets of triples from the validation and test sets of the aligned datasets by executing th beam-sample.lua. Please make sure that the pre-trained model (i.e. on D1 or D2, with URIs or surface form tuples) matches the dataset that will be loaded in the beam_sampling_params.dataset_path variable. You can download all our trained models and generate summaries from them by running the shell scripts located at:

  • Systems/Triples2LSTM/download_trained_models.sh

  • Systems/Triples2GRU/download_trained_models.sh

The generated summaries will be saved as HDF5 files in the directory of the pre-trained model. Our trained models use CUDA Tensors. Consequently, the NVIDIA CUDA drivers along with the cutorch and cunn Lua packages should be installed on your machine. The latter two can be installed by running:

luarocks install cutorch
luarocks install cunn
  • Execute the Python script beam-sample.py to create a .csv file with the sampled summaries. The Python packages (i) h5py, (ii) pandas, and (iii) numpy should be installed on your machine. The script replaces the <item> tokens along with the property-type placeholders, and presents the generated summaries alongside the input sets of triples and the actual Wikipedia summaries in the resulting .csv file. By default, the .csv file is saved in the location of the pre-trained model.
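
The replacement step can be pictured with the short sketch below. It is not the code of beam-sample.py. The [[predicate]] placeholder syntax, the helper name fill_summary, and the example triple are all hypothetical, since the exact placeholder format is not described here. Only the <item> token comes from the description above.

```python
import re

# Hypothetical placeholder syntax: [[predicate]], e.g. [[dbo:birthPlace]].
PLACEHOLDER = re.compile(r"\[\[([^\]]+)\]\]")


def fill_summary(template, main_entity_label, triples):
    """Replace <item> with the main entity and each placeholder with a matching object."""
    objects = {}
    for _, predicate, obj in triples:
        # Keep the first object seen for each predicate.
        objects.setdefault(predicate, obj)

    def substitute(match):
        # Leave the placeholder untouched when no triple carries that predicate.
        return objects.get(match.group(1), match.group(0))

    text = template.replace("<item>", main_entity_label)
    return PLACEHOLDER.sub(substitute, text)


# Hypothetical generated summary and input triple.
template = "<item> was born in [[dbo:birthPlace]] and played for [[dbo:team]] ."
triples = [("dbr:X", "dbo:birthPlace", "London")]
filled = fill_summary(template, "Jane Doe", triples)
```

In this sketch, placeholders without a matching triple are left in place rather than dropped, so that unresolved slots remain visible when reviewing the output.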

For all possible alterations to the parameters of the above files, please consult their corresponding comment sections.

KenLM

The KenLM directory contains all the code required to train an n-gram Kneser-Ney language model. The code is based on the KenLM Language Model Toolkit. The binary files that reside in the ./kenlm/build/ directory have been compiled using Boost on a machine running Ubuntu 16.04 (x86_64 Linux 4.4.0-98-generic). If you wish to experiment with this baseline on a different OS, you need to download and compile the original package according to the instructions at https://kheafield.com/code/kenlm/.

The following Python packages should also be installed on your machine: (i) numpy, (ii) pandas, and (iii) kenlm. The latter can be installed by running: pip install https://github.com/kpu/kenlm/archive/master.zip (i.e. https://github.com/kpu/kenlm).

  • In a Unix shell environment, run: sh train.sh in order to train a 5-gram Kneser-Ney language model. The trained model will be saved in the ./KenLM/ directory with the .klm extension (e.g. D1.surf_form_tuples.model.klm or D2.surf_form_tuples.model.klm).
  • Execute the Python script sample.py in order to sample the most probable summary templates. The summaries are sampled using beam-search. The most probable templates will be saved in a pickle file (e.g. D1.surf_form_tuples.templates.p or D2.surf_form_tuples.templates.p) in the ./KenLM/templates/ directory.
  • Run the Python script process-templates.py in order to post-process the templates according to each input set of triples from the test or validation set of the selected dataset. The script replaces the <item> tokens along with any potential property-type placeholders according to the triples of the input set. The generated .csv file with all the generated summaries along with their input sets of triples is saved in the ./KenLM/templates/ directory.
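
The beam-search sampling mentioned above can be sketched as follows. This is a generic illustration and not the code of sample.py. A toy bigram table with made-up probabilities stands in for the trained Kneser-Ney model, and the names beam_search, toy_score_next, and BIGRAMS are hypothetical.

```python
import math


def beam_search(score_next, start, end, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found with beam search.

    score_next(sequence) must return a dict mapping each candidate next token
    to its log-probability given the sequence so far.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for log_prob, seq in beams:
            for token, token_log_prob in score_next(seq).items():
                candidates.append((log_prob + token_log_prob, seq + [token]))
        if not candidates:
            break
        # Keep only the beam_width best expansions at every step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (finished if cand[1][-1] == end else beams).append(cand)
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Toy bigram table standing in for a trained language model (values are made up).
BIGRAMS = {
    "<s>": {"<item>": 0.9, "the": 0.1},
    "<item>": {"is": 0.6, "was": 0.4},
    "is": {"a": 1.0},
    "was": {"a": 1.0},
    "a": {"writer": 0.7, "singer": 0.3},
    "writer": {"</s>": 1.0},
    "singer": {"</s>": 1.0},
    "the": {"</s>": 1.0},
}


def toy_score_next(seq):
    """Log-probabilities of the tokens that may follow the last token of seq."""
    return {tok: math.log(p) for tok, p in BIGRAMS.get(seq[-1], {}).items()}


best = beam_search(toy_score_next, "<s>", "</s>")
```

With a real model, score_next would query the trained language model for the scores of candidate continuations. Hypotheses are compared by raw log-probability here, without length normalisation, which is a simplification.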

In the default scenario, the model trains on D1 and samples summaries for the sets of triples that have been allocated to the test set. In case you wish to run the files (i.e. train.sh, sample.py and process-templates.py) in a different setup, you can alter them following the guidelines in each file's comment sections.

License

This project is licensed under the terms of the Apache 2.0 License.