WikiBank

WikiBank is a new partially annotated resource for the multilingual frame-semantic parsing task.

Primary language: Python. License: MIT.

Available Datasets

The available datasets cover 5 languages: EN, ES, DE, FR, and IT. They are in the dataset folder.

Creation procedure

NOTE: The process requires around 1 TB of disk space, so make sure that much is available before starting.
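
A quick pre-flight check can avoid a run that fails halfway through. The sketch below uses Python's standard `shutil.disk_usage` to compare free space against the roughly 1 TB mentioned in the note; the function name `has_enough_space` and the exact threshold are illustrative assumptions, not part of this repository.

```python
import shutil

# Approximate requirement from the note (1 TB, decimal); adjust as needed.
REQUIRED_BYTES = 10**12


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    if has_enough_space("."):
        print("Enough free space to start.")
    else:
        print("Less than ~1 TB free; free up space or pick another disk.")
```

Pass the directory where the dumps and the MongoDB data will actually live, since that may be a different disk from the current working directory.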

Requirements

  1. MongoDB
  2. Python

Required files

  1. Download Wikidata JSON dump from here
  2. Download Wikipedia XML dump from here, or a JSON dump from here (download the "content" one).

If using the XML dump, convert it to JSON using one of the tools listed on this page.

Data preprocessing

To merge Wikidata and Wikipedia, both sets of documents must contain the Wikidata id. If your Wikipedia dump doesn't contain this field, you can compute a mapping from Wikipedia id to Wikidata id using the script "src/scripts/wiki_props.py" and the dump of the Wikipedia page properties (here - called wiki-latest-page_props.sql), and then use the output file to add the Wikidata id to the JSON documents.
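
To illustrate the idea, here is a minimal sketch of such a mapping. It is not the code of "src/scripts/wiki_props.py"; it assumes the page properties dump stores rows as `(page_id,'wikibase_item','Q…',…)` tuples inside `INSERT INTO` statements, and the names `parse_page_props`, `add_wikidata_id`, and the `id_field` parameter are hypothetical. The output field name `wikidata_id` matches the field indexed in the import step.

```python
import re

# Matches tuples such as (12,'wikibase_item','Q42',NULL) in INSERT statements.
# Assumption: page ids are integers and Wikidata ids look like Q<digits>.
WIKIBASE_ITEM = re.compile(r"\((\d+),'wikibase_item','(Q\d+)',")


def parse_page_props(lines):
    """Build {wikipedia_page_id: wikidata_id} from lines of a page_props SQL dump."""
    mapping = {}
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue
        for page_id, wikidata_id in WIKIBASE_ITEM.findall(line):
            mapping[int(page_id)] = wikidata_id
    return mapping


def add_wikidata_id(doc, mapping, id_field="id"):
    """Return a copy of a Wikipedia JSON document with "wikidata_id" added.

    `id_field` is a hypothetical name for the page-id field in your dump.
    Documents with no known mapping are returned unchanged.
    """
    enriched = dict(doc)
    wikidata_id = mapping.get(int(doc[id_field]))
    if wikidata_id is not None:
        enriched["wikidata_id"] = wikidata_id
    return enriched
```

For a real dump, open the SQL file with a tolerant decoder (for example `open(path, encoding="utf-8", errors="replace")`) and pass the file object to `parse_page_props`, since the file is too large to load at once and iterating line by line keeps memory use bounded by the size of the mapping.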

Data import

  1. Import the Wikidata dump into MongoDB in its own collection using:
    mongoimport --db WikiSRL --collection wikidata --file wikidata_dump.json --jsonArray
  2. Create an index on the "id" field:
    db.wikidata.createIndex({"id": 1})
    
  3. Import the JSON Wikipedia dump into MongoDB
  4. Create an index on the Wikidata id field:
    db.wikidata.createIndex({"wikidata_id": 1})
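
The `mongoimport` command in step 1 loads the dump directly. As an alternative sketch, assuming the dump follows the usual Wikidata layout (a single JSON array with one entity per line), entities can be streamed from Python and inserted in batches. The helper names, the batch size, and the commented `pymongo` usage are assumptions and not part of this repository.

```python
import json
from itertools import islice


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata-style JSON array dump."""
    for line in lines:
        # Each entity line ends with a comma; the array brackets sit on their own lines.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def batched(iterable, size):
    """Yield lists of up to `size` items, so inserts can be done in bulk."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical usage with pymongo (requires a running MongoDB instance):
#
#   from pymongo import MongoClient
#   collection = MongoClient()["WikiSRL"]["wikidata"]
#   with open("wikidata_dump.json", encoding="utf-8") as dump:
#       for batch in batched(iter_entities(dump), 1000):
#           collection.insert_many(batch)
#   collection.create_index("id")
```

Streaming keeps memory use flat regardless of dump size, which matters given the disk and data volumes involved.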
    

Data integration

  1. To merge Wikidata and Wikipedia, configure the config.py file and then run merge_wikis.py
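
The internals of merge_wikis.py are not described here, so the following is only a conceptual sketch of the join implied by the preprocessing and import steps: each Wikipedia document is matched to a Wikidata entity through its `wikidata_id`. The function name, the `wikidata` output field, and the in-memory dictionary (standing in for a lookup on the indexed MongoDB collection) are all assumptions.

```python
def merge_documents(wikipedia_docs, wikidata_by_id):
    """Attach the matching Wikidata entity to each Wikipedia document.

    `wikidata_by_id` maps a Wikidata id (e.g. "Q42") to its entity; in MongoDB
    this role would be played by a lookup on the indexed "id" field.
    Documents without a match are skipped.
    """
    for doc in wikipedia_docs:
        entity = wikidata_by_id.get(doc.get("wikidata_id"))
        if entity is None:
            continue
        merged = dict(doc)
        merged["wikidata"] = entity  # hypothetical field name
        yield merged
```

Because the function is a generator, merged documents can be written out one at a time instead of being held in memory together.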

SRL extraction

  1. To extract the triples and create the SRL file, configure the config.py file, and run srl.py