/multilingual_disco_data

Preprocessing scripts for the mtg parser.


Multilingual Discontinuous Data

This repository contains scripts that generate data in the input format of the mtg parser. To process the three supported corpora, run:

bash generate_english_data.sh
bash generate_tiger_data.sh
bash generate_negra_data.sh

Dependencies:

  • python3
  • java (>= 1.8)
  • discodop
  • treetools (install the Python 2 version; the Python 3 version appears to have a bug in the transform option)
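
Before running the scripts, it can help to confirm that the tools listed above are reachable. The following is a minimal sketch, not part of this repository: it assumes each dependency is exposed as an executable on PATH under the names python3, java, discodop and treetools, and the helper name missing_tools is hypothetical. It does not check the Java version or which Python version treetools was installed for.

```python
import shutil

# Assumed executable names for the dependencies; adjust them if your
# installation exposes the tools under different names.
REQUIRED_TOOLS = ["python3", "java", "discodop", "treetools"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.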

Data required (and not included):

  • English:
  • German (Tiger): corpus_data/GERMAN_SPMRL.tar.gz (SPMRL version of TiGer corpus)
  • German (Negra): corpus_data/negra-corpus.tar.gz
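
Since the corpora are not distributed with the repository, a quick check that the archives are in place avoids a failed run partway through. The sketch below is an illustration under assumptions: the helper name missing_archives is hypothetical, the paths are taken relative to the repository root, and only the two German archives named above are checked because no path is given for the English data.

```python
from pathlib import Path

# Archive paths as listed above, relative to the repository root.
# The English data is omitted because its location is not specified.
REQUIRED_ARCHIVES = [
    "corpus_data/GERMAN_SPMRL.tar.gz",
    "corpus_data/negra-corpus.tar.gz",
]


def missing_archives(root=".", archives=REQUIRED_ARCHIVES):
    """Return the archives that do not exist as regular files under `root`."""
    root = Path(root)
    return [name for name in archives if not (root / name).is_file()]


if __name__ == "__main__":
    for name in missing_archives():
        print("Missing corpus archive:", name)
```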

For English, the script uses the Stanford parser to convert the Penn Treebank (PTB) trees to CoNLL dependency trees.

For the Negra corpus, the script uses a modified version of Depsy to convert the corpus to dependency trees. The modification only ensures that Depsy does not change the tokenization.