UDConverter

A treebank format converter for converting PPCHE-style treebanks into UD treebanks.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Treebank format converter

Version 1.1

A Python module for converting bracket-parsed PPCHE-format treebanks to the Universal Dependencies framework. It is heavily based on existing NLTK packages.

The module is specifically configured to convert treebanks in the IcePaHC format, which is based on PPCME.
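To make "bracket-parsed" concrete, here is a minimal sketch of reading one labelled-bracket tree into nested Python tuples using only the standard library. It is not the converter's own code: the names `parse_brackets` and `leaves` are hypothetical, the sample tree is invented for illustration, and the sketch does not handle the unlabelled wrapper brackets or ID nodes that real treebank files may contain.

```python
def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Simplifying assumption: every opening bracket is followed by a label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested phrase
            else:
                children.append(tokens[pos])   # terminal (word) token
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return (tag, token) pairs for all terminals, left to right."""
    label, children = node
    result = []
    for child in children:
        if isinstance(child, str):
            result.append((label, child))
        else:
            result.extend(leaves(child))
    return result


# Invented sample tree, for illustration only.
SAMPLE = "(IP-MAT (NP-SBJ (PRO-N hann-hann)) (VBDI kom-koma) (. .-.))"
tree = parse_brackets(SAMPLE)
print(tree[0])
print(leaves(tree))
```

NLTK offers `Tree.fromstring` for the same kind of input, which is likely closer to what an NLTK-based converter would use.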

The converter has been used to create two Icelandic UD treebanks, UD_Icelandic-IcePaHC and UD_Icelandic-Modern, and one Faroese UD treebank, UD_Faroese-FarPaHC.

Version 1.1 achieves a labeled attachment score (LAS) of 82.87.
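LAS (labeled attachment score) is the percentage of tokens whose head and dependency label both match the gold standard. The sketch below shows that standard definition on a toy sentence. The function name `attachment_scores` and the data are hypothetical, and this README does not say how the reported figure was measured, so an official evaluation script may differ in details such as how label subtypes are handled.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    Each argument is a list of (head, deprel) pairs, one per token,
    in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    total = len(gold)
    # UAS: only the head has to match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both head and dependency label have to match.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy data, invented for illustration.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(uas, las)  # 75.0 50.0
```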

Setup

Install all requirements by running:

pip install -r requirements.txt

Usage

Scripts to run are in the scripts folder.

In all examples below, the --output flag writes the result to files in the /CoNLLU/ output folder. Without it, the converter prints to standard output.

Convert single file or directory of files:

convert.py -N -i path/to/corpus/file.psd --output --post_process

convert.py -N -i path/to/corpus/* --output --post_process
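If the conversion is driven from Python rather than from a shell, the commands above can be assembled per file. The following is only a sketch: the helper names are hypothetical, the script location `scripts/convert.py` is inferred from the note that scripts live in the scripts folder, and the `*.psd` pattern assumes input files use that extension. Only the flags shown above are used.

```python
import sys
from pathlib import Path


def build_convert_command(psd_file, script="scripts/convert.py"):
    """Build the argument list for converting one file (hypothetical helper)."""
    return [
        sys.executable, script,
        "-N", "-i", str(psd_file),
        "--output", "--post_process",
    ]


def commands_for_directory(corpus_dir):
    """One command per .psd file in corpus_dir, in sorted order."""
    return [build_convert_command(p) for p in sorted(Path(corpus_dir).glob("*.psd"))]


cmd = build_convert_command("path/to/corpus/file.psd")
print(cmd[1:])
```

Each command can then be passed to `subprocess.run(cmd, check=True)`. Whether the script must be started from a particular working directory is not stated here, so that may need adjusting.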

For the commands below, input files must be placed in a folder within the corpora folder:

Convert single tree in treebank using sentence ID (only prints to standard output):

convert.py -C FOLDER_NAME -id SENTENCE_ID

Convert single file in treebank:

convert.py -C FOLDER_NAME -f FILE_NAME --output --post_process
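The files written with --output can be inspected with a few lines of Python. The sketch below assumes they follow the standard CoNLL-U layout (comment lines starting with `#`, ten tab-separated columns per token, a blank line between sentences). The function name `read_conllu` and the sample sentence are hypothetical.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue  # sentence-level comment such as "# sent_id = ..."
        else:
            columns = line.split("\t")
            if len(columns) != len(CONLLU_FIELDS):
                raise ValueError("expected 10 columns, got %d" % len(columns))
            current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Invented sample sentence, for illustration only.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# text = Hann kom.",
    "\t".join(["1", "Hann", "hann", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "kom", "koma", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = read_conllu(SAMPLE)
print(len(sentences), [tok["form"] for tok in sentences[0]])
```

To read a converted file, pass its contents to the function, for example via `Path(...).read_text(encoding="utf-8")` with the path of a file in the output folder.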

Also included is a script that converts only the IcePaHC corpus (icepahc-v0.9), with pre- and post-processing:

convert_icepahc.py

Acknowledgements

This converter is part of the UniTree project for IcePaHC, funded by The Strategic Research and Development Programme for Language Technology, grant no. 180020-5301. Thanks are due to Örvar Kárason, whose previous work was used as a basis for the conversion.

This converter was improved as part of the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur (https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.