A set of tools for data preprocessing or settings file generation for tsakorpus.
A tool that collects glosses and POS tags used in the corpus and generates a report and settings files: conversion_settings.json, categories.json, and grammRules.csv (tab-delimited). All files are generated in the corpus directory. If conversion_settings.json is already present there, it is copied to conversion_settings.json.bak, and then updated. It is presumed that the generated files are further processed manually by a linguist, who should remove erroneously collected glosses, adjust the gloss-to-tag rules, etc.
Command line examples:
python prepare_gloss_settings.py -f tei --dir C:\Work\HZSK\corpora\selkup_tei -l selkup --pos ps --gloss ge
python prepare_gloss_settings.py -f exb --dir C:\Work\HZSK\corpora\selkup -l selkup --pos ps --gloss ge
(use python3
if you have also Python 2 installed)
Type python prepare_gloss_settings.py -h
for help.
The tools require the following software to run:
- python >= 3.5
- python modules: lxml (you can use requirements.txt)
The software is distributed under MIT license (see LICENSE.md).