hyw-corpus

Overview

This repo contains the corpora that we used to evaluate the Apertium morphological analyzer for Western Armenian. We built the corpora through a mix of manual and automated scraping. Each corpus is stored in its own subfolder, and each subfolder has a README file that explains the corpus and how we used it.

The corpora and the folders are the following:

Bibles for Western and Eastern Armenian
A newspaper corpus for Western Armenian
UD Treebanks for Western and Eastern Armenian
Wikipedia for Western and Eastern Armenian

To measure precision and recall, we used the items and code in the precisionRecall folder.
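
For orientation, here is a minimal sketch of how precision and recall can be computed from two lists of analyses. It is not the code in the precisionRecall folder. The function name precision_recall and the input format (one token, a tab, and one analysis per line, in a gold file and a predicted file) are assumptions made for illustration, and the definitions used are the standard ones: precision is the share of predicted pairs found in the gold file, and recall is the share of gold pairs that were predicted.

```shell
#!/bin/sh
# Hypothetical helper, not the repo's own code.
# Usage: precision_recall GOLD_FILE PREDICTED_FILE
# Assumed format of both files: one "token<TAB>analysis" pair per line.
precision_recall() {
    gold=$1
    pred=$2
    g=$(mktemp)
    p=$(mktemp)
    # comm needs sorted input; -u also drops duplicate pairs.
    LC_ALL=C sort -u "$gold" > "$g"
    LC_ALL=C sort -u "$pred" > "$p"
    # comm -12 keeps only the lines present in both files (true positives).
    tp=$(LC_ALL=C comm -12 "$g" "$p" | wc -l)
    ng=$(wc -l < "$g")
    np=$(wc -l < "$p")
    rm -f "$g" "$p"
    awk -v tp="$tp" -v ng="$ng" -v np="$np" 'BEGIN {
        if (ng == 0 || np == 0) { print "empty input"; exit 1 }
        printf "precision=%.2f recall=%.2f\n", tp / np, tp / ng
    }'
}
```

Called as precision_recall gold.tsv predicted.tsv, it prints both scores on one line. The file names here are placeholders.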

Helpful commands

To evaluate the analyzer over some corpus, do the following:

Clone this repo and the apertium-hyw repo.
To build the analyzer, run make in the apertium-hyw repo, either with no target or with one of the targets hyx@hyw.automorf or hyx@hye.automorf.
To run the Western Armenian analyzer on some corpus (CORPUS), run the following command: sh coverage-ltproc.sh CORPUS ../apertium-hyw/hyx@hyw.automorf.bin
Open the temp folder. On macOS, run the following command: open /tmp
Find the parade file and copy its name. The name should look something like CORPUS.parade.txt
To get a list of tokens and their analysis, run the following command: cat /tmp/CORPUS.parade.txt | lt-proc ../apertium-hyw/hyx@hyw.automorf.bin | apertium-cleanstream -n > toks.txt
To get a list of unknown tokens, run the following command: cat /tmp/CORPUS.parade.txt | grep '\*' | sort | uniq -c | sort -rn > unks.txt
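
Once toks.txt exists, a short shell function can summarize coverage from it. The sketch below is a hypothetical helper, not part of this repo. It assumes that toks.txt holds one token per line in the form ^surface/analysis$, and that unknown tokens are analysed as the surface form prefixed with an asterisk, which is the usual lt-proc convention.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repo.
# Usage: coverage_summary TOKS_FILE
# Assumes one token per line, e.g. ^word/word<n>$ for a known token
# and ^word/*word$ for an unknown one.
coverage_summary() {
    toks_file=$1
    # Lines that start with ^ hold a token.
    total=$(grep -c '^\^' "$toks_file")
    # An analysis that starts with * marks an unknown token.
    unknown=$(grep -c '/\*' "$toks_file")
    awk -v t="$total" -v u="$unknown" 'BEGIN {
        if (t == 0) { print "no tokens found"; exit 1 }
        printf "total=%d known=%d unknown=%d coverage=%.2f%%\n", t, t - u, u, 100 * (t - u) / t
    }'
}
```

Running coverage_summary toks.txt prints the token counts and the percentage of tokens that received at least one analysis.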
