This repo contains the corpora that we used to evaluate the Apertium morphological analyzer for Western Armenian. We created the corpora through a mix of manual and automated scraping. The corpora are stored in separate subfolders. Each folder has a README file that explains its corpora and how we used them.
The corpora and the folders are the following:
- Bibles for Western and Eastern Armenian
- A Newspaper corpus for Western Armenian
- UD Treebanks for Western and Eastern Armenian
- Wikipedia for Western and Eastern Armenian
To measure precision and recall, we used the items and code in the precisionRecall folder.
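As a rough illustration of what the two measures mean, the sketch below compares a gold file against an analyzer output file, each with one `token<TAB>analysis` pair per line. It is not the code in the precisionRecall folder: the function name `precision_recall`, the file layout, and the exact-match criterion are all assumptions made for this example. Precision is the share of predicted pairs that appear in the gold file, and recall is the share of gold pairs that the analyzer produced.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repo.
# Usage: precision_recall GOLD_FILE PREDICTED_FILE
# Assumes one "token<TAB>analysis" pair per line in both files;
# duplicate lines within a file are counted once.
precision_recall() {
    awk '
        # First file: collect the distinct gold pairs.
        FILENAME == ARGV[1] {
            if (!($0 in gold)) { gold[$0] = 1; g++ }
            next
        }
        # Second file: collect distinct predicted pairs and count exact matches.
        !($0 in pred) {
            pred[$0] = 1; p++
            if ($0 in gold) tp++
        }
        END {
            precision = (p > 0) ? tp / p : 0   # matches / predicted pairs
            recall    = (g > 0) ? tp / g : 0   # matches / gold pairs
            printf "precision=%.2f recall=%.2f\n", precision, recall
        }
    ' "$1" "$2"
}
```

With the function loaded in a shell, `precision_recall gold.tsv predicted.tsv` prints both values on one line; the two file names here are placeholders.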
To evaluate the analyzer over some corpus, do the following:
- Clone this repo and the apertium-hyw repo.
- To get the analyzer, run `make`, or either `hyx@hyw.automorf` or `hyx@hye.automorf`.
- To run the Western Armenian analyzer on some corpus (`CORPUS`), run the following command: `sh coverage-ltproc.sh CORPUS ../apertium-hyw/hyx@hyw.automorf.bin`
- Open the temp folder by running the following command: `open /tmp` (`open` is a macOS command; on other systems, browse to `/tmp` another way)
- Find the parade file and copy its file name. The file name can look something like `CORPUS.parade.txt`
- To get a list of tokens and their analyses, run the following command: `cat /tmp/CORPUS.parade.txt | lt-proc ../apertium-hyw/hyx@hyw.automorf.bin | apertium-cleanstream -n > toks.txt`
- To get a list of unknown tokens, run the following command: `cat /tmp/CORPUS.parade.txt | grep '\*' | sort | uniq -c | sort -rn > unks.txt`
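Once `toks.txt` exists, a single coverage figure can be derived from it. The sketch below is one possible way to do that and is not part of this repo: the function name `coverage_summary` is hypothetical, and it assumes that `toks.txt` holds one lexical unit per line in the form `^surface/analysis$`, with unknown tokens marked by an asterisk directly after the slash (for example `^xyz/*xyz$`). Check a few lines of your own `toks.txt` before relying on it.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repo.
# Usage: coverage_summary TOKS_FILE
# Assumes one lexical unit per line, starting with "^", and that unknown
# tokens carry "/*" after the surface form.
coverage_summary() {
    awk '
        # Count only lines that start a lexical unit; skip anything else.
        substr($0, 1, 1) == "^" {
            total++
            if (index($0, "/*") > 0) unknown++
        }
        END {
            known = total - unknown
            pct = (total > 0) ? 100 * known / total : 0
            printf "total=%d known=%d unknown=%d coverage=%.2f%%\n", total, known, unknown, pct
        }
    ' "$1"
}
```

With the function loaded in a shell, `coverage_summary toks.txt` prints the token counts and the percentage of tokens that received at least one analysis.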