hyw-corpus

Overview

This repo contains the corpora that we used to evaluate the Apertium morphological analyzer for Western Armenian. We built the corpora through a mix of manual and automated scraping. Each corpus is stored in its own subfolder, and each subfolder has a README file that describes the corpus and how we used it.

The corpora and their folders are the following:

  • Bibles for Western and Eastern Armenian
  • A newspaper corpus for Western Armenian
  • UD Treebanks for Western and Eastern Armenian
  • Wikipedia for Western and Eastern Armenian

To measure precision and recall, we used the items and code in the precisionRecall folder.

Helpful commands

To evaluate the analyzer over some corpus, do the following:

  1. Clone this repo and the apertium-hyw repo.
  2. To build the analyzer, run make in the apertium-hyw repo, or make only hyx@hyw.automorf or hyx@hye.automorf.
  3. To run the Western Armenian analyzer on some corpus (CORPUS), run the following command: sh coverage-ltproc.sh CORPUS ../apertium-hyw/hyx@hyw.automorf.bin
  4. Open the temp folder. On macOS, run the following command: open /tmp
  5. Find the parade file and copy its name. The name looks something like CORPUS.parade.txt
  6. To get a list of tokens and their analyses, run the following command: cat /tmp/CORPUS.parade.txt | lt-proc ../apertium-hyw/hyx@hyw.automorf.bin | apertium-cleanstream -n > toks.txt
  7. To get a list of unknown tokens, run the following command: cat /tmp/CORPUS.parade.txt | grep '\*' | sort | uniq -c | sort -rn > unks.txt
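
As a rough sketch, step 7 can be wrapped in small shell functions that also report a coverage figure. This assumes that the parade file holds one token per non-empty line and that unknown tokens are the lines containing an asterisk, as the grep in step 7 suggests. The function names list_unknowns and coverage_summary are hypothetical and are not part of this repo.

```shell
# Sketch only. Assumption: FILE holds one analyzed token per non-empty
# line, and unknown tokens are the lines that contain an asterisk.

# Print unknown tokens with their frequency, most frequent first
# (the same pipeline as step 7, reading from FILE).
list_unknowns() {
    grep '\*' "$1" | sort | uniq -c | sort -rn
}

# Print the token count, the unknown count, and the share of known tokens.
coverage_summary() {
    total=$(grep -c . "$1")        # non-empty lines, one token each
    unknown=$(grep -c '\*' "$1")   # lines marked as unknown
    awk -v t="$total" -v u="$unknown" 'BEGIN {
        if (t == 0) { print "no tokens found"; exit 1 }
        printf "%d tokens, %d unknown, coverage %.1f%%\n", t, u, 100 * (t - u) / t
    }'
}
```

After loading these functions into the shell, coverage_summary /tmp/CORPUS.parade.txt prints a one-line summary, and list_unknowns /tmp/CORPUS.parade.txt > unks.txt writes the same list as step 7.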