This is the Contemporary Amharic Corpus (CACO) version 1.1. CACO was compiled from various sources that have been proofread or edited. The corpus contains about 24 million orthographic words. Since it is partly a web corpus, we applied some automatic spelling corrections. Although the corpus was originally tagged with HornMorpho v3, you'll need HornMorpho v4 to run the scripts in this repository (that is, if you plan on running any custom morphological analyses). You can download it from: https://github.com/hltdi/HornMorpho
You can perform basic functions such as getting examples of words in context or morphological analyses from the Flask app.
To start it, first install all dependencies with poetry install
or equivalent.*
Then, run gunicorn app.main:app
to start a production server.
From there, you can use the pages linked to perform basic functions.
Note that the web UI accesses a representative sample of the corpus (about 1% of its size), so infrequent words may be missing from it.
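As a rough illustration of what "words in context" means here, the sketch below lists each occurrence of a word together with a few neighbouring tokens. It is not the Flask app's implementation; the function name keyword_in_context, the width parameter, and the whitespace tokenization are all assumptions made for the example.

```python
def keyword_in_context(lines, word, width=2):
    """Return (left, word, right) tuples for each whitespace-delimited match."""
    hits = []
    for line in lines:
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                # Take up to `width` tokens on each side of the match.
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, tok, right))
    return hits


# Placeholder tokens; real input would be lines from sample.txt.
demo = keyword_in_context(["a b c d e", "c x"], "c", width=1)
print(demo)
```

Because the sample covers only about 1% of the corpus, a search like this over sample.txt can come back empty for rare words.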
*See installation instructions for HornMorpho if you get an error installing from requirements; HornMorpho will need to be built separately but you can install the rest of the requirements as normal.
Download and extract the corpus using the link below before running any of the processing scripts in this repository.
The documents are provided in plain text format and XML format. The XML documents are the tagged versions of the plain text documents. For more details about the corpus, refer to the original publication.
The download directory houses the three main files that you'll want to extract (the result of running the scripts listed in src).
freq.csv
is a collection of the most frequent lemmas (not words) in the corpus and relevant statistics about their distribution.
Note that unlike the standard frequency list shipped with the corpus, this frequency list groups lemmas together, so e.g. መሆን and ነው are merged into the verb root ሆነ.
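The difference between counting words and counting lemmas can be sketched as follows. The LEMMA_OF mapping is a hand-written stand-in for the output of a morphological analyzer such as HornMorpho, and lemma_counts is a hypothetical helper, not the code in freq.py.

```python
from collections import Counter

# Hypothetical form-to-lemma table; in practice the lemma would come
# from a morphological analyzer rather than a fixed dictionary.
LEMMA_OF = {"መሆን": "ሆነ", "ነው": "ሆነ"}


def lemma_counts(tokens, lemma_of=LEMMA_OF):
    """Count tokens after mapping each one to its lemma (or to itself)."""
    return Counter(lemma_of.get(tok, tok) for tok in tokens)


# ቤት is an arbitrary extra token with no entry in the table.
tokens = ["ነው", "መሆን", "ነው", "ቤት"]
counts = lemma_counts(tokens)
print(counts.most_common())
```

A plain word-frequency list would report ነው and መሆን separately; here both contribute to the count for ሆነ.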
sample.txt
is a representative sample of the main corpus that is 1% of its size, which makes it ideal for running custom analyses.
It contains a random selection of lines from the main corpus, so its contents are truly representative of the entire corpus and not a specific part or source within it.
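One simple way to draw such a sample is to keep each line independently with a fixed probability. The sketch below shows that idea only; it is not necessarily how segment.py works, and the sample_lines function and its seed parameter are assumptions introduced for the example.

```python
import random


def sample_lines(lines, fraction=0.01, seed=None):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [line for line in lines if rng.random() < fraction]


# Synthetic stand-in for the corpus lines.
lines = [f"line {i}" for i in range(100_000)]
sample = sample_lines(lines, seed=0)
print(len(sample))  # close to 1% of the input, not exactly 1%
```

Sampling line by line keeps the original order of the retained lines and draws from every part of the input with equal probability.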
p-sample.txt
contains extracted lemma information from the sample, the result of analyzing it with HornMorpho.
If you plan on reproducing the frequency analysis or performing your own custom analysis, you'll need to download the corpus, install HornMorpho v4, and set up the relevant configuration for the scripts in the src
directory.
Also make sure you install the required dependencies with poetry install
or equivalent (Python 3.11 is recommended).
Create a folder in this repository called config
and place a file main.toml
within it.
This file should have the following structure:
[locs]
dir = "/abs/path/to/CACO"
sample = "/abs/path/to/artifacts"
The dir
field should point to the folder containing the main corpus data.
The sample
field should point to another folder for storing generated artifacts from the scripts in the src
directory.
All the scripts have docstrings explaining usage instructions.
You may wish to download the representative sample and place it in the sample
directory with the name sample.txt
to avoid having to run the segment.py
script, which takes a long time to run and yields non-deterministic results.
To reproduce the frequency analysis after generating or copying sample.txt
, run divide.py
, process.py
, combine.py
, and then freq.py
, in that order.
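The run order above can be captured in a small driver. This is only a sketch: it assumes the four scripts live directly in src and can be started without command-line arguments, which may not hold, so check each script's docstring first. build_commands and run_pipeline are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path

# Order taken from the instructions above.
PIPELINE = ["divide.py", "process.py", "combine.py", "freq.py"]


def build_commands(src_dir="src", python=sys.executable):
    """Return one command per script, in pipeline order."""
    return [[python, str(Path(src_dir) / script)] for script in PIPELINE]


def run_pipeline(src_dir="src"):
    """Run each script in turn, stopping at the first failure."""
    for cmd in build_commands(src_dir):
        subprocess.run(cmd, check=True)  # assumes no arguments are needed


for cmd in build_commands():
    print(" ".join(cmd))
```

Calling run_pipeline() from the repository root would then execute the four steps; check=True aborts the sequence as soon as one script exits with an error.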
After installing HornMorpho, open the following script in your site packages*:
File "python3.11/site-packages/hm/morpho/language.py", line 925, in convert_phones
self.epenthesis(phones)
*Line number may differ depending on installation details. Search for self.epenthesis(phones)
in language.py
to find the exact line number.
Then, wrap the self.epenthesis(phones)
call in a try
block:
try: self.epenthesis(phones)
except Exception: pass  # ignore epenthesis failures, but still allow Ctrl+C
This keeps errors raised in the epenthesis step from aborting the morphological analyzer when it runs on the corpus.
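Since the line number varies between installations, a short search can locate the call to wrap. The sketch below is a hypothetical helper (find_call_lines is not part of HornMorpho); the demonstration runs against a temporary stand-in file rather than a real installation.

```python
import tempfile
from pathlib import Path


def find_call_lines(path, needle="self.epenthesis(phones)"):
    """Return the 1-based line numbers in `path` that contain `needle`."""
    text = Path(path).read_text(encoding="utf-8")
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if needle in line]


# Demonstration on a two-line stand-in for language.py.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "language.py"
    demo_file.write_text(
        "def convert_phones(self, phones):\n"
        "    self.epenthesis(phones)\n",
        encoding="utf-8",
    )
    demo_hits = find_call_lines(demo_file)
print(demo_hits)
```

Point the function at the language.py inside your own site-packages directory to get the line to edit.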
10.24352/UB.OVGU-2018-144
http://dx.doi.org/10.24352/ub.ovgu-2018-144
All documents in the corpus were publicly available on the Web, and the corpus was obtained by crawling the Web. For copyright reasons, the order of sentences is shuffled in this distribution. By downloading this corpus you agree to use it only for research purposes.
When using this data, please cite the original publication:
Gezmu, Andargachew Mekonnen, Binyam Ephrem Seyoum, Michael Gasser, and Andreas Nürnberger. "Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus." In Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pp. 65-70. 2018. Available at: http://www.aclweb.org/anthology/W18-3809