Scripts that extract a word corpus from OpenStreetMap, Wikipedia, and Wikidata targeting South-East Asian and Indic languages.

Primary language: Python. License: MIT.

Word Corpus

Data License

Downloads

Download the corpus with duplicates. These files contain only text outside the Latin, Greek, Cyrillic, and CJK scripts, as defined in latin_greek_cyrillic_cjk.py.
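
To illustrate what "text outside the Latin, Greek, Cyrillic, and CJK scripts" can mean in practice, here is a minimal sketch that classifies letters by the prefix of their Unicode character name. The prefix list and both function names are hypothetical; latin_greek_cyrillic_cjk.py may define the excluded scripts differently, for example with code point ranges.

```python
import unicodedata

# Hypothetical prefix list; the repository's latin_greek_cyrillic_cjk.py
# may draw the boundary differently (e.g. by code point ranges).
EXCLUDED_PREFIXES = ("LATIN", "GREEK", "CYRILLIC", "CJK")


def is_excluded_letter(ch: str) -> bool:
    """Return True if ch is a letter whose Unicode name starts with an excluded prefix."""
    if not ch.isalpha():
        return False
    return unicodedata.name(ch, "").startswith(EXCLUDED_PREFIXES)


def is_target_word(word: str) -> bool:
    """Return True if word has at least one letter and no letter from an excluded script."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and not any(is_excluded_letter(ch) for ch in letters)
```

Under these assumptions, a Thai or Devanagari word passes the check, while words written in Latin, Cyrillic, or CJK ideographs, and strings without any letters, are rejected.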

Single Scripts

Download corpus for a single script without duplicates:

If you need another language or script, please open an issue on GitHub...

Steps

Download data sources:

cd osm/
python3 download.py
cd ../wikidata/
python3 download.py
cd ../wikipedia/
python3 download.py
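
The three downloads above can also be driven from one script. The following is a sketch only: the helper names are hypothetical, and it assumes each of the three directories contains a download.py that can be run without arguments, as the commands above suggest.

```python
import subprocess
import sys
from pathlib import Path

# Directory names taken from the commands above.
SOURCE_DIRS = ("osm", "wikidata", "wikipedia")


def download_commands(repo_root):
    """Build one (working_directory, argv) pair per data source."""
    root = Path(repo_root)
    return [(root / name, [sys.executable, "download.py"]) for name in SOURCE_DIRS]


def run_downloads(repo_root="."):
    """Run each download.py inside its own directory, stopping at the first failure."""
    for cwd, argv in download_commands(repo_root):
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling run_downloads() from the repository root runs the same three steps in the same order. With check=True, a failed download raises subprocess.CalledProcessError instead of silently continuing to the next source.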

Extract non-Latin/Greek/Cyrillic/CJK text from sources with:

python3 extract.py

Generate the word corpus, duplicates included, with:

python3 generate_corpus.py
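
As a rough picture of what a corpus "with duplicates" looks like, the sketch below splits extracted lines into words and keeps every occurrence. It is not the logic of generate_corpus.py: the function name and the separator set (whitespace, digits, ASCII punctuation) are assumptions. Note also that Thai, Lao, Khmer, and Burmese are written without spaces between words, so for those scripts this splitting yields phrases rather than single words.

```python
import re

# Assumption: words are separated by whitespace, digits, or ASCII punctuation.
# The actual script may tokenize differently.
_SEPARATORS = re.compile(r"[\s\d!-/:-@\[-`{-~]+")


def words_with_duplicates(lines):
    """Yield every word occurrence from an iterable of text lines, in input order."""
    for line in lines:
        for word in _SEPARATORS.split(line):
            if word:  # skip empty strings produced at line boundaries
                yield word
```

Keeping duplicates preserves frequency information, so a later step can count occurrences with collections.Counter if needed.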

Filter the corpus for a single script:

python3 filter_by_script.py
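
The filtering step could look roughly like the sketch below, which keeps words whose letters all belong to one script and drops repeats. The function and its script_prefix parameter are hypothetical, and matching on Unicode character name prefixes is an assumption, not necessarily how filter_by_script.py selects a script. Deduplication is included because the single-script downloads are described as duplicate-free; whether the script itself removes duplicates is not stated.

```python
import unicodedata


def filter_by_script(words, script_prefix):
    """Return unique words whose letters all have Unicode names starting with script_prefix.

    The order of first appearance is preserved.
    """
    seen = set()
    result = []
    for word in words:
        letters = [ch for ch in word if ch.isalpha()]
        if not letters or word in seen:
            continue
        if all(unicodedata.name(ch, "").startswith(script_prefix) for ch in letters):
            seen.add(word)
            result.append(word)
    return result
```

For example, passing "THAI" as script_prefix keeps only Thai words, and "DEVANAGARI" keeps only Devanagari words. Combining marks are not letters, so they do not affect the decision.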