Scripts that extract a word corpus from OpenStreetMap, Wikipedia, and Wikidata targeting South-East Asian and Indic languages.
- OpenStreetMap-derived data is licensed under the Open Data Commons Open Database License (ODbL). See https://www.openstreetmap.org/copyright
- Wikipedia-derived data is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA). See https://en.wikipedia.org/wiki/Wikipedia:Copyrights
- Wikidata-derived data is licensed under the Creative Commons CC0 License. See https://www.wikidata.org/wiki/Wikidata:Licensing
Download the full corpus with duplicates. These files contain only text outside the Latin, Greek, Cyrillic, and CJK scripts, as defined in latin_greek_cyrillic_cjk.py.
- osm-corpus-with-duplicates.txt.zip (19M)
- wikipedia-corpus-with-duplicates.txt.zip (816M)
- wikidata-corpus-with-duplicates.txt.zip (118M)
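The actual filter lives in latin_greek_cyrillic_cjk.py, which is not reproduced here. As a rough sketch of the idea (the function names `is_excluded` and `keep_line` are hypothetical, and the real script classification may differ), a line can be kept when none of its letters belong to an excluded script, judged by the character's Unicode name:

```python
import unicodedata

# Hypothetical sketch: a character counts as Latin/Greek/Cyrillic/CJK
# if its official Unicode character name starts with one of these prefixes.
EXCLUDED_PREFIXES = ("LATIN", "GREEK", "CYRILLIC", "CJK",
                     "HIRAGANA", "KATAKANA", "HANGUL")

def is_excluded(ch):
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Unnamed code points (controls, some unassigned) are not excluded.
        return False
    return name.startswith(EXCLUDED_PREFIXES)

def keep_line(line):
    # Keep lines that contain at least one letter and no excluded-script letter.
    letters = [ch for ch in line if ch.isalpha()]
    return bool(letters) and not any(is_excluded(ch) for ch in letters)

print(keep_line("नमस्ते"))  # Devanagari text passes the filter → True
print(keep_line("Hello"))   # Latin text is rejected → False
```

This name-prefix approach is only an approximation; the repository's own definition is authoritative.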
Download corpus for a single script without duplicates:
- Devanagari
- Myanmar
If you need some other language or script, please open an issue on GitHub.
Download data sources:
cd osm/
python3 download.py
cd ../wikidata/
python3 download.py
cd ../wikipedia/
python3 download.py
Extract non-Latin/Greek/Cyrillic/CJK text from sources with:
python3 extract.py
Generate the word corpus (with duplicates) with:
python3 generate_corpus.py
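generate_corpus.py is not shown here, but the core step is splitting the extracted text into individual words while keeping every occurrence, so frequency information survives. A minimal sketch (the tokenization rules, including which punctuation is stripped, are assumptions):

```python
import re

def words_with_duplicates(lines):
    # Split each line on whitespace and trim surrounding punctuation.
    # Every occurrence is yielded, so duplicates are preserved.
    for line in lines:
        for word in re.findall(r"\S+", line):
            word = word.strip(".,;:!?()[]\"'")
            if word:
                yield word

corpus = list(words_with_duplicates(["नमस्ते दुनिया,", "नमस्ते"]))
print(corpus)  # ['नमस्ते', 'दुनिया', 'नमस्ते'] — the repeated word is kept
```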
Filter the corpus for a single script:
python3 filter_by_script.py
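The single-script downloads above are also deduplicated, so filter_by_script.py presumably combines a script test with duplicate removal. A sketch for Devanagari (the helper names are hypothetical, and only the main Devanagari block U+0900–U+097F is checked, ignoring the extended blocks):

```python
def in_devanagari(word):
    # Main Devanagari block only; Devanagari Extended is omitted for brevity.
    return all(0x0900 <= ord(ch) <= 0x097F for ch in word)

def filter_by_script(words):
    # Keep each Devanagari word once, preserving first-seen order.
    seen = set()
    for w in words:
        if w not in seen and in_devanagari(w):
            seen.add(w)
            yield w

result = list(filter_by_script(["नमस्ते", "Hello", "नमस्ते", "दुनिया"]))
print(result)  # ['नमस्ते', 'दुनिया'] — Latin word dropped, duplicate removed
```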