Scripts that extract a word corpus from OpenStreetMap, Wikipedia, and Wikidata targeting South-East Asian and Indic languages.
- OpenStreetMap-derived data is licensed under the Open Data Commons Open Database License (ODbL). See https://www.openstreetmap.org/copyright
- Wikipedia-derived data is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA). See https://en.wikipedia.org/wiki/Wikipedia:Copyrights
- Wikidata-derived data is licensed under the Creative Commons CC0 License. See https://www.wikidata.org/wiki/Wikidata:Licensing
Download the full corpus with duplicates. These files contain only text outside the Latin, Greek, Cyrillic, and CJK scripts, as defined in latin_greek_cyrillic_cjk.py.
- osm-corpus-with-duplicates.txt.zip (19M)
- wikipedia-corpus-with-duplicates.txt.zip (816M)
- wikidata-corpus-with-duplicates.txt.zip (118M)
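The actual filter lives in latin_greek_cyrillic_cjk.py, which is not reproduced here. As a rough sketch of the idea (the function names `is_excluded` and `keep_line` are hypothetical, and the real script classification may differ), a line can be kept when none of its letters belong to an excluded script, judged by the character's Unicode name:

```python
import unicodedata

# Hypothetical sketch: a character counts as Latin/Greek/Cyrillic/CJK
# if its official Unicode character name starts with one of these prefixes.
EXCLUDED_PREFIXES = ("LATIN", "GREEK", "CYRILLIC", "CJK",
                     "HIRAGANA", "KATAKANA", "HANGUL")

def is_excluded(ch):
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Unnamed code points (controls, some unassigned) are not excluded.
        return False
    return name.startswith(EXCLUDED_PREFIXES)

def keep_line(line):
    # Keep lines that contain at least one letter and no excluded-script letter.
    letters = [ch for ch in line if ch.isalpha()]
    return bool(letters) and not any(is_excluded(ch) for ch in letters)

print(keep_line("नमस्ते"))  # Devanagari text passes the filter → True
print(keep_line("Hello"))   # Latin text is rejected → False
```

This name-prefix approach is only an approximation; the repository's own definition is authoritative.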
Download corpus for a single script without duplicates:
- Devanagari
- Myanmar
If you need some other language or script, please open an issue on GitHub.
Download data sources:
cd osm/
python3 download.py
cd ../wikidata/
python3 download.py
cd ../wikipedia/
python3 download.py
Extract non-Latin/Greek/Cyrillic/CJK text from sources with:
python3 extract.py
Generate the word corpus (with duplicates) with:
python3 generate_corpus.py
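generate_corpus.py is not shown here, but the core step is splitting the extracted text into individual words while keeping every occurrence, so frequency information survives. A minimal sketch (the tokenization rules, including which punctuation is stripped, are assumptions):

```python
import re

def words_with_duplicates(lines):
    # Split each line on whitespace and trim surrounding punctuation.
    # Every occurrence is yielded, so duplicates are preserved.
    for line in lines:
        for word in re.findall(r"\S+", line):
            word = word.strip(".,;:!?()[]\"'")
            if word:
                yield word

corpus = list(words_with_duplicates(["नमस्ते दुनिया,", "नमस्ते"]))
print(corpus)  # ['नमस्ते', 'दुनिया', 'नमस्ते'] — the repeated word is kept
```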
Filter the corpus for a single script:
python3 filter_by_script.py
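The single-script downloads above are also deduplicated, so filter_by_script.py presumably combines a script test with duplicate removal. A sketch for Devanagari (the helper names are hypothetical, and only the main Devanagari block U+0900–U+097F is checked, ignoring the extended blocks):

```python
def in_devanagari(word):
    # Main Devanagari block only; Devanagari Extended is omitted for brevity.
    return all(0x0900 <= ord(ch) <= 0x097F for ch in word)

def filter_by_script(words):
    # Keep each Devanagari word once, preserving first-seen order.
    seen = set()
    for w in words:
        if w not in seen and in_devanagari(w):
            seen.add(w)
            yield w

result = list(filter_by_script(["नमस्ते", "Hello", "नमस्ते", "दुनिया"]))
print(result)  # ['नमस्ते', 'दुनिया'] — Latin word dropped, duplicate removed
```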