Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research relies on language corpora: large samples of “real world” text. This crawler helps to build such corpora. It follows links to publicly accessible web pages known to be written in a certain language, removes boilerplate and HTML markup, and writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so that it places little load on the crawled web sites.
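As an illustration of two of the steps described above (honoring robots.txt and stripping HTML markup), here is a minimal sketch that uses only the Python standard library. It is not Corpus Crawler's actual code: the names `USER_AGENT`, `DEFAULT_DELAY_SECONDS`, `load_robots`, `delay_for`, and `strip_markup` are hypothetical, and real boilerplate removal is site-specific and is not attempted here.

```python
import urllib.robotparser
from html.parser import HTMLParser

USER_AGENT = "example-corpus-bot"  # hypothetical user-agent name
DEFAULT_DELAY_SECONDS = 5.0        # hypothetical politeness delay


def load_robots(robots_txt):
    """Parse the text of a robots.txt file (no network access needed)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(parser):
    """Use the site's Crawl-delay if it declares one, else our default."""
    declared = parser.crawl_delay(USER_AGENT)
    return float(declared) if declared is not None else DEFAULT_DELAY_SECONDS


class _TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> elements."""

    SKIPPED = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")  # keep block-level text separated

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse every run of whitespace into a single space.
        return " ".join("".join(self._chunks).split())


def strip_markup(html):
    """Return the visible text of an HTML document as one line."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()
```

A crawl loop built on this would call `can_fetch(USER_AGENT, url)` on the parsed robots.txt before each request and sleep for `delay_for(...)` seconds between requests to the same site.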

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

Supported Languages

| IETF BCP47 Code | Language | Tokens¹ | |
|---|---|---|---|
| ae | Avestan | 129K | 💾 |
| ae-Latn | Avestan (Latin) | 141K | 💾 |
| am | Amharic | 2,170K | 💾 |
| az | Azerbaijani | 3,413K | 💾 |
| be | Belarusian | 1,441K | 💾 |
| bg | Bulgarian | 10,597K | 💾 |
| bm | Bambara | 30K | 💾 |
| bn | Bangla | 7,258K | 💾 |
| bo | Tibetan | 5,642K | 💾 |
| bs | Bosnian | 8,993K | 💾 |
| ccp | Chakma | 79K | 💾 |
| cs | Czech | 3,141K | 💾 |
| de | German | 7,894K² | 💾 |
| dz | Dzongkha | 61K | 💾 |
| el | Greek | 5,470K | 💾 |
| es | Spanish | 32,670K | 💾 |
| fa | Persian | 9,114K | 💾 |
| fa-AF | Dari | 7,363K | 💾 |
| fi | Finnish | 4,837K | 💾 |
| fit | Tornedalen Finnish | 292K | 💾 |
| fo | Faroese | 851K | 💾 |
| fuv | Nigerian Fulfulde | 13K | 💾 |
| ga | Irish | 298K | 💾 |
| gd | Scottish Gaelic | 17,105K | 💾 |
| gsw-u-sd-chag | Swiss German (Aargau) | 99K | 💾 |
| gsw-u-sd-chbe | Swiss German (Bern) | 73K | 💾 |
| gsw-u-sd-chfr | Swiss German (Fribourg) | 42K | 💾 |
| gv | Manx Gaelic | 152K | 💾 |
| ha | Hausa | 1,775K | 💾 |
| haw | Hawaiian | 2,221K | 💾 |
| hi | Hindi | 10,004K | 💾 |
| hr | Croatian | 8,188K | 💾 |
| id | Indonesian | 6,634K | 💾 |
| ig | Igbo | 13K | 💾 |
| ja | Japanese | 2,116K | 💾 |
| kj | Kuanyama | 1,474K | 💾 |
| kk | Kazakh | 642K | 💾 |
| km | Khmer | 20,908K | 💾 |
| ku | Kurdish | 2,479K | 💾 |
| ky | Kyrgyz | 4,380K² | 💾 |
| la | Latin | 48K | 💾 |
| lo | Lao | 4,384K | 💾 |
| mi | Maori | 1,504K | 💾 |
| mk | Macedonian | 10,422K | 💾 |
| mnw | Mon | 1,836K | 💾 |
| mt | Maltese | 3,331K | 💾 |
| my | Burmese | 1,007K | 💾 |
| my-t-d0-zawgyi | Burmese (Zawgyi encoding) | 593K | 💾 |
| pl | Polish | 7,148K | 💾 |
| ps | Pashto | 7,343K | 💾 |
| rm-puter | Romansh (Puter) | 1,068K | 💾 |
| rm-rumgr | Romansh (Grischun) | 4,794K | 💾 |
| rm-surmiran | Romansh (Surmiran) | 2,540K | 💾 |
| rm-sursilv | Romansh (Sursilvan) | 11,678K | 💾 |
| rm-sutsilv | Romansh (Sutsilvan) | 1,007K | 💾 |
| rm-vallader | Romansh (Vallader) | 5,560K | 💾 |
| ro | Romanian | 13,962K | 💾 |
| ru | Russian | 6,216K² | 💾 |
| rw | Kinyarwanda | 605K | 💾 |
| shn | Shan | 1,435K | 💾 |
| si | Sinhala | 1,046K | 💾 |
| sn | Shona | 2,542K | 💾 |
| so | Somali | 874K | 💾 |
| sq | Albanian | 10,104K | 💾 |
| sr-Latn | Serbian (Latin) | 10,143K | 💾 |
| sv | Swedish | 33,633K | 💾 |
| sw | Swahili | 8,817K | 💾 |
| ta | Tamil | 1,413K | 💾 |
| taq | Tamasheq | 66K | 💾 |
| ti | Tigrinya | 803K | 💾 |
| tr | Turkish | 13,846K | 💾 |
| ug | Uyghur | 9,493K | 💾 |
| uk | Ukrainian | 12,921K | 💾 |
| ur | Urdu | 3,622K | 💾 |
| yo | Yoruba | 80K | 💾 |

¹ To count tokens, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO. Downloadable files include counts for each token. To get the raw text, run the crawler yourself.
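The counting rule in footnote ¹ can be sketched as follows. This is an illustration under stated assumptions, not the project's counting code. It does not call ICU; it takes `(token, status)` pairs such as an ICU word break iterator would yield, and the sample pairs below are hand-written and hypothetical. The numeric ranges are those that ICU's `ubrk.h` assigns to the word-break status tags: letter statuses occupy 200 to 299, kana 300 to 399, and ideographs 400 to 499. Whether the crawler compares exact tag values or whole ranges is not stated above; this sketch uses ranges.

```python
from collections import Counter

# Status ranges from ICU's ubrk.h (each tag covers [TAG, TAG_LIMIT)).
UBRK_WORD_LETTER = 200      # start of the letter range
UBRK_WORD_IDEO_LIMIT = 500  # exclusive end of the ideograph range


def is_counted(status):
    """True for letter, kana, and ideograph tokens.

    The three ranges are contiguous (200-299, 300-399, 400-499), so one
    comparison covers them. Whitespace and punctuation (0-99) and numbers
    (100-199) fall outside and are not counted.
    """
    return UBRK_WORD_LETTER <= status < UBRK_WORD_IDEO_LIMIT


def count_tokens(segments):
    """Count occurrences of each token whose break status qualifies."""
    counts = Counter()
    for token, status in segments:
        if is_counted(status):
            counts[token] += 1
    return counts


# Hypothetical segmenter output, written by hand for illustration.
sample = [
    ("Allegra", 200), (",", 0), (" ", 0), ("2024", 100),
    (" ", 0), ("Allegra", 200), ("カタカナ", 300), ("語", 400),
]
token_counts = count_tokens(sample)
total_tokens = sum(token_counts.values())
```

With real text, the pairs would come from an ICU binding (for example PyICU, which is an assumption here and not something the project is stated to use).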

² Crawl is still in progress; the final number will be larger.

Running the Crawler

./corpuscrawler --language=rm --output=./corpus
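To crawl several languages in one go, the command above can be driven from a short script. The sketch below is an assumption-laden convenience wrapper, not part of Corpus Crawler: `build_command` and `crawl_all` are hypothetical names, and only the `--language` and `--output` flags shown above are used. Languages are crawled one after another, in keeping with the crawler's intentionally slow pace.

```python
import subprocess


def build_command(language, output_dir, executable="./corpuscrawler"):
    """Build the argument list for one crawl, using only the flags shown above."""
    return [executable, f"--language={language}", f"--output={output_dir}"]


def crawl_all(languages, output_dir, runner=subprocess.run):
    """Run the crawler once per language code, sequentially.

    `runner` defaults to subprocess.run; passing a different callable
    makes it possible to do a dry run without starting any process.
    """
    for language in languages:
        # check=True makes subprocess.run raise if the crawler exits non-zero.
        runner(build_command(language, output_dir), check=True)


# Example (not executed here): crawl_all(["gv", "fo"], "./corpus")
```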