optimaize/language-detector

Source of language corpus

Opened this issue · 0 comments

Where is the source text dataset used to build the N-grams for those 70 languages? I'd like to see whether it differs from wooorm/franc#78's use of the UDHR, and whether it is more accurate than that.

"There are two kinds of profiles. The standard ones created from Wikipedia articles and similar. And the "short text" profiles created from Twitter tweets."