jacerong/language-identification

This repository presents an approach for predicting the language in which a document is written. The approach transforms a text into character n-gram features and feeds them to a machine-learned classifier. Experimental results show that it identifies 14 languages with high accuracy and that it outperforms some of the most popular language identification libraries in the Python ecosystem.
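To make the idea of character n-gram features concrete, here is a minimal sketch that uses only the Python standard library. It is not the repository's implementation: the description does not say which classifier or n-gram range is used, so the multinomial Naive Bayes model, the 1-to-3 character range, and the names `char_ngrams` and `NgramNaiveBayes` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return every character n-gram of length n_min..n_max in `text`.

    The text is lowercased and padded with one space on each side so that
    word-initial and word-final characters produce distinct n-grams.
    The 1..3 range is an illustrative assumption, not the repository's setting.
    """
    padded = f" {text.lower().strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def fit(self, texts, labels):
        # Per-language n-gram counts and per-language document counts.
        self.counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        # Shared vocabulary size is needed for add-one (Laplace) smoothing.
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        grams = char_ngrams(text)
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, label_counts in self.counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / n_docs)
            denom = self.totals[label] + vocab_size
            for g in grams:
                score += math.log((label_counts[g] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

To use the sketch, call `NgramNaiveBayes().fit(texts, labels)` with a list of training sentences and their language codes, then call `predict` on new text. Character n-grams suit this task because short letter sequences are strongly language-specific and need no tokenizer or dictionary.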

Jupyter Notebook · MIT License
