language-identification

This repository presents an approach to predicting the language in which a document is written. The approach transforms a text into character n-gram features and feeds them to a machine-learned classifier. Experimental results show that it identifies 14 languages with high accuracy and that it outperforms some of the most popular language identification libraries in the Python ecosystem.
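The idea of character n-gram features feeding a classifier can be illustrated with a minimal, standard-library-only sketch. It is not the repository's implementation: the multinomial Naive Bayes classifier, the n-gram range of 1 to 3, the names `char_ngrams` and `NaiveBayesLanguageID`, and the toy three-language training set are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max (hypothetical helper)."""
    # Lowercase and pad with spaces so word boundaries become features too.
    padded = f" {text.lower().strip()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


class NaiveBayesLanguageID:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing strength

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # documents per language
        self.gram_counts = {}               # language -> n-gram counts
        for text, label in zip(texts, labels):
            self.gram_counts.setdefault(label, Counter()).update(char_ngrams(text))
        self.vocab = set().union(*self.gram_counts.values())
        self.totals = {lab: sum(c.values()) for lab, c in self.gram_counts.items()}
        return self

    def predict(self, text):
        feats = char_ngrams(text)
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.gram_counts.items():
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(self.doc_counts[label] / n_docs)
            denom = self.totals[label] + self.alpha * len(self.vocab)
            for gram, k in feats.items():
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += k * math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch (not the repository's dataset).
train_texts = [
    "the quick brown fox jumps over the lazy dog",
    "this is a simple sentence written in english",
    "el rápido zorro marrón salta sobre el perro perezoso",
    "esta es una frase sencilla escrita en español",
    "der schnelle braune fuchs springt über den faulen hund",
    "dies ist ein einfacher satz auf deutsch",
]
train_labels = ["en", "en", "es", "es", "de", "de"]

model = NaiveBayesLanguageID().fit(train_texts, train_labels)
print(model.predict("the lazy dog"))
```

Character n-grams need no tokenizer or dictionary, which is one reason they are a common choice for language identification. A real system would train on far more text per language than this toy set.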

Primary language: Jupyter Notebook. License: MIT.
