language-identifier

Identifies the language of a given text file.

Implementation of this paper:

Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (1994): 161-175.

Dataset used for experiments:

https://github.com/xprogramer/DLI32-corpus

The above link provides two folders, DLI32 and DLI32-2. DLI32 is used as the training set to produce a language profile for each language.
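As an illustration of what a language profile is in the Cavnar and Trenkle approach, here is a minimal sketch that ranks character n-grams by frequency. It is not this repository's code: the function name `ngram_profile`, the parameters `max_n` and `top_k`, and the single-underscore padding are assumptions made for the example (the paper pads tokens with blanks in a slightly different way).

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=50):
    """Return the top_k most frequent character n-grams, most frequent first.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    counts = Counter()
    # Keep runs of letters only; digits and punctuation act as separators.
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + token + "_"  # simplified padding, one marker per side
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip("_"):  # skip n-grams made only of padding
                    counts[gram] += 1
    # Sort by descending count; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


profile = ngram_profile("aaa", max_n=2, top_k=3)
```

In this sketch a profile is just a ranked list of n-grams; one such list per language, built from that language's training text, would make up the set of language profiles.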

When tested on DLI32-2, the identifier achieved an accuracy of 60% with language profiles of length 50.
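For context, the paper classifies a document by comparing its n-gram profile against each language profile with an "out-of-place" rank distance and picking the closest language. The sketch below shows that idea under stated assumptions: profiles are plain ranked lists of n-grams, and the names `out_of_place` and `identify` are hypothetical, not taken from this repository.

```python
def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document and a language profile.

    N-grams missing from the language profile get a fixed maximum penalty
    (here, the length of the language profile; this choice is an assumption).
    """
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(doc_profile, lang_profiles):
    """Return the language whose profile has the smallest distance."""
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )


# Toy profiles, invented purely for illustration.
toy_profiles = {"en": ["e", "t", "a"], "xx": ["z", "q", "e"]}
toy_doc = ["e", "a", "t"]
best = identify(toy_doc, toy_profiles)
```

Short profiles such as length 50 keep the comparison cheap, but they also leave fewer n-grams to tell similar languages apart.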

For training:

This generates the pickle file lprofiles.pkl, which contains the language profiles for the 32 languages.
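A minimal sketch of writing and reading such a pickle file with the standard library follows. The internal layout of lprofiles.pkl is not documented here, so the dictionary shape (language name mapped to a ranked n-gram list) and the helper names `save_profiles` and `load_profiles` are assumptions. The demo writes a toy file to a temporary directory rather than touching a real profile file.

```python
import os
import pickle
import tempfile


def save_profiles(profiles, path):
    """Serialize a dict of language profiles to disk."""
    with open(path, "wb") as fh:
        pickle.dump(profiles, fh)


def load_profiles(path):
    """Load profiles back; only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Toy data with an assumed layout: {language: ranked list of n-grams}.
toy_profiles = {"english": ["e", "t", "th"], "french": ["e", "s", "le"]}
demo_path = os.path.join(tempfile.mkdtemp(), "lprofiles.pkl")
save_profiles(toy_profiles, demo_path)
restored = load_profiles(demo_path)
```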

Requirements:

- Python
- NLTK
