language-identifier
Identifies the language of given text file.
Implementation of this paper:
Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Ann arbor mi 48113.2 (1994): 161-175.
Dataset used for experiments:
The above link provides two folders DLI32 and DLI32-2. DLI32 used as train set to produce language profiles for languages.
When tested on DLI32-2, achieved an accuracy of 60% for language profiles of length 50.
Usage
For training:
-
Download the datasets. Move the dataset and the code into one folder.
-
To train, run
python identifier.py
This generates lprofiles.pkl
pickle file which contains the language profiles for 32 langauges.
- To test, run
python test.py
Requirements -Python -NLTK