A web app, created using Streamlit, for the task of language identification implemented using two different algorithms, one using a graph based n-gram approach,and the other, a character level lstm model. Languages identified: English, German, French, Dutch, Spanish, Italian
Dataset used for this task was: http://www.win.tue.nl/~mpechen/projects/smm/. A total of 7200 (1200 for each class) samples were used for training and 1200 (200 for each class) for testing
- An n-gram approach where n-grams (here, n=3) are nodes of a directed graph.
- Each vertex captures the frequency of the tri-gram for all langauges.
- Each edge captures the tri-gram order for all languages and assigns it a weight depending on the frequency.
- At inference time, text is broken into same order n-grams as the trained graph, and scores for the edges and vertices are added to the respective language score.
- The scores are normalized, each being in a range of [0,2] (Since, both edge and vertex scores are added, max score is 2).
- Language with the maximum score is the correct language for the text.
- Acheived an accuracy of 94 % on the test dataset.
- A character level Bi-lstm implemented in PyTorch, for sequences of 90 characters.
- An embedding size of 300 is used for the characters, with their being 61 total characters in our vocabulary.
- For cross validation, a 5 fold cross validation technique was used, with the accuracy for testing being 96 %.
A step by step series of examples that tell you how to get the app running. In your cmd:
git clone https://github.com/talha1503/Language-Identification.git
Then,
cd Language-Identification
pip install -r requirements.txt
streamlit run main.py