Language Identification


Overview

A web app, built with Streamlit, for language identification. It implements two different algorithms: a graph-based n-gram approach (LIGA) and a character-level LSTM model. Languages identified: English, German, French, Dutch, Spanish, and Italian.


Dataset

The dataset used for this task is available at http://www.win.tue.nl/~mpechen/projects/smm/. A total of 7,200 samples (1,200 per class) were used for training and 1,200 samples (200 per class) for testing.


Algorithms

a) LIGA:

  • An n-gram approach where n-grams (here, n=3) are nodes of a directed graph.
  • Each vertex captures the frequency of the tri-gram for all languages.
  • Each edge captures the order in which tri-grams follow one another for all languages, and carries a weight that depends on how often that transition occurs.
  • At inference time, the text is broken into n-grams of the same order as the trained graph, and the scores of the matching edges and vertices are added to the respective language score.
  • The scores are normalized, each lying in the range [0, 2] (since edge and vertex scores are added together, the maximum score is 2).
  • The language with the maximum score is taken as the predicted language for the text.
  • Achieved an accuracy of 94% on the test dataset.
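
The steps above can be illustrated with a minimal sketch. It is not the code used in this repository: the class name `LigaSketch`, the helper `char_ngrams`, and the normalization (vertex and edge scores are each divided by their sum over all languages, so a combined score stays within [0, 2]) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class LigaSketch:
    """Hypothetical, simplified graph-based n-gram language identifier."""

    def __init__(self, n=3):
        self.n = n
        self.vertices = defaultdict(Counter)  # n-gram -> {language: count}
        self.edges = defaultdict(Counter)     # (n-gram, next n-gram) -> {language: count}
        self.vertex_totals = Counter()        # language -> total vertex weight
        self.edge_totals = Counter()          # language -> total edge weight

    def train(self, text, lang):
        grams = char_ngrams(text.lower(), self.n)
        for gram in grams:
            self.vertices[gram][lang] += 1
            self.vertex_totals[lang] += 1
        for pair in zip(grams, grams[1:]):
            self.edges[pair][lang] += 1
            self.edge_totals[lang] += 1

    def score(self, text):
        grams = char_ngrams(text.lower(), self.n)
        vertex_scores, edge_scores = Counter(), Counter()
        for gram in grams:
            for lang, count in self.vertices.get(gram, {}).items():
                vertex_scores[lang] += count / self.vertex_totals[lang]
        for pair in zip(grams, grams[1:]):
            for lang, count in self.edges.get(pair, {}).items():
                edge_scores[lang] += count / self.edge_totals[lang]
        # Assumed normalization: each part sums to 1 across languages,
        # so the combined score per language lies in [0, 2].
        v_sum = sum(vertex_scores.values())
        e_sum = sum(edge_scores.values())
        return {
            lang: (vertex_scores[lang] / v_sum if v_sum else 0.0)
            + (edge_scores[lang] / e_sum if e_sum else 0.0)
            for lang in self.vertex_totals
        }

    def predict(self, text):
        scores = self.score(text)
        return max(scores, key=scores.get) if scores else None


model = LigaSketch()
model.train("the quick brown fox jumps over the lazy dog", "en")
model.train("der schnelle braune fuchs springt über den faulen hund", "de")
prediction = model.predict("the dog jumps")
```

With only two toy training sentences the graph is tiny; the accuracy reported above comes from the full training set, not from a sketch like this.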

b) LSTM:

  • A character-level BiLSTM implemented in PyTorch, for sequences of 90 characters.
  • An embedding size of 300 is used for the characters, with a total of 61 characters in the vocabulary.
  • 5-fold cross-validation was used during training, and the accuracy on the test set was 96%.
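
A model with these dimensions could be sketched in PyTorch as follows. The sequence length, embedding size, vocabulary size, and number of classes come from the description above; the class name `CharBiLSTM`, the hidden size of 128, the single LSTM layer, and the use of the final hidden states as features are assumptions, and the repository's actual architecture may differ.

```python
import torch
import torch.nn as nn

SEQ_LEN = 90      # characters per input sequence (from the description above)
VOCAB_SIZE = 61   # characters in the vocabulary (from the description above)
EMBED_DIM = 300   # character embedding size (from the description above)
NUM_CLASSES = 6   # English, German, French, Dutch, Spanish, Italian
HIDDEN_DIM = 128  # assumption: not stated in this README


class CharBiLSTM(nn.Module):
    """Hypothetical character-level bidirectional LSTM classifier."""

    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM,
                 hidden_dim=HIDDEN_DIM, num_classes=NUM_CLASSES):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, char_ids):
        embedded = self.embedding(char_ids)       # (batch, seq, embed)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (2, batch, hidden)
        # Concatenate the final forward and backward hidden states.
        features = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.classifier(features)          # (batch, num_classes)


model = CharBiLSTM()
batch = torch.randint(0, VOCAB_SIZE, (4, SEQ_LEN))  # 4 random dummy sequences
logits = model(batch)
```

The output has one row of six unnormalized scores per input sequence; taking the `argmax` over the last dimension would give the predicted language index.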


Installation

Follow these steps to get the app running. In your command prompt or terminal:

git clone https://github.com/talha1503/Language-Identification.git

Then,

cd Language-Identification
pip install -r requirements.txt
streamlit run main.py