GitHub: https://github.com/chaurasiauttkarsh/Language-Identification-n-grams-dnn
A dense neural network is trained on character-level n-grams to build a language identification model.
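The overall pipeline looks roughly like the sketch below. This is a minimal illustration, not the repository's exact code: it assumes character 1–3 gram counts from scikit-learn's CountVectorizer feeding a small Keras dense network (the actual n-gram range, layer sizes, and featurizer used in the repository may differ).

```python
# Minimal sketch of the approach: character n-gram counts -> dense network.
# The toy corpus, n-gram range, and layer sizes are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow import keras

texts = ["this is english text", "ceci est un texte francais"]  # toy corpus
labels = np.array([0, 1])                                       # integer class ids

# analyzer="char" counts overlapping character windows (here, 1- to 3-grams).
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts).toarray().astype("float32")

num_classes = 22  # size of the WiLI-2018 subset used here
model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, labels, epochs=3, verbose=0)
```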
WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs covering 235 languages, with 1,000 paragraphs per language. The following 22-language subset is used:
English, Arabic, French, Hindi, Urdu, Portuguese, Persian, Pushto, Spanish, Korean, Tamil, Turkish, Estonian, Russian, Romanian, Chinese, Swedish, Latin, Indonesian, Dutch, Japanese, Thai
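Loading and filtering the subset might look like the sketch below. The Data/dataset.csv path matches the repository layout described in the next section, but the column names ("Text", "language") are assumptions and may differ in the actual file.

```python
# Sketch of loading the 22-language subset from the repository's dataset.
# Column names "Text" and "language" are assumed, not confirmed.
import pandas as pd

languages = ["English", "Arabic", "French", "Hindi", "Urdu", "Portuguese",
             "Persian", "Pushto", "Spanish", "Korean", "Tamil", "Turkish",
             "Estonian", "Russian", "Romanian", "Chinese", "Swedish", "Latin",
             "Indonesian", "Dutch", "Japanese", "Thai"]

df = pd.read_csv("Data/dataset.csv")
df = df[df["language"].isin(languages)]   # keep only the 22-language subset
print(df["language"].value_counts())      # expect 1,000 paragraphs per language
```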
- Data: Contains the dataset used for language identification
  - dataset.csv
- Model: Contains the model and parameter files used for language identification
  - model.h5
  - parameters.sav
- PDF Google Colab Output: Contains the output obtained, in PDF format
  - Identification.pdf
  - Training Model.pdf
- ColabNotebook: Contains the output of training and identification; links to the Colab files are also present
  - Language_Identification.ipynb
  - Language_Identification_Prediction.ipynb
- predict-local.py: Used to make predictions on a local system (an environment should be created and the dependencies installed before running this file; a prediction sketch follows this list)
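A minimal sketch of what predict-local.py presumably does, under two assumptions: parameters.sav holds the pickled fitted featurizer (e.g., a CountVectorizer), and model.h5 is the trained Keras model. The exact contents of these files may differ.

```python
# Sketch of local prediction: load the saved featurizer and model, then classify.
# Assumes parameters.sav is a pickled vectorizer and model.h5 a Keras model.
import pickle
import numpy as np
from tensorflow import keras

model = keras.models.load_model("Model/model.h5")
with open("Model/parameters.sav", "rb") as f:
    vectorizer = pickle.load(f)

text = "Bonjour tout le monde"
X = vectorizer.transform([text]).toarray().astype("float32")
probs = model.predict(X)
print("predicted class id:", int(np.argmax(probs, axis=1)[0]))
```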
Required Dependencies:
- tensorflow 2.6
- pickle (Python standard library)
- numpy
- re (Python standard library)
- pandas
(It is recommended to run ColabNotebook/Language_Identification_Prediction.ipynb on Google Colab, where the required dependencies are already available.)
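For running predict-local.py locally, the third-party packages can be installed roughly as follows; the exact version pin is an assumption beyond the TensorFlow 2.6 noted above, and pickle and re need no installation:

```bash
pip install tensorflow==2.6.0 numpy pandas
```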
For syntactically similar languages, please use a larger corpus for language identification.