GitHub: https://github.com/chaurasiauttkarsh/Language-Identification-n-grams-dnn
A dense neural network is trained on character-level n-grams to build a language identification model.
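The overall pipeline looks roughly like the sketch below. This is a minimal illustration, not the repository's exact code: it assumes character 1–3 gram counts from scikit-learn's CountVectorizer feeding a small Keras dense network (the actual n-gram range, layer sizes, and featurizer used in the repository may differ).

```python
# Minimal sketch of the approach: character n-gram counts -> dense network.
# The toy corpus, n-gram range, and layer sizes are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow import keras

texts = ["this is english text", "ceci est un texte francais"]  # toy corpus
labels = np.array([0, 1])                                       # integer class ids

# analyzer="char" counts overlapping character windows (here, 1- to 3-grams).
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts).toarray().astype("float32")

num_classes = 22  # size of the WiLI-2018 subset used here
model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, labels, epochs=3, verbose=0)
```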
WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs covering 235 languages, with 1,000 paragraphs per language. The following 22-language subset is used:
English, Arabic, French, Hindi, Urdu, Portuguese, Persian, Pushto, Spanish, Korean, Tamil, Turkish, Estonian, Russian, Romanian, Chinese, Swedish, Latin, Indonesian, Dutch, Japanese, Thai
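Loading and filtering the subset might look like the sketch below. The Data/dataset.csv path matches the repository layout described in the next section, but the column names ("Text", "language") are assumptions and may differ in the actual file.

```python
# Sketch of loading the 22-language subset from the repository's dataset.
# Column names "Text" and "language" are assumed, not confirmed.
import pandas as pd

languages = ["English", "Arabic", "French", "Hindi", "Urdu", "Portuguese",
             "Persian", "Pushto", "Spanish", "Korean", "Tamil", "Turkish",
             "Estonian", "Russian", "Romanian", "Chinese", "Swedish", "Latin",
             "Indonesian", "Dutch", "Japanese", "Thai"]

df = pd.read_csv("Data/dataset.csv")
df = df[df["language"].isin(languages)]   # keep only the 22-language subset
print(df["language"].value_counts())      # expect 1,000 paragraphs per language
```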
- Data: Contains the dataset used for language identification
  - dataset.csv
- Model: Contains the model and parameter files used for language identification
  - model.h5
  - parameters.sav
- PDF Google Colab Output: Contains the output obtained, in PDF format
  - Identification.pdf
  - Training Model.pdf
- ColabNotebook: Contains the output of training and identification; links to the Colab files are also present
  - Language_Identification.ipynb
  - Language_Identification_Prediction.ipynb
- predict-local.py: Used to make predictions on a local system (an environment should be created and the dependencies installed before running this file; a prediction sketch follows this list)
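A minimal sketch of what predict-local.py presumably does, under two assumptions: parameters.sav holds the pickled fitted featurizer (e.g., a CountVectorizer), and model.h5 is the trained Keras model. The exact contents of these files may differ.

```python
# Sketch of local prediction: load the saved featurizer and model, then classify.
# Assumes parameters.sav is a pickled vectorizer and model.h5 a Keras model.
import pickle
import numpy as np
from tensorflow import keras

model = keras.models.load_model("Model/model.h5")
with open("Model/parameters.sav", "rb") as f:
    vectorizer = pickle.load(f)

text = "Bonjour tout le monde"
X = vectorizer.transform([text]).toarray().astype("float32")
probs = model.predict(X)
print("predicted class id:", int(np.argmax(probs, axis=1)[0]))
```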
Required Dependencies:
- tensorflow 2.6
- pickle (Python standard library)
- numpy
- re (Python standard library)
- pandas
(It is recommended to run ColabNotebook/Language_Identification_Prediction.ipynb on Google Colab, where the required dependencies are already available.)
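For running predict-local.py locally, the third-party packages can be installed roughly as follows; the exact version pin is an assumption beyond the TensorFlow 2.6 noted above, and pickle and re need no installation:

```bash
pip install tensorflow==2.6.0 numpy pandas
```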
For syntactically similar languages, please use a larger corpus for language identification.