Language-Identification-n-grams-dnn

Github: https://github.com/chaurasiauttkarsh/Language-Identification-n-grams-dnn

A dense neural network is trained on character-level n-grams to build a language identification model.
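
The idea can be illustrated with a minimal sketch: extract character n-grams as a bag-of-n-grams matrix and feed it to a small dense classifier. The n-gram range, layer sizes, and toy data below are illustrative assumptions, not the exact hyperparameters used in the notebooks:

```python
# Minimal sketch: character-level n-grams -> dense network.
# N-gram range and layer sizes are assumptions, not the notebook's values.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf

texts = ["this is english text", "ceci est un texte français"]  # toy data
labels = ["English", "French"]

# Bag of character 1- to 3-grams as the input representation.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts).toarray()

# Map language names to integer class ids.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(len(encoder.classes_), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
```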

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs covering 235 languages, with 1,000 paragraphs per language. The following subset of languages is used (a filtering sketch follows the list):

⦁ English ⦁ Arabic ⦁ French ⦁ Hindi ⦁ Urdu ⦁ Portuguese ⦁ Persian ⦁ Pushto ⦁ Spanish ⦁ Korean ⦁ Tamil ⦁ Turkish ⦁ Estonian ⦁ Russian ⦁ Romanian ⦁ Chinese ⦁ Swedish ⦁ Latin ⦁ Indonesian ⦁ Dutch ⦁ Japanese ⦁ Thai
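
Such a subset could be filtered from the full dataset along these lines. The column name `language` is an assumption; adjust to the actual schema of dataset.csv:

```python
# Hypothetical sketch: keep only the 22 languages listed above.
# Assumes dataset.csv has a "language" column naming each paragraph's language.
import pandas as pd

LANGUAGES = {
    "English", "Arabic", "French", "Hindi", "Urdu", "Portuguese", "Persian",
    "Pushto", "Spanish", "Korean", "Tamil", "Turkish", "Estonian", "Russian",
    "Romanian", "Chinese", "Swedish", "Latin", "Indonesian", "Dutch",
    "Japanese", "Thai",
}

df = pd.read_csv("Data/dataset.csv")
subset = df[df["language"].isin(LANGUAGES)]
print(subset["language"].value_counts())  # expect 1,000 rows per language
```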

  1. Data: Contains the dataset used for language identification
    -dataset.csv
  2. Model: Contains the model and parameter files used for language identification
    -model.h5
    -parameters.sav
  3. PDF Google Colab Output: Contains the notebook output in PDF format
    -Identification.pdf
    -Training Model.pdf
  4. ColabNotebook: Contains the training and identification notebooks with their output; links to the Colab files are also included
    -Language_Identification.ipynb
    -Language_Identification_Prediction.ipynb
  5. predict-local.py: used to make predictions on a local system (create an environment and install the dependencies below before running this file; a sketch of the prediction flow follows the dependency list)
    Required Dependencies:
  • tensorflow 2.6
  • numpy
  • pandas
  • pickle (Python standard library)
  • re (Python standard library)
    (It is recommended to run ColabNotebook/Language_Identification_Prediction.ipynb on Google Colab, where the latest dependencies are already taken care of.)
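
As a rough sketch, a local prediction run like predict-local.py might look as follows. The contents of parameters.sav (assumed here to be a pickled vectorizer/label-encoder pair) are an assumption; see the actual script for the structure it expects:

```python
# Hypothetical sketch of local prediction. Assumes parameters.sav holds a
# pickled (vectorizer, label_encoder) pair; check predict-local.py for the
# actual structure.
import pickle
import numpy as np
import tensorflow as tf

# Load the trained model and the saved preprocessing objects.
model = tf.keras.models.load_model("Model/model.h5")
with open("Model/parameters.sav", "rb") as f:
    vectorizer, encoder = pickle.load(f)

text = "bonjour tout le monde"
X = vectorizer.transform([text]).toarray()   # same n-gram features as training
probs = model.predict(X)
print(encoder.inverse_transform([np.argmax(probs)]))  # e.g. ['French']
```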

For syntactically similar languages, please use a larger corpus for language identification.