

A dense neural network is trained on character-level n-grams to build a language identification model.
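As a rough illustration of the character-level n-gram features mentioned above, the sketch below counts overlapping character n-grams in a string. The helper name `char_ngrams`, the lowercasing step, and the n-gram size are assumptions made for illustration. The notebooks in this repository may use different sizes and preprocessing.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in `text`.

    Hypothetical helper for illustration; not taken from the repository.
    """
    text = text.lower()  # assumed normalisation step
    # Slide a window of width n over the string, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


counts = char_ngrams("banana", n=2)
print(counts)
```

Counts like these can be turned into a fixed-length vector (for example, by keeping only the most frequent n-grams across the training data) before being fed to a dense network.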

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs covering 235 languages, with 1,000 paragraphs per language. The following subset of the dataset is used:

⦁ English ⦁ Arabic ⦁ French ⦁ Hindi ⦁ Urdu ⦁ Portuguese ⦁ Persian ⦁ Pushto ⦁ Spanish ⦁ Korean ⦁ Tamil ⦁ Turkish ⦁ Estonian ⦁ Russian ⦁ Romanian ⦁ Chinese ⦁ Swedish ⦁ Latin ⦁ Indonesian ⦁ Dutch ⦁ Japanese ⦁ Thai
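Selecting such a subset might look like the sketch below, which keeps only the paragraphs whose label is in a chosen set. The function name `filter_subset`, the three label codes, and the sample rows are assumptions for illustration. The actual label values depend on the files in the Data folder.

```python
# Assumed label codes for three of the languages listed above (illustrative only).
SELECTED = {"eng", "fra", "spa"}


def filter_subset(paragraphs, labels, selected=SELECTED):
    """Keep (paragraph, label) pairs whose label is in `selected`.

    Hypothetical helper; not taken from the repository.
    """
    return [(p, l) for p, l in zip(paragraphs, labels) if l in selected]


# Made-up sample rows, used only to demonstrate the filter.
sample_paragraphs = ["Hello world", "Bonjour le monde", "Hallo Welt"]
sample_labels = ["eng", "fra", "deu"]

subset = filter_subset(sample_paragraphs, sample_labels)
print(subset)
```

With 22 languages at 1,000 paragraphs each, the subset would hold 22,000 paragraphs if every paragraph is kept.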

  1. Data: Contains the dataset used for language identification.
  2. Model: Contains the model and parameter files used for language identification.
  3. PDF Google Colab Output: Contains the notebook output in PDF format.
    - Training Model.pdf
  4. ColabNotebook: Contains the training and identification notebooks with their output. A link to the Colab files is also included.
  5. This file is used to make predictions on a local system (an environment must be created and the dependencies installed before running it).
    Required dependencies:
      • tensorflow r2.6
      • pickle (Python standard library)
      • numpy
      • re (Python standard library)
      • pandas
    (Running ColabNotebook/Language_Identification_Prediction.ipynb on Google Colab is recommended, since the latest dependencies are already available there.)
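
Before running predictions locally, a quick check like the sketch below can report which of the listed dependencies are missing from the current environment. The helper name `missing_modules` is an assumption for illustration, and the check only confirms that a module can be found, not that its version matches.

```python
import importlib.util

# Module names taken from the dependency list above.
REQUIRED = ["tensorflow", "numpy", "pandas", "pickle", "re"]


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found in this environment.

    Hypothetical helper; not part of the repository.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules())
```

An empty list means all listed modules can be imported. Anything reported as missing needs to be installed first.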

Please use a larger corpus when identifying syntactically similar languages.