Natural Language Processing For Text Language Identification

Summary

In this project, we create an Artificial Neural Network (ANN) that takes text as input and identifies the language it is written in. We implement it in Python, using open-source libraries to preprocess the data and to build, train, test, and deploy the model.

Project Report

Introduction

Natural language refers to how humans communicate with one another, namely through speech and text. The task of identifying the language of a text appears frequently in web applications. Users want websites that are relevant to them, and search engines want to help users find those websites. Content that a user cannot easily understand is deemed less relevant. Knowing the source language is also critical for machine translation, sentiment analysis, and text summarization algorithms.

There are several approaches to developing a program that can recognize the language of a text document or an audio recording, but machine learning appears to be among the most efficient and accurate.

The Goal of the Study

The project's goal is to create a model that takes any text as input and returns the language it is written in.
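
The text-in, language-out interface described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual model: it uses a single-layer softmax network over hashed character bigrams, written with the standard library only. The class name TinyLanguageIdentifier, the constant N_FEATURES, the hyperparameters, and the six toy training sentences are all hypothetical and are not taken from the dataset.

```python
import math
import zlib
from collections import Counter

N_FEATURES = 512  # assumed number of hash buckets for character n-grams


def featurize(text, n=2):
    """Return a sparse, L2-normalized vector {bucket: weight} of character n-grams."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    # crc32 is used because Python's built-in hash() of str changes between runs.
    counts = Counter(zlib.crc32(g.encode("utf-8")) % N_FEATURES for g in grams)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {bucket: c / norm for bucket, c in counts.items()}


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyLanguageIdentifier:
    """Single-layer softmax network: one weight vector per language."""

    def __init__(self, languages):
        self.languages = sorted(set(languages))
        self.weights = [[0.0] * N_FEATURES for _ in self.languages]

    def predict_proba(self, text):
        feats = featurize(text)
        scores = [sum(w[b] * v for b, v in feats.items()) for w in self.weights]
        return softmax(scores)

    def fit(self, texts, labels, epochs=100, lr=0.5):
        # Plain stochastic gradient descent on the cross-entropy loss.
        for _ in range(epochs):
            for text, label in zip(texts, labels):
                feats = featurize(text)
                probs = self.predict_proba(text)
                for k, lang in enumerate(self.languages):
                    # Gradient of the loss with respect to the score of class k.
                    grad = probs[k] - (1.0 if lang == label else 0.0)
                    for b, v in feats.items():
                        self.weights[k][b] -= lr * grad * v
        return self

    def predict(self, text):
        probs = self.predict_proba(text)
        return self.languages[probs.index(max(probs))]


# Toy sentences for illustration only; the real model trains on the dataset below.
TRAIN = [
    ("the quick brown fox jumps over the lazy dog", "English"),
    ("she is reading a book in the garden", "English"),
    ("el rápido zorro marrón salta sobre el perro perezoso", "Spanish"),
    ("ella está leyendo un libro en el jardín", "Spanish"),
    ("le renard brun rapide saute par-dessus le chien paresseux", "French"),
    ("elle lit un livre dans le jardin", "French"),
]

texts, labels = zip(*TRAIN)
model = TinyLanguageIdentifier(labels).fit(texts, labels)
print(model.predict("the dog is in the garden"))
```

Character n-grams are a common choice of input for this task because they need no language-specific tokenizer. A deeper network or a library such as Keras or scikit-learn would follow the same pattern: turn text into a numeric vector, then map that vector to one class per language.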

Dataset

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages, with 1,000 paragraphs per language. We use a subset of 22 languages selected from the original dataset:

  • English
  • Arabic
  • French
  • Hindi
  • Urdu
  • Portuguese
  • Persian
  • Pushto
  • Spanish
  • Korean
  • Tamil
  • Turkish
  • Estonian
  • Russian
  • Romanian
  • Chinese
  • Swedish
  • Latin
  • Indonesian
  • Dutch
  • Japanese
  • Thai
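
As a rough sketch of how this subset might be prepared for training, the snippet below maps the 22 language names to integer class ids and splits the rows into training and test sets while keeping every language equally represented. The 80/20 split, the function name stratified_split, and the placeholder rows are assumptions made for illustration; the source does not state how the data was actually split.

```python
import random
from collections import defaultdict

LANGUAGES = [
    "English", "Arabic", "French", "Hindi", "Urdu", "Portuguese", "Persian",
    "Pushto", "Spanish", "Korean", "Tamil", "Turkish", "Estonian", "Russian",
    "Romanian", "Chinese", "Swedish", "Latin", "Indonesian", "Dutch",
    "Japanese", "Thai",
]

# Integer class ids, since neural network libraries generally expect numeric targets.
LABEL_TO_ID = {lang: i for i, lang in enumerate(sorted(LANGUAGES))}


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split (text, language) rows so each language keeps the same test share."""
    by_language = defaultdict(list)
    for text, lang in rows:
        if lang in LABEL_TO_ID:  # drop languages outside the 22-language subset
            by_language[lang].append((text, lang))
    rng = random.Random(seed)  # seeded so the split is reproducible
    train, test = [], []
    for lang in sorted(by_language):
        group = by_language[lang]
        rng.shuffle(group)
        cut = int(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


# Placeholder rows standing in for the real paragraphs (10 per language here).
rows = [(f"sample {i} in {lang}", lang) for lang in LANGUAGES for i in range(10)]
rows.append(("ein Beispiel", "German"))  # not in the subset, so it is filtered out
train, test = stratified_split(rows)
print(len(train), len(test))
```

With the full subset of 1,000 paragraphs per language and the assumed test_fraction of 0.2, the same function would keep 800 paragraphs per language for training and 200 for testing.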

Audio Language Identification Model

NLP-Audio-Language-Identification