WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages, with 1,000 paragraphs per language.
After data selection and preprocessing, I kept 22 selected languages from the original dataset, which include the following:
- English
- Arabic
- French
- Hindi
- Urdu
- Portuguese
- Persian
- Pushto
- Spanish
- Korean
- Tamil
- Turkish
- Estonian
- Russian
- Romanian
- Chinese
- Swedish
- Latin
- German
- Dutch
- Japanese
- Thai
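The selection step above can be sketched as follows. This is a minimal sketch, assuming the WiLI-2018 split is stored as two parallel text files (one paragraph per line in `x_train.txt`, one label per line in `y_train.txt`) and that the labels are ISO 639-3 codes; the actual file names and label codes in the Data folder may differ.

```python
# Sketch: filter the WiLI-2018 training split down to the 22 chosen languages.
# File names and label codes are assumptions; adjust to the actual Data folder.
KEEP = {
    "eng", "ara", "fra", "hin", "urd", "por", "fas", "pus", "spa", "kor",
    "tam", "tur", "est", "rus", "ron", "zho", "swe", "lat", "deu", "nld",
    "jpn", "tha",
}

def load_subset(x_path="x_train.txt", y_path="y_train.txt"):
    """Return (paragraph, label) pairs whose label is one of the 22 languages."""
    with open(x_path, encoding="utf-8") as xf, open(y_path, encoding="utf-8") as yf:
        pairs = [(x.strip(), y.strip()) for x, y in zip(xf, yf)]
    return [(text, label) for text, label in pairs if label in KEEP]
```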
This is a Language Identification tool deployed with Flask. It is a simple language prediction tool, so don't mind if it occasionally gives wrong results; it works quite well but still has a long way to go. The project evaluates the results of different models built on three different algorithms.
The project has two .sav model files:
- unigram_model.sav — a unigram feature model with logistic regression as the classification algorithm.
- unigram_model_rfc.sav — the same unigram features with a random forest classifier.
The Web folder contains the main code to run the server. You can run it with the following command:
python3 main.py
The requirements are listed in requirements.txt. LI is the virtual environment you can use to set up the project. The Data folder contains the dataset used in the project. You may see the project demo here
- Fork the GitHub repo to create a copy in your account.
- Clone the repo:
git clone https://github.com/honeybhardwaj/Language_Identification.git
- Activate the virtual environment
source LI/bin/activate
- Install Dependencies
pip3 install -r requirements.txt
- Run the server by going into the directory:
cd Web
python3 main.py
Congratulations! Everything is set up for development. Go ahead and contribute, and contact me if you have any doubts. Please open an issue before contributing.