WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages, with 1,000 paragraphs per language.
After data selection and preprocessing, I kept 22 selected languages from the original dataset, which include the following:
- English
- Arabic
- French
- Hindi
- Urdu
- Portuguese
- Persian
- Pushto
- Spanish
- Korean
- Tamil
- Turkish
- Estonian
- Russian
- Romanian
- Chinese
- Swedish
- Latin
- German
- Dutch
- Japanese
- Thai
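The selection step above can be sketched as follows. This is a minimal sketch, assuming the WiLI-2018 split is stored as two parallel text files (one paragraph per line in `x_train.txt`, one label per line in `y_train.txt`) and that the labels are ISO 639-3 codes; the actual file names and label codes in the Data folder may differ.

```python
# Sketch: filter the WiLI-2018 training split down to the 22 chosen languages.
# File names and label codes are assumptions; adjust to the actual Data folder.
KEEP = {
    "eng", "ara", "fra", "hin", "urd", "por", "fas", "pus", "spa", "kor",
    "tam", "tur", "est", "rus", "ron", "zho", "swe", "lat", "deu", "nld",
    "jpn", "tha",
}

def load_subset(x_path="x_train.txt", y_path="y_train.txt"):
    """Return (paragraph, label) pairs whose label is one of the 22 languages."""
    with open(x_path, encoding="utf-8") as xf, open(y_path, encoding="utf-8") as yf:
        pairs = [(x.strip(), y.strip()) for x, y in zip(xf, yf)]
    return [(text, label) for text, label in pairs if label in KEEP]
```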
This is a Language Identification tool deployed with Flask. It is a simple language prediction tool, so don't mind if it occasionally gives wrong results; it works quite well but still has a long way to go. The project evaluates the results of different models built on three different algorithms.
The project has two .sav model files:
- unigram_model.sav — a unigram feature model with logistic regression as the classification algorithm.
- unigram_model_rfc.sav — the same unigram features with a random forest classifier.
The Web folder contains the main code to run the server. You can run it with the following command:
python3 main.py
The requirements are listed in requirements.txt. LI is the virtual environment you can use to set up the project. The Data folder contains the dataset used in the project. You may see the project demo here
- Fork the GitHub repo to create a copy in your account.
- Clone the repo:
git clone https://github.com/honeybhardwaj/Language_Identification.git
- Activate the virtual environment
source LI/bin/activate
- Install Dependencies
pip3 install -r requirements.txt
- Run the server by going into the directory:
cd Web
python3 main.py
Congratulations! Everything is set up for development. Go ahead and contribute, and contact me if you have any doubts. Please open an issue before contributing.