A dataset of 190 000 sentences across 10 languages, intended primarily for language detection and for benchmarking NLP algorithms. This repository contains the dataset and the code for processing it.
This dataset is meant for researchers who want to build automatic language detection systems with machine learning techniques.
It is also intended for evaluating the general effectiveness of newly developed natural language processing techniques. We hope this well-organized benchmark proves useful in your research.
Ling10 is released in three variants:
- Ling10-trainlarge: Contains 140 000 sentences (14 000 per language class) for training and 50 000 sentences (5000 per language class) for testing
- Ling10-trainmedium: Contains 95 000 sentences (9500 per language class) for training and 95 000 sentences (9500 per language class) for testing
- Ling10-trainsmall: Contains 20 000 sentences (2000 per language class) for training and 170 000 sentences (17 000 per language class) for testing. The small training set makes this variant much more challenging
Each variant contains the following files:
- A train file: sentences for training, with integer labels representing the language classes
- A test file: sentences for testing, with integer labels representing the language classes

Both the train and test files are organized as sentence-label pairs, with the tab character "\t" separating them.

The 10 language classes are:
- English
- French
- Russian
- Chinese Mandarin
- Hebrew
- Portuguese
- Polish
- Dutch
- Japanese
- Italian
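Because each line is a sentence and its integer label separated by a tab, the files can be loaded with a few lines of Python. The sketch below is illustrative: the file names in the release may differ, so the demo parses a small sample written to a temporary file rather than an actual dataset file.

```python
import tempfile

def load_pairs(path):
    """Read sentence<TAB>label lines into parallel lists of sentences and int labels."""
    sentences, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the last tab, in case a sentence itself contains one.
            sentence, label = line.rsplit("\t", 1)
            sentences.append(sentence)
            labels.append(int(label))
    return sentences, labels

# Demo: a tiny two-line sample in the same sentence<TAB>label format.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("This is an English sentence.\t0\n")
    f.write("Ceci est une phrase française.\t1\n")
    sample_path = f.name

sentences, labels = load_pairs(sample_path)
print(sentences, labels)
```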
Check the Releases page to download the datasets.
Check our example classification script in Keras: Keras Language Classification
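Before reaching for a neural model, a simple character-bigram baseline illustrates the language detection task. This is only a sketch, not the Keras script above: it trains on toy in-memory sentence-label pairs (0 = English, 1 = French) standing in for the real dataset.

```python
from collections import Counter

def char_profile(text, n=2):
    """Count overlapping character bigrams as a crude language fingerprint."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train(samples):
    """Build one aggregate bigram profile per label from (sentence, label) pairs."""
    profiles = {}
    for sentence, label in samples:
        profiles.setdefault(label, Counter()).update(char_profile(sentence))
    return profiles

def predict(profiles, sentence):
    """Return the label whose normalized profile best overlaps the sentence's bigrams."""
    probe = char_profile(sentence)
    def overlap(label):
        profile = profiles[label]
        total = sum(profile.values())
        return sum(count * profile[gram] for gram, count in probe.items()) / total
    return max(profiles, key=overlap)

# Toy stand-in data; with the real dataset, samples would come from the train file.
samples = [
    ("the quick brown fox jumps over the lazy dog", 0),
    ("le renard brun saute par dessus le chien paresseux", 1),
]
profiles = train(samples)
print(predict(profiles, "the dog runs over the hill"))
```

A neural model as in the linked script will generalize far better; this baseline only shows the shape of the problem.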
You can reach us via the contacts below:
John Olafenwa
Website: https://john.specpal.science
Twitter: @johnolafenwa
Medium: @johnolafenwa
Facebook: olafenwajohn
Moses Olafenwa
Website: https://moses.specpal.science
Twitter: @OlafenwaMoses
Medium: @guymodscientist
Facebook: moses.olafenwa