Primary language: Python. License: MIT.

Ling10

A dataset of 190 000 sentences categorized into 10 languages, primarily for language detection and for benchmarking NLP algorithms. This repository contains the dataset and the code for processing it.


Purpose


This dataset is meant for researchers who use machine learning techniques to build automatic language detection algorithms.
It is also intended for evaluating the general effectiveness of newly developed natural language processing techniques. The benchmark is well organized, and we hope you find it useful in your research.

Structure


Ling10 is released in three variants:
  • Ling10-trainlarge: Contains 140 000 sentences (14 000 per language class) for training and 50 000 sentences (5 000 per language class) for testing.
  • Ling10-trainmedium: Contains 95 000 sentences (9 500 per language class) for training and 95 000 sentences (9 500 per language class) for testing.
  • Ling10-trainsmall: Contains 20 000 sentences (2 000 per language class) for training and 170 000 sentences (17 000 per language class) for testing. This is a much more challenging dataset, since far less training data is available.
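As a quick sanity check on the figures above, the per-class counts can be multiplied out to confirm that every variant covers all 190 000 sentences. This is only an illustrative sketch; the names `VARIANTS` and `split_sizes` are hypothetical and not part of the repository.

```python
# Per-class (train, test) counts as listed above; the dict itself is illustrative.
NUM_LANGUAGES = 10
VARIANTS = {
    "Ling10-trainlarge": (14_000, 5_000),
    "Ling10-trainmedium": (9_500, 9_500),
    "Ling10-trainsmall": (2_000, 17_000),
}


def split_sizes(train_per_class, test_per_class, num_classes=NUM_LANGUAGES):
    """Return the total (train, test) sentence counts for one variant."""
    return train_per_class * num_classes, test_per_class * num_classes


for name, (train_pc, test_pc) in VARIANTS.items():
    train_total, test_total = split_sizes(train_pc, test_pc)
    print(f"{name}: train={train_total}, test={test_total}, "
          f"total={train_total + test_total}")
```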

Each variant contains the following files:

train_set.txt

Contains sentences for training and integer labels representing the language classes

test_set.txt

Contains sentences for testing and integer labels representing the language classes

Both the train and test files are organized as sentence-label pairs, with the tab character "\t" separating the sentence from its label.
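A minimal sketch of reading these sentence-label pairs is shown here. It assumes one pair per line with the sentence first and the integer label as the last tab-separated field; the helper name `read_pairs` and the two sample lines (including their label values) are made up for illustration and are not taken from the dataset.

```python
import io


def read_pairs(lines):
    """Yield (sentence, label) tuples from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so a tab inside the sentence would not break parsing.
        sentence, label = line.rsplit("\t", 1)
        yield sentence, int(label)


# In-memory stand-in for a file; the labels 0 and 1 are placeholders.
sample = io.StringIO("Hello there.\t0\nBonjour tout le monde.\t1\n")
pairs = list(read_pairs(sample))
print(pairs)
```

With the real data, the same function could be given an open file handle, for example one obtained from `open("train_set.txt", encoding="utf-8")`, assuming the files are UTF-8 encoded.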

chars.json

A single JSON file containing two arrays: "char_to_idx", which maps characters to integers, and "idx_to_char", which maps integers back to characters
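The two mappings can be used to turn a sentence into a sequence of integer indices and back. The sketch below assumes both entries are stored as JSON objects (so the keys of "idx_to_char" arrive as strings); the actual layout of chars.json may differ. The tiny three-character vocabulary and the helpers `encode` and `decode` are hypothetical.

```python
import json

# Stand-in for the contents of chars.json; the real file covers many more characters.
sample_json = (
    '{"char_to_idx": {"a": 0, "b": 1, "c": 2},'
    ' "idx_to_char": {"0": "a", "1": "b", "2": "c"}}'
)
vocab = json.loads(sample_json)
char_to_idx = vocab["char_to_idx"]
# JSON object keys are always strings, so convert them back to integers.
idx_to_char = {int(k): v for k, v in vocab["idx_to_char"].items()}


def encode(sentence, char_to_idx, unknown=-1):
    """Map each character to its index; unseen characters get `unknown`."""
    return [char_to_idx.get(ch, unknown) for ch in sentence]


def decode(indices, idx_to_char, unknown="?"):
    """Map indices back to characters; unseen indices become `unknown`."""
    return "".join(idx_to_char.get(i, unknown) for i in indices)


encoded = encode("cab", char_to_idx)
print(encoded, decode(encoded, idx_to_char))
```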

languagemap.json

A JSON file mapping integer labels to the languages they represent
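A predicted integer label can be turned back into a language name with this mapping. The sketch below assumes the file is a JSON object keyed by the label as a string; the two entries shown and their label numbers are placeholders, not the actual assignment used in languagemap.json, and `language_name` is a hypothetical helper.

```python
import json

# Stand-in for languagemap.json; the real label-to-language assignment may differ.
sample_map = '{"0": "English", "1": "French"}'
label_to_language = {int(k): v for k, v in json.loads(sample_map).items()}


def language_name(label, mapping, default="unknown"):
    """Return the language for an integer label, or `default` if it is missing."""
    return mapping.get(label, default)


print(language_name(1, label_to_language))
```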

Source

All sentences in this dataset were extracted from the language translation files published by ManyThings.org.

Included Languages

  • English
  • French
  • Russian
  • Mandarin Chinese
  • Hebrew
  • Portuguese
  • Polish
  • Dutch
  • Japanese
  • Italian

Check the release to download the datasets.

Sample Code

See our example classification script in Keras:
Keras Language Classification

You can reach us via the contacts below:



John Olafenwa
Website: https://john.specpal.science
Twitter: @johnolafenwa
Medium: @johnolafenwa
Facebook: olafenwajohn


Moses Olafenwa
Website: https://moses.specpal.science
Twitter: @OlafenwaMoses
Medium: @guymodscientist
Facebook: moses.olafenwa