(ILI) Indo-Aryan Language Identification Shared Task Dataset

This repository contains the dataset used for the Indo-Aryan Language Identification (ILI) Shared Task, part of the Evaluation Campaign at the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), held at COLING 2018.

The task was to identify five closely related languages of the Indo-Aryan language family: Hindi (also known as Khari Boli), Braj Bhasha, Awadhi, Bhojpuri and Magahi. These languages form part of a continuum running from Western Uttar Pradesh (Hindi and Braj Bhasha) through Eastern Uttar Pradesh (Awadhi and Bhojpuri) to the neighbouring eastern state of Bihar (Bhojpuri and Magahi). Participants were provided with a dataset of approximately 15,000 sentences in each language, drawn mainly from literature published on the web as well as in print. It is the first dataset to be made available for these languages (except Hindi). Beyond automatic language identification and the development of NLP applications, it should also help in gaining insights into how close these languages are to one another; they are hypothesised to form part of a continuum and are often mistaken for varieties of Hindi, especially outside scholarly linguistic circles.

Dataset Format

The directory 'dataset' contains 3 files: the ones used for training, development / validation and testing in the shared task. Each file holds one sentence per line, with the language ID separated from the sentence by a TAB. The language IDs are the ISO codes of the respective languages and should be read as follows:

AWA = Awadhi

BRA = Braj

BHO = Bhojpuri

MAG = Magahi

HIN = Hindi
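
The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names (`parse_lines`, `load_file`, `label_counts`) are made up for illustration, and the file names inside 'dataset' are not stated in this README, so the path has to be supplied by the user. Since the README does not say on which side of the TAB the language ID sits, the sketch treats whichever outer field matches a known code as the label.

```python
from collections import Counter

# Language IDs as listed in this README.
LANGUAGES = {
    "AWA": "Awadhi",
    "BRA": "Braj",
    "BHO": "Bhojpuri",
    "MAG": "Magahi",
    "HIN": "Hindi",
}


def parse_lines(lines):
    """Yield (sentence, label) pairs from TAB-separated lines.

    Assumption: every non-empty line holds one sentence and one language ID.
    The first or last TAB-separated field that matches a known code is taken
    as the label; everything else on the line is the sentence.
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split("\t")]
        if fields[-1] in LANGUAGES:
            label, rest = fields[-1], fields[:-1]
        elif fields[0] in LANGUAGES:
            label, rest = fields[0], fields[1:]
        else:
            raise ValueError(f"line {number}: no known language ID found")
        yield "\t".join(rest), label


def load_file(path):
    """Read one dataset file (UTF-8) into a list of (sentence, label) pairs."""
    with open(path, encoding="utf-8") as handle:
        return list(parse_lines(handle))


def label_counts(pairs):
    """Count how many sentences carry each language ID."""
    return Counter(label for _, label in pairs)
```

Calling `load_file` with the path to one of the three files and passing the result to `label_counts` gives a quick check that each language has roughly the expected number of sentences.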

The VarDial Task Page

Task Organizers: Ritesh Kumar (Bhim Rao Ambedkar University, India), Bornini Lahiri (Jadavpur University, India), and Mayank Jain (Jawaharlal Nehru University, India)

The proceedings of the workshop and the system description papers are available here

The results of all the tasks of the campaign are available here

For any queries / comments, please write to linguistics.kmi@gmail.com