NeuralSpace Examples

This repo contains datasets, scripts, and other resources that will help you understand our APIs better and faster.

Datasets

Natural Language Understanding (NLU)

We formatted various openly available NLU datasets so that they can be directly used on our platform. These datasets can be found in the datasets/nlu folder. There are different sub-folders for different languages.
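As a minimal sketch of how that folder layout could be explored, the snippet below groups the entries of each language sub-folder. It assumes the script runs from the repo root and that every directory directly under datasets/nlu is a language sub-folder; the function name `list_datasets_by_language` is hypothetical and is not part of any NeuralSpace API.

```python
from pathlib import Path
from typing import Dict, List


def list_datasets_by_language(root: str = "datasets/nlu") -> Dict[str, List[str]]:
    """Map each language sub-folder name to the sorted names of its entries.

    Assumption: every directory directly under `root` is a language sub-folder.
    Returns an empty dict if `root` does not exist.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return {}

    datasets: Dict[str, List[str]] = {}
    for language_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        # Collect every file or folder stored for this language.
        datasets[language_dir.name] = sorted(
            child.name for child in language_dir.iterdir()
        )
    return datasets
```

Calling `list_datasets_by_language()` from the repo root returns a dictionary keyed by sub-folder name, which can then be printed or filtered by language.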

The table below describes each dataset.
For the citations of these datasets, see the citation information.

Dataset Name	Languages	License	Description
Hard	Arabic		93700 hotel reviews from booking.com
Miam	English, French, German, Italian, Spanish	CC BY-SA 4.0	Covers a variety of domains including spontaneous speech, scripted scenarios, and joint task completion
Ask Ubuntu	English	CC BY-SA 3.0	162 questions and answers from https://askubuntu.com.
Chatbot Corpus	English	CC BY-SA 3.0	206 questions from a Telegram chatbot for public transport in Munich
Web Application Corpus	English	CC BY-SA 3.0	89 questions and answers from https://webapps.stackexchange.com.
Hope_edi	English, Tamil, Malayalam	CC BY-SA 4.0	A Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube.
Atis	English	Apache-2.0 License	Word sequences with IOB slot tags and an intent label
Snips	English	Apache-2.0 License	Word sequences with IOB slot tags and an intent label
Multilingual Task Oriented	English, Spanish
It Helpdesk	English
Allocine	French	MIT License	French-language dataset for sentiment analysis
Flue	French	CC BY-SA 4.0	FLUE is an evaluation setup for French NLP systems similar to the popular GLUE benchmark
Facebook Post Aggression Identification	Hindi, Hinglish	CC-BY-NC-SA 4.0	Dataset with 3-way classification between 'Overtly Aggressive (OAG)', 'Covertly Aggressive (CAG)' and 'Non-aggressive (NAG)' over text data
Ilist	Hindi, Braj Bhasha, Awadhi, Bhojpuri, Magahi	Apache-2.0 License	This dataset was introduced in a task aimed at identifying 5 closely related languages of the Indo-Aryan language family: Hindi (also known as Khari Boli), Braj Bhasha, Awadhi, Bhojpuri and Magahi
Dravidian Codemix HASOC 2020	Tanglish, Manglish (Tamil and Malayalam written in Roman script)		The dataset was collected from YouTube comments and Tweets. Each comment/post is annotated with an offensive language label at the comment/post level.
Telugu News	Telugu		This dataset contains Telugu language news articles along with respective topic labels (business, editorial, entertainment, nation, sport) extracted from the daily Andhra Jyoti
Profanity	Turkish		Annotation follows the hierarchical tagset proposed in the Offensive Language Identification Dataset (OLID)
Banking77	English	CC BY-SA 4.0	The dataset is based on the banking domain and has 77 intents
SMP2019	Chinese		The dataset is based on 29 domains, including: app, email...
Rasa Dataset Chinese	Chinese		The dataset is based on the Rasa dataset translated to Chinese
JointDSF	Vietnamese	GNU Affero General Public License v3.0	The dataset is based on the ATIS dataset translated to Vietnamese
Urdu Fake News	Urdu		The dataset is based on fake news detection in Urdu taken from Hugging Face
Malayalam News Classification	Malayalam	CC BY-SA 4.0	The dataset is based on news classification in Malayalam language from AI4Bharat
Marathi News Classification	Marathi	CC BY-SA 4.0	The dataset is based on news classification in Marathi language from AI4Bharat
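
The Atis and Snips rows above mention word sequences with IOB slot tags. As a rough illustration of what that tagging scheme encodes, the sketch below groups IOB-tagged words into slot spans. The sample utterance, the `city` label, and the function name `iob_to_spans` are hypothetical; this does not reflect the actual file format of the datasets in this repo.

```python
from typing import List, Tuple


def iob_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged words into (slot_label, slot_text) spans.

    "B-x" starts a slot of type x, "I-x" continues it, and "O" is outside
    any slot. An "I-" tag that does not continue the open slot is treated
    like "O" in this sketch.
    """
    spans: List[Tuple[str, str]] = []
    label = None
    buffer: List[str] = []

    def flush() -> None:
        # Store the slot collected so far, if any, and reset the state.
        nonlocal label, buffer
        if label is not None:
            spans.append((label, " ".join(buffer)))
        label, buffer = None, []

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            flush()
            label, buffer = tag[2:], [word]
        elif tag.startswith("I-") and label == tag[2:]:
            buffer.append(word)
        else:
            flush()
    flush()
    return spans


# Hypothetical example, not taken from any dataset in this repo.
example_words = ["book", "a", "flight", "to", "new", "york"]
example_tags = ["O", "O", "O", "O", "B-city", "I-city"]
example_spans = iob_to_spans(example_words, example_tags)
```

Here `example_spans` holds a single span pairing the `city` label with the text "new york"; the intent label mentioned in the table would be a separate, utterance-level field.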

Note

NeuralSpace does not own any rights to these datasets, and they are not for commercial use. The license for each of these datasets will be added here soon.
