Example use cases, scripts, and datasets for getting started with the NeuralSpace platform.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

NeuralSpace Examples

In this repo you will find various datasets, scripts, etc. that will help you understand our APIs better and faster.

Datasets

Natural Language Understanding (NLU)

We formatted various openly available NLU datasets so that they can be directly used on our platform. These datasets can be found in the datasets/nlu folder. There are different sub-folders for different languages.
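To see which datasets are available for each language, you can walk the folder layout described above. The snippet below is a minimal sketch that assumes the repo has been cloned locally and that each sub-folder of datasets/nlu is named after a language; the helper name `list_nlu_datasets` is hypothetical and not part of any NeuralSpace tooling.

```python
from pathlib import Path


def list_nlu_datasets(repo_root):
    """Map each language sub-folder under datasets/nlu to the entries inside it.

    Assumes the layout <repo_root>/datasets/nlu/<language>/<dataset>.
    Returns an empty dict if the datasets/nlu folder is missing.
    """
    nlu_dir = Path(repo_root) / "datasets" / "nlu"
    if not nlu_dir.is_dir():
        return {}

    datasets_by_language = {}
    # Only directories are treated as language sub-folders.
    for lang_dir in sorted(p for p in nlu_dir.iterdir() if p.is_dir()):
        datasets_by_language[lang_dir.name] = sorted(
            child.name for child in lang_dir.iterdir()
        )
    return datasets_by_language


if __name__ == "__main__":
    # Run from the root of a local clone of this repo.
    for language, entries in list_nlu_datasets(".").items():
        print(language, entries)
```

Run it from the root of the clone; it prints one line per language sub-folder with the names of the entries it contains.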

The table below summarizes each dataset.
For citations of these datasets, see the citation information.

| Dataset Name | Languages | License | Description |
| --- | --- | --- | --- |
| Hard | Arabic | | 93700 hotel reviews from booking.com |
| Miam | English, French, German, Italian, Spanish | CC BY-SA 4.0 | Covers a variety of domains, including spontaneous speech, scripted scenarios, and joint task completion |
| Ask Ubuntu | English | CC BY-SA 3.0 | 162 questions and answers from https://askubuntu.com. |
| Chatbot Corpus | English | CC BY-SA 3.0 | 206 questions from a Telegram chatbot for public transport in Munich |
| Web Application Corpus | English | CC BY-SA 3.0 | 89 questions and answers from https://webapps.stackexchange.com. |
| Hope_edi | English, Tamil, Malayalam | CC BY-SA 4.0 | A Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube. |
| Atis | English | Apache-2.0 License | Word sequences with IOB slot tags and the intent label |
| Snips | English | Apache-2.0 License | Word sequences with IOB slot tags and the intent label |
| Multilingual Task Oriented | English, Spanish | | |
| It Helpdesk | English | | |
| Allocine | French | MIT License | French-language dataset for sentiment analysis |
| Flue | French | CC BY-SA 4.0 | FLUE is an evaluation setup for French NLP systems similar to the popular GLUE benchmark |
| Facebook Post Aggression Identification | Hindi, Hinglish | CC-BY-NC-SA 4.0 | Dataset with 3-way classification between 'Overtly Aggressive (OAG)', 'Covertly Aggressive (CAG)' and 'Non-aggressive (NAG)' over text data |
| Ilist | Hindi, Braj Bhasha, Awadhi, Bhojpuri, Magahi | Apache-2.0 License | This dataset was introduced in a task aimed at identifying 5 closely related languages of the Indo-Aryan language family: Hindi (also known as Khari Boli), Braj Bhasha, Awadhi, Bhojpuri and Magahi |
| Dravidian Codemix HASOC 2020 | Tanglish, Manglish (Tamil and Malayalam written in Roman script) | | The dataset was collected from YouTube comments and tweets. Each comment/post is annotated with an offensive language label at the comment/post level. |
| Telugu News | Telugu | | This dataset contains Telugu-language news articles along with their topic labels (business, editorial, entertainment, nation, sport) extracted from the daily Andhra Jyoti |
| Profanity | Turkish | | Annotation follows the hierarchical tagset proposed in the Offensive Language Identification Dataset (OLID) |
| Banking77 | English | CC BY-SA 4.0 | The dataset is based on the banking domain and has 77 intents |
| SMP2019 | Chinese | | The dataset is based on 29 domains, including: app, email... |
| Rasa Dataset Chinese | Chinese | | The dataset is based on the Rasa dataset translated to Chinese |
| JointDSF | Vietnamese | GNU Affero General Public License v3.0 | The dataset is based on the ATIS dataset translated to Vietnamese |
| Urdu Fake News | Urdu | | The dataset is based on fake news detection in Urdu, taken from Hugging Face |
| Malayalam News Classification | Malayalam | CC BY-SA 4.0 | The dataset is based on news classification in the Malayalam language from AI4Bharat |
| Marathi News Classification | Marathi | CC BY-SA 4.0 | The dataset is based on news classification in the Marathi language from AI4Bharat |
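
Several of the datasets above (for example Atis and Snips) are described as word sequences with IOB slot tags. As a rough illustration of that tagging scheme, the sketch below groups tokens into slots: `B-<name>` opens a slot, `I-<name>` continues it, and `O` marks a token outside any slot. The sentence and the slot names `fromloc` and `toloc` are made up for illustration and are not copied from the dataset files, and the helper `iob_to_slots` is hypothetical.

```python
def iob_to_slots(tokens, tags):
    """Group tokens into (slot_name, text) pairs using IOB tags.

    An I- tag that does not continue the currently open slot is
    treated like O, which is one common lenient convention.
    """
    slots = []
    current_name, current_tokens = None, []

    def close_slot():
        # Store the open slot, if any, as (name, joined text).
        if current_name is not None:
            slots.append((current_name, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_slot()
            current_name, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_name == tag[2:]:
            current_tokens.append(token)
        else:
            close_slot()
            current_name, current_tokens = None, []
    close_slot()
    return slots


# Illustrative example only; not taken from the dataset files.
tokens = "book a flight from new york to denver".split()
tags = ["O", "O", "O", "O", "B-fromloc", "I-fromloc", "O", "B-toloc"]
print(iob_to_slots(tokens, tags))
# prints [('fromloc', 'new york'), ('toloc', 'denver')]
```

The intent label mentioned in the table applies to the whole sentence, so it is stored separately from the per-token tags and is not handled by this helper.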

Note

NeuralSpace does not own any rights to these datasets, and they are not for commercial use. The license for each dataset will be added here soon.