Spoken language understanding (SLU) is a critical component of task-oriented dialogue systems. It usually consists of intent detection and slot filling tasks, which extract semantic constituents from natural language utterances.
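As a rough illustration of what the two tasks produce, the sketch below turns token-level BIO slot tags plus an utterance-level intent into a small semantic frame. The helper name `bio_to_slots`, the utterance, and the frame layout are assumptions made for this example; the label names only follow the ATIS style and are not taken from any specific data file listed here.

```python
from typing import List, Tuple


def bio_to_slots(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (slot_name, value) pairs from token-level BIO tags."""
    slots: List[Tuple[str, str]] = []
    current_name, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close the previous one if it is still open.
            if current_name is not None:
                slots.append((current_name, " ".join(current_tokens)))
            current_name, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_name == tag[2:]:
            # Continuation of the currently open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close any open span and skip this token.
            if current_name is not None:
                slots.append((current_name, " ".join(current_tokens)))
            current_name, current_tokens = None, []
    if current_name is not None:
        slots.append((current_name, " ".join(current_tokens)))
    return slots


# Hypothetical ATIS-style example: one intent per utterance, one tag per token.
tokens = "show flights from boston to new york".split()
tags = ["O", "O", "O", "B-fromloc.city_name", "O",
        "B-toloc.city_name", "I-toloc.city_name"]
frame = {"intent": "atis_flight", "slots": bio_to_slots(tokens, tags)}
print(frame)
```

Datasets labeled "Intent, Slots" below provide both kinds of annotation; those labeled only "Slots" or only "Intent" provide one of them.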

| Name | Intro | Links | Multi/Single Turn (M/S) | Detail | Size & Stats | Label |
| --- | --- | --- | --- | --- | --- | --- |
| ATIS | 1. The ATIS (Airline Travel Information Systems) dataset (Tur et al., 2010) is widely used in SLU research.<br>2. For natural language understanding. | Download:<br>1. https://github.com/yizhen20133868/StackPropagation-SLU/tree/master/data/atis<br>2. https://github.com/yvchen/JointSLU/tree/master/data<br>Paper: https://www.aclweb.org/anthology/H90-1021.pdf | S | Airline travel information. However, this dataset has been shown to have a serious skew problem on intent. | Train: 4478<br>Test: 893<br>120 slots and 21 intents | Intent, Slots |
| SNIPS | 1. Collected by Snips for model evaluation.<br>2. For natural language understanding.<br>3. Homepage: https://medium.com/snips-ai/benchmarking-natural-language-understanding-systems-google-facebook-microsoft-and-snips-2b8ddcf9fb19 | Download: https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines<br>Paper: https://arxiv.org/pdf/1805.10190.pdf | S | 7 tasks: weather, play music, search, add to list, book, movie | Train: 13,084<br>Test: 700<br>7 intents and 72 slot labels | Intent, Slots |
| Facebook Multilingual SLU Dataset | 1. Contains English, Spanish, and Thai across the weather, reminder, and alarm domains.<br>2. For cross-lingual SLU. | Download: https://fb.me/multilingual_task_oriented_data<br>Paper: https://www.aclweb.org/anthology/N19-1380.pdf | S | Utterances are manually translated and annotated. | Train: English 30,521; Spanish 3,617; Thai 2,156<br>Dev: English 4,181; Spanish 1,983; Thai 1,235<br>Test: English 8,621; Spanish 3,043; Thai 1,692<br>11 slots and 12 intents | Intent, Slots |
| MIT Restaurant Corpus | The MIT corpus contains a train set and a test set in BIO format for NLU. | Download: https://groups.csail.mit.edu/sls/downloads/restaurant/ | S | A single-domain dataset associated with restaurant reservations. MR contains ‘open-vocabulary’ slots, such as restaurant names. | Train: 7760<br>Test: 1521 | Slots |
| MIT Movie Corpus | The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format. The eng corpus contains simple queries, and the trivia10k13 corpus contains more complex queries. | Download: https://groups.csail.mit.edu/sls/downloads/movie/ | S | The MIT movie corpus consists of two single-domain datasets: the movie eng (ME) and movie trivia (MT) datasets. While both datasets contain queries about film information, the trivia queries are more complex and specific. | eng corpus: Train: 9775; Test: 2443<br>Trivia corpus: Train: 7816; Test: 1953 | Slots |
| Multilingual ATIS | ATIS was manually translated into Hindi and Turkish. | Download: distributed through the LDC; it can be downloaded with a membership or for a fee.<br>Paper: http://shyamupa.com/papers/UFTHH18.pdf | S | 3 languages | On top of the ATIS dataset, 893 and 715 utterances from the ATIS test split were translated and annotated for Hindi and Turkish evaluation, respectively. In addition, 600 utterances from the ATIS train split were translated and annotated (for each language separately) to use as supervision.<br>In total 37,084 training examples and 7,859 test examples. | Intent, Slots |
| Multilingual ATIS++ | Extends the Multilingual ATIS corpus to nine languages across four language families. | Download: contact multiatis@amazon.com.<br>Paper: https://arxiv.org/abs/2004.14353 | S | 9 languages | See the paper for the full table of statistics (too much information to fit here). | Intent, Slots |
| Almawave-SLU | 1. A dataset for Italian SLU.<br>2. Generated from SNIPS through a semi-automatic procedure. | Download: contact [first name initial\].[last name]@almawave.it (any author of the paper) for the dataset.<br>Paper: https://arxiv.org/pdf/1907.07526.pdf | S | 6 domains: Music, Restaurants, TV, Movies, Books, Weather | Train: 7,142<br>Validation: 700<br>Test: 700<br>7 intents and 39 slots | Intent, Slots |
| Chatbot Corpus | 1. Based on questions gathered by a Telegram chatbot that answers questions about public transport connections; consists of 206 questions.<br>2. For intent classification testing. | Download: https://github.com/sebischair/NLU-Evaluation-Corpora<br>Paper: https://www.aclweb.org/anthology/W17-5522.pdf | S | 2 intents: Departure Time, Find Connection<br>5 entity types: StationStart, StationDest, Criterion, Vehicle, Line | Train: 100<br>Test: 106 | Intent, Entity |
| StackExchange Corpus | 1. Based on data from two StackExchange platforms: Ask Ubuntu and Web Applications.<br>2. Gathers 290 questions and answers in total, 100 from Web Applications and 190 from Ask Ubuntu.<br>3. For intent classification testing. | Download: https://github.com/sebischair/NLU-Evaluation-Corpora<br>Paper: https://www.aclweb.org/anthology/W17-5522.pdf | S | Ask Ubuntu intents: “Make Update”, “Setup Printer”, “Shutdown Computer”, and “Software Recommendation”<br>Web Applications intents: “Change Password”, “Delete Account”, “Download Video”, “Export Data”, “Filter Spam”, “Find Alternative”, and “Sync Accounts” | Total: 290<br>Ask Ubuntu: 190<br>Web Applications: 100 | Intent, Entity |
| MixSNIPS/MixATIS | Multi-intent dataset based on SNIPS and ATIS. | Download: https://github.com/LooperXX/AGIF/tree/master/data<br>Paper: https://www.aclweb.org/anthology/2020.findings-emnlp.163.pdf | S | Sentences with different intents are connected with conjunctions, giving ratios of 0.3, 0.5, and 0.2 for sentences with 1, 2, and 3 intents, respectively. | Train: 12,759 utterances<br>Dev: 4,812 utterances<br>Test: 7,848 utterances | Intent (Multi), Slots |
| TOP semantic parsing | 1. Hierarchical annotation scheme for semantic parsing.<br>2. Allows the representation of compositional queries.<br>3. Can be efficiently and accurately parsed by standard constituency parsing models. | Download: http://fb.me/semanticparsingdialog<br>Paper: https://www.aclweb.org/anthology/D18-1300.pdf | S | Focused on navigation, events, and navigation to events. The evaluation script can be run from evaluate.py within the dataset. | 44783 annotations<br>Train: 31279<br>Dev: 4462<br>Test: 9042 | Intent, Slots in tree format |
| MTOP: Multilingual TOP | 1. An almost-parallel multilingual task-oriented semantic parsing dataset covering 6 languages and 11 domains.<br>2. The first multilingual dataset that contains compositional representations allowing complex nested queries.<br>3. Dataset creation: i) generating synthetic utterances and annotating them in English; ii) translation, label transfer, post-processing, post-editing, and filtering for the other languages. | Download: https://fb.me/mtop_dataset<br>Paper: https://arxiv.org/pdf/2008.09335.pdf | S | 6 languages (both high and low resource): English, Spanish, French, German, Hindi, and Thai. A mix of simple and compositional nested queries across 11 domains, 117 intents, and 78 slots. | 100k examples in total for 6 languages. Roughly divided into 70:10:20 percent splits for train, eval, and test. | Two kinds of representations:<br>1. Flat representation: intent and slots.<br>2. Compositional decoupled representation: nested intents inside slots.<br>More details in Section 3.2 of the paper. |
| CAIS | Collected from real-world speaker systems, with manual annotations of slot tags and intent labels. | Download: [https://github.com/Adaxry/CM-Net](https://github.com/Adaxry/CM-Net/tree/master/CAIS) | S | 1. The utterances were collected from the Chinese Artificial Intelligence Speakers.<br>2. Adopts the BIOES tagging scheme for slots instead of the BIO2 scheme used in ATIS.<br>3. Intent labels are skewed toward the PlayMusic option. | Train: 7,995 utterances<br>Dev: 994 utterances<br>Test: 1024 utterances | Slot tags and intent labels |
| Simulated Dialogues dataset | machines2machines (M2M) | Download: https://github.com/google-research-datasets/simulated-dialogue<br>Paper: http://www.colips.org/workshop/dstc4/papers/60.pdf | M | Slots:<br>Sim-R (Restaurant): price_range, location, restaurant_name, category, num_people, date, time<br>Sim-M (Movie): theatre_name, movie, date, time, num_people<br>Sim-GEN (Movie): theatre_name, movie, date, time, num_people | Train: Sim-R: 1116; Sim-M: 384; Sim-GEN: 100k<br>Dev: Sim-R: 349; Sim-M: 120; Sim-GEN: 10k<br>Test: Sim-R: 775; Sim-M: 264; Sim-GEN: 10k | Dialogue state<br>User's act, slot, intent<br>System's act, slot |
| Schema-Guided Dialogue Dataset (SGD) | Dialogue simulation (automatic, based on identified scenarios), word replacement, and human integration as paraphrasing. | Download: https://github.com/google-research-datasets/dstc8-schema-guided-dialogue<br>Paper: https://arxiv.org/pdf/1909.05855.pdf | M | domains: 16, dialogues: 16142, turns: 329964, avg turns per dialogue: 20.44, total unique tokens: 30352, slots: 214, slot values: 14319 | NA | Schema representation: service_name; description; slot's name, description, is_categorical, possible_values; intent's name, description, is_transactional, required_slots, optional_slots, result_slots.<br>Dialogue representation: dialogue_id, services, turns, speaker, utterance, frame, service, slot's name, start, exclusive_end; action's act, slot, values, canonical_values; service_call's method, parameters; service_results, state's active_intent, requested_slots, slot_values |
| CLINC150 | An intent classification (text classification) dataset with 150 in-domain intent classes. The main purpose of this dataset is to evaluate various classifiers on out-of-domain performance. | Download: https://archive.ics.uci.edu/ml/datasets/CLINC150<br>Paper: https://www.aclweb.org/anthology/D19-1131/ | S | data_full.json: each of the 150 in-domain intent classes has 100 train, 20 val, and 30 test samples, while the out-of-domain class has 100 train, 100 val, and 1,000 test samples.<br>data_small.json: in-domain classes have 50 train, 20 val, and 30 test samples; the out-of-domain class has 100 train, 100 val, and 1,000 test samples.<br>data_imbalanced.json: in-domain intent classes have 25, 50, 75, or 100 train, 20 val, and 30 test samples, while the out-of-domain class has 100 train, 100 val, and 1,000 test samples.<br>data_oos_plus.json: same as data_full.json except that there are 250 out-of-domain training samples. | size 23700<br>intents 150 | Intent (in-domain, out-of-domain) |
| HWU64 | | Download: https://github.com/xliuhw/NLU-Evaluation-Data<br>Paper: https://arxiv.org/pdf/1903.05566.pdf | S | 21 domains, including music, news, and calendar | size 25716, intents 64, slots 54 | Intent detection; Entity extraction |
| Banking-77 | The BANKING77 dataset provides a very fine-grained set of intents in the banking domain. It comprises 13,083 customer service queries labeled with 77 intents. It focuses on fine-grained single-domain intent detection. | Download: github.com/PolyAI-LDN/polyai-models<br>Paper: https://arxiv.org/pdf/2003.04807.pdf | S | banking | size 13083, intents 77 | Intent detection |
| Restaurants-8K | A new challenging dataset of 8,198 utterances, compiled from actual conversations in the restaurant booking domain. | Download: https://github.com/PolyAI-LDN/task-specific-datasets<br>Paper: https://arxiv.org/pdf/2005.08866.pdf | S | restaurant booking | size 11929, slots 5 | Slot filling |
| ATIS in Chinese and Indonesian | ATIS semantic dataset annotated in two new languages. | Download: http://statnlp.org/research/sp/<br>Paper: https://www.aclweb.org/anthology/P17-2007.pdf | S | airline travel | size 5371, slots 120 (166; lambda-calculus) | Semantic parsing; Slot filling |
| Vietnamese ATIS | | Download: https://github.com/VinAIResearch/JointIDSF<br>Paper: https://arxiv.org/pdf/2104.02021.pdf | S | airline travel | size 5871, intents 25, slots 120 | Intent detection, Slot filling |
| xSID | Translation of part of the Facebook and SNIPS datasets. | Download: https://bitbucket.org/robvanderg/xsid<br>Paper: https://aclanthology.org/2021.naacl-main.197.pdf | S | Languages: Arabic, Danish, South-Tyrolean, German, English, Indonesian, Italian, Japanese, Kazakh, Dutch, Serbian, Turkish, Chinese.<br>Intents: AddToPlaylist, BookRestaurant, PlayMusic, RateBook, SearchCreativeWork, SearchScreeningEvent, alarm/cancel_alarm, alarm/modify_alarm, alarm/set_alarm, alarm/show_alarms, alarm/snooze_alarm, reminder/cancel_reminder, reminder/set_reminder, reminder/show_reminders, weather/find. | 500 test and 300 dev utterances for each language. 43605 English train utterances (automatic translation into all languages is also provided). | Intent detection, Slot filling |
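The CAIS entry notes that its slots use the BIOES scheme rather than the BIO scheme used in ATIS, so mixing these corpora in one pipeline usually calls for a tag conversion. The function below is a minimal sketch of one common BIO-to-BIOES mapping, assuming well-formed input in which every tag is either `O` or has the form `B-<label>` / `I-<label>`. The function name `bio_to_bioes` and the slot names in the example are hypothetical and not taken from any of the datasets.

```python
from typing import List


def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert a BIO tag sequence to BIOES.

    Assumes each tag is "O" or "<B|I>-<label>". Single-token spans
    become S-, and the last token of a longer span becomes E-.
    """
    out: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The span continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out


# Hypothetical slot names, used only to show the mapping.
bio = ["O", "B-song", "I-song", "I-song", "O", "B-artist"]
print(bio_to_bioes(bio))
```

The reverse direction is simpler: mapping `S-` to `B-` and `E-` to `I-` recovers the BIO tags, so either scheme can serve as the common format.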


