Intonation-aided intention identification for Korean
fastText, Keras (TensorFlow), Numpy, Librosa
Currently available for Python 3.5; support for later versions is in progress.
Pretrained 100dim fastText vector
- Download this and unzip the .bin file into a new folder named 'vectors'.
- This can be replaced with whatever model the user employs, but doing so requires additional training.
(20.10.13) We updated our LICENSE to CC-BY-SA-4.0 and reorganized our guideline (in Korean) for official use. An international version will be prepared along with the publication.
(19.06.06) We provide the train & validation set and the test set separately here, for an easier Keras-based implementation.
The (train+validation) : test ratio is 9 : 1, and train : validation ratio is also 9 : 1 (thus, in total, 0.81 : 0.09 : 0.1).
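The arithmetic behind these ratios can be checked with a minimal NumPy sketch. It is not the script used to produce the released files: the function name `two_stage_split`, the seed, and the shuffling strategy are assumptions made only for illustration.

```python
import numpy as np


def two_stage_split(n_items, holdout=0.1, seed=0):
    """Return index arrays (train, val, test) from two successive 9:1 splits.

    Hypothetical helper: the released split files were not necessarily
    produced this way.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)

    # First split: (train + validation) : test = 9 : 1
    n_test = int(round(n_items * holdout))
    test, rest = idx[:n_test], idx[n_test:]

    # Second split: train : validation = 9 : 1 on the remaining 90%
    n_val = int(round(len(rest) * holdout))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = two_stage_split(1000)
print(len(train), len(val), len(test))  # 810 90 100
```

With 1000 items, the two splits leave 810, 90, and 100 items, which matches the stated 0.81 : 0.09 : 0.1 proportions.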
(19.02.28) A renewed version of the final corpus
The renewed version of the corpus is uploaded along with the models. It will not be changed unless a severe defect is observed.
We have found a few misclassified utterances and are correcting them, so the true final version will be disclosed before February. A pilot implementation of the system (e.g., as a tutorial) is less affected by this problem, but do not cite this dataset as a benchmark until February. A notice will be posted as soon as possible.
The next version will incorporate many more utterances and will be treated as a separate dataset.
FCI: A seven-class text corpus for the classification of conversation-style and non-canonical Korean utterances
- F: Fragments (nouns, noun phrases, incomplete sentences etc.) (FRs)
- C: Clear-cut cases (statements, questions, commands, rhetorical questions, rhetorical commands) (CCs)
- I: Intonation-dependent utterances (IUs)
- IAA: kappa = 0.85 for Corpus 1
- Data for the FCI module is labeled 0-6 and split into train and test sets at a 9:1 ratio.
- Available in data folder.
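As a rough illustration of the seven classes, the sketch below maps integer labels to class names. The ordering is an assumption: it follows the order in which the categories are listed above (fragments first, then the five clear-cut cases, then intonation-dependent utterances) and should be checked against the data files before use. The names `FCI_LABELS` and `label_name` are hypothetical and not part of this repository.

```python
# ASSUMPTION: label order follows the listing order of the categories above.
# Verify against the files in the data folder before relying on it.
FCI_LABELS = [
    "fragment",              # F
    "statement",             # C
    "question",              # C
    "command",               # C
    "rhetorical question",   # C
    "rhetorical command",    # C
    "intonation-dependent",  # I
]


def label_name(label):
    """Map an integer label in 0-6 to a class name (hypothetical helper)."""
    if not 0 <= label < len(FCI_LABELS):
        raise ValueError("label must be in 0-6, got {}".format(label))
    return FCI_LABELS[label]


print(label_name(0))
print(label_name(6))
```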
- Easy start: Demonstration.exe
python3 3i4k_demo.py
- Given only a text input, the system classifies the input into one of the aforementioned 7 categories. Available in the demo.
- Text classification is also available in the demo; a corpus (input: filename without '.txt') can be categorized into the 7 classes.
- Available by importing module
from classify import pred_only_text, classify_document
pred_only_text('sentence_you_choose')      # classify a single sentence
classify_document('filename_you_choose')   # classify a corpus file (name without '.txt')
- Available by importing module
from classify import pred_audio_text
pred_audio_text('speechfilename_you_choose', 'sentence_you_choose')   # classify from speech file and its text
The annotation guideline (in Korean) (previous version is here) was elaborately constructed by Won Ik Cho, with the great help of Ha Eun Park and Dae Ho Kook. The authors also thank Jong In Kim, Kyu Hwan Lee, and Jio Chung from the SNU Spoken Language Processing laboratory (SNU SLP) for providing a useful corpus for the analysis. This work was supported by the Technology Innovation Program (10076583, Development of free-running speech recognition technologies for embedded robot system) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).
For the utilization of the word vector dictionary, cite the following:
@article{cho2018real,
title={Real-time Automatic Word Segmentation for User-generated Text},
author={Cho, Won Ik and Cheon, Sung Jun and Kang, Woo Hyun and Kim, Ji Won and Kim, Nam Soo},
journal={arXiv preprint arXiv:1810.13113},
year={2018}
}
For the utilization of the annotation guideline or the dataset, cite the following:
@article{cho2018speech,
title={Speech Intention Understanding in a Head-final Language: A Disambiguation Utilizing Intonation-dependency},
author={Cho, Won Ik and Lee, Hyeon Seung and Yoon, Ji Won and Kim, Seok Min and Kim, Nam Soo},
journal={arXiv preprint arXiv:1811.04231},
year={2018}
}