We are excited to announce the release of our Korean Audio-Visual Speech Recognition (AVSR) Dataset. The dataset was collected from a wide range of YouTube videos containing dialogues among multiple participants and is intended for researchers and developers working on audio-visual speech recognition and related fields.
This dataset aims to facilitate advancements in the field of AVSR by providing a diverse and challenging collection of samples. By making this dataset publicly available, we hope to foster collaboration, stimulate new ideas, and contribute to the community's understanding of multimodal speech recognition.
The dataset includes:
- 83 hours of audio-visual content spoken in Korean
- Videos sourced from YouTube, covering various topics and domains
- Varying numbers of participants per video
- Varied background noise levels and acoustic environments
- High-quality manually verified transcriptions for each video
Each data entry in the dataset consists of:
- URLs to video files
- AVSR metadata containing the transcription and the active speaker's face region for each utterance
- A set of scripts to extract audio-visual utterance clips from the videos using the metadata (a sketch of one possible metadata layout is shown after this list)
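For illustration, here is a minimal sketch of what one metadata entry might look like and how it could be read. The JSON layout and field names (`url`, `utterances`, `start`, `end`, `text`, `face_bbox`) are assumptions made for this example, not the dataset's actual schema; refer to the Data Preparation guidelines for the real format.

```python
import json

# Hypothetical metadata entry -- the field names below are illustrative
# assumptions, not the dataset's actual schema.
example_entry = {
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "utterances": [
        {
            "start": 12.34,                  # utterance start time (seconds)
            "end": 15.67,                    # utterance end time (seconds)
            "text": "...",                   # manually verified transcription
            "face_bbox": [120, 80, 96, 96],  # active speaker's face region (x, y, w, h)
        },
    ],
}

def load_utterances(path):
    """Yield (start, end, text, face_bbox) for each utterance in a metadata file."""
    with open(path, encoding="utf-8") as f:
        entry = json.load(f)
    for utt in entry["utterances"]:
        yield utt["start"], utt["end"], utt["text"], utt["face_bbox"]
```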
To ensure the responsible use of this dataset, please follow these guidelines:
- Attribution: Please credit the creators of the dataset by linking back to this repository in your research publications or project documentation.
- Privacy: Do not use the dataset to identify or infer sensitive information about individuals featured in the videos.
- Non-commercial use: This dataset is made available for academic and research purposes only. Commercial use of the dataset is prohibited.
To access the dataset, clone this repository and follow the Data Preparation guidelines for dataset preprocessing; a rough sketch of the download step is shown below.
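The repository's Data Preparation scripts are the authoritative procedure; purely as an illustration of the download step, a minimal sketch using the third-party `yt-dlp` package (an assumption here, not necessarily what the repository's scripts use) could look like this:

```python
from yt_dlp import YoutubeDL  # pip install yt-dlp

# URLs would come from the dataset's metadata files.
urls = ["https://www.youtube.com/watch?v=VIDEO_ID"]

ydl_opts = {
    "format": "bestvideo+bestaudio/best",  # fetch the best available audio and video
    "outtmpl": "videos/%(id)s.%(ext)s",    # save as videos/<video id>.<extension>
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)
```

Note that some source videos may become unavailable over time, so download failures for individual URLs should be expected and handled.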
- This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- The source code is licensed under the MIT License. See LICENSE for details.
If you use this dataset in your research, please cite it as follows:
```bibtex
@misc{kmsav,
  title={KMSAV: Korean Multi-speaker Spontaneous Audio-Visual Speech Recognition Dataset},
  author={Kiyoung Park and Changhan Oh and Sunghee Dong},
  journal={ETRI Journal},
  year={2024},
}
```
This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01376, Development of the multi-speaker conversational speech recognition technology).
For any questions or concerns related to this dataset, please reach out by opening an issue on this repository or by contacting us at pkyoung@etri.re.kr.