This repository contains training and evaluation scripts for the OLKAVS dataset, described in the paper OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset.
Sample #1 (views {A, B, D, F, H}): "그때 되게 한여름이어서 되게 뜨거웠거든요." ("It was midsummer back then, so it was really hot.")
Sample #2 (views {A, C, E, G, I}): "그래서 도서관엘 다시 들어갔어요 공부하기 위해서" ("So I went back into the library, to study.")
The OLKAVS dataset contains:
- a total of 1,150 hours of audio
- a total of 5,750 hours of synced video from 9 different viewpoints
These were recorded from 1,107 Korean speakers in a studio setup, with corresponding Korean transcriptions.
You can download the OLKAVS dataset from AIHub: 립리딩(입모양) 음성인식 데이터 (lip-reading speech recognition data).
The folder structure of the OLKAVS dataset is as follows:
{root}/{group}/{subgroup}/{noise}/{specificity}/{gender_group}/{gender_subgroup}/{session_idx}.{extension}
- root : Root directory
- group : Data grouped by usage (e.g. Train, Validation, Test)
- subgroup : Data are separated into subgroups randomly.
- noise : Noise condition
- specificity : Specificity of the speaker
- gender_group : Gender
- gender_subgroup : Data of each gender are separated into gender_subgroups randomly.
- session_idx : Index of the 5-minute-long recording session. This index follows the naming rule described below.
- extension : File extension: mp4, wav, json for video, audio, and label, respectively.

Example:
./원천데이터/TS1/소음환경1/C(일반인)/F(여성)/F(여성)_1/lip_J_1_F_02_C032_A_010.wav
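As a small illustration of this layout (not part of the repository), the sketch below walks the extracted archive with Python's pathlib and lists every recording session; the root path is a placeholder.

```python
from pathlib import Path

# Placeholder: point this at the extracted AIHub archive.
DATASET_ROOT = Path("/data/OLKAVS")

# Each 5-minute recording session yields .mp4 (video), .wav (audio) and
# .json (label) files that share one session name (see the naming rule below).
for wav_path in sorted(DATASET_ROOT.rglob("*.wav")):
    session = wav_path.stem           # e.g. lip_J_1_F_02_C032_A_010
    print(session, wav_path.parent)   # session name and its {noise}/{specificity}/... folder
```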
The file naming rule is as follows:
lip_{video_env}_{audio_noise}_{gender}_{age}_{specificity}{speakerID}_{video_angle}_{index}
- video_env : J: indoor, K: outdoor
- audio_noise : 1: No noise, 2: Indoor noise, 3: Indoor ambiance, 4: Traffic noise, 5: Construction-site noise, 6: Natural outdoor noise
- gender : F: Female, M: Male
- age : 1: 10-19, 2: 20-29, 3: 30-39, 4: 40-49, 5: 50-59, 6: 60 and over
- specificity : C: Common speaker, E: Expert
- speakerID : Identification number of the speaker
- video_angle : A: Frontal, B: Upper left, C: Left, D: Lower left, E: Lower center, F: Lower right, G: Right, H: Upper right, I: Upper center
- index : Index of the 5-minute-long recording session
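For reference, the sketch below (an illustration, not code from this repository) parses a session file name according to the rule above; the group names mirror the placeholders in the pattern.

```python
import re

# Mirrors lip_{video_env}_{audio_noise}_{gender}_{age}_{specificity}{speakerID}_{video_angle}_{index}
FILENAME_PATTERN = re.compile(
    r"lip_(?P<video_env>[JK])_(?P<audio_noise>[1-6])_(?P<gender>[FM])_(?P<age>0?[1-6])"
    r"_(?P<specificity>[CE])(?P<speakerID>\d+)_(?P<video_angle>[A-I])_(?P<index>\d+)"
)

def parse_session_name(stem: str) -> dict:
    """Split a session file stem such as 'lip_J_1_F_02_C032_A_010' into its fields."""
    match = FILENAME_PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError(f"Unexpected OLKAVS file name: {stem}")
    return match.groupdict()

print(parse_session_name("lip_J_1_F_02_C032_A_010"))
# {'video_env': 'J', 'audio_noise': '1', 'gender': 'F', 'age': '02',
#  'specificity': 'C', 'speakerID': '032', 'video_angle': 'A', 'index': '010'}
```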
Each label (json) file has the following structure:
├── dataSet
│ ├── description
│ ├── url
│ ├── version
│ └── year
│
├── Video_info
│ ├── video_Name
│ ├── video_Format
│ ├── video_Duration
│ ├── FPS
│ └── Resolution
│
├── Audio_info
│ ├── Audio_Name
│ ├── Audio_Format
│ ├── Audio_Duration
│ ├── Sampling_rate
│ └── Channel(s)
│
├── Audio_env
│ └── Noise
│
├── Video_env
│ ├── env
│ └── Angle
│
├── Sentence_info
│ ├── ID
│ ├── topic
│ ├── sentence_text
│ ├── start_time
│ └── end_time
│
├── speaker_info
│ ├── speaker_ID
│ ├── Specificity
│ ├── Gender
│ ├── Age
│ └── Accent
│
└── Bounding_box_info
├── Face_bounding_box
└── Lip_bounding_box
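For illustration only, the snippet below loads one label file and reads a few of the fields listed above; the exact nesting and value types are assumptions based on this structure, so check a real label file before relying on them.

```python
import json

label_path = "lip_J_1_F_02_C032_A_010.json"  # placeholder path to one label file

with open(label_path, encoding="utf-8") as f:
    label = json.load(f)

# Field names follow the structure above; Sentence_info is assumed to be a
# list with one entry per sentence in the 5-minute session.
video_info = label["Video_info"]
print(video_info["video_Name"], video_info["FPS"], video_info["Resolution"])

for sentence in label["Sentence_info"]:
    print(sentence["ID"], sentence["start_time"], sentence["end_time"], sentence["sentence_text"])
```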
Install the dependencies:
pip install -r requirements.txt
Preprocess the data.
The script first crops the audio and video in time using the temporal labels (start, end).
It then crops the video frames to 96 × 96 using the bounding box.
Finally, it generates label scripts for training or evaluation. (A minimal illustrative sketch of the video cropping follows the run command below.)
Preparation
The data folder should comply with the structure described above.
Run Script
python preprocess.py --root_dir {ROOT_DIR} --src_dir {SOURCE_DIR} --label_dir {LABEL_DIR}
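As a conceptual reference only, here is a minimal sketch of the video cropping described above (temporal crop by a sentence's start/end times, then a spatial crop resized to 96 × 96 using the bounding box), assuming OpenCV for frame access and an (x, y, w, h) box convention; it is not the repository's actual implementation.

```python
import cv2
import numpy as np

def crop_sentence_clip(video_path, start_time, end_time, bbox, size=96):
    """Temporal crop by (start_time, end_time) in seconds, then spatial crop to size x size.

    bbox is assumed to be (x, y, w, h) in pixels, taken from the label's
    Bounding_box_info; the real preprocess.py may use a different convention.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    start_frame = int(start_time * fps)
    end_frame = int(end_time * fps)

    x, y, w, h = bbox
    frames = []
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    for _ in range(end_frame - start_frame):
        ok, frame = cap.read()
        if not ok:
            break
        crop = frame[y:y + h, x:x + w]                  # spatial crop by bounding box
        frames.append(cv2.resize(crop, (size, size)))   # resize to 96 x 96
    cap.release()
    return np.stack(frames) if frames else np.empty((0, size, size, 3))
```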
Generated Label Samples
{Video_filepath}\t{Audio_filepath}\t{Transcription}\t{Tokenized_Numbers}
./save/원천데이터/TS1/소음환경1/C(일반인)/F(여성)/F(여성)_1/lip_J_1_F_02_C032_A_011/2.mp4 ./save/원천데이터/TS1/소음환경1/C(일반인)/F(여성)/F(여성)_1/lip_J_1_F_02_C032_A_011/2.wav 건강이 안 좋아지니 죄착 죄책감이 드네 5 28 48 5 24 65 16 44 4 16 24 48 4 17 32 71 16 24 17 44 7 44 4 17 35 19 24 45 4 17 35 19 25 45 5 24 60 16 44 4 8 42 7 29
./save/원천데이터/TS1/소음환경1/C(일반인)/F(여성)/F(여성)_1/lip_J_1_F_02_C032_A_011/3.mp4 ./save/원천데이터/TS1/소음환경1/C(일반인)/F(여성)/F(여성)_1/lip_J_1_F_02_C032_A_011/3.wav 요즘에 불면증이 심해진 것 같아 16 36 17 42 60 16 29 4 12 37 52 11 30 48 17 42 65 16 44 4 14 44 60 23 25 17 44 48 4 5 28 63 4 5 24 69 16 24
...
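The generated label script is tab-separated as shown above; this small snippet (illustrative, not repository code) splits one line into its four fields, assuming a hypothetical file name.

```python
def parse_label_line(line: str):
    """Split one generated line into (video_path, audio_path, transcription, token_ids)."""
    video_path, audio_path, transcription, token_str = line.rstrip("\n").split("\t")
    token_ids = [int(t) for t in token_str.split()]
    return video_path, audio_path, transcription, token_ids

with open("train_labels.txt", encoding="utf-8") as f:  # hypothetical label script name
    for line in f:
        video_path, audio_path, text, tokens = parse_label_line(line)
```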
To reduce the required memory, we extracted lip features with the pre-trained model from here.
We used its visual front-end; the details of using the pre-trained model are in the paper.
Run
python inference.py -c {CONFIG_FILE_PATH}
| Model | # of params | Eval view | Eval noise | CER | WER | sWER | pt |
|---|---|---|---|---|---|---|---|
| AV-model | 62M* | View A | All | 3.64 | 10.82 | 8.18 | here |
| A-model | 38M* | View A | All | 3.57 | 10.61 | 8.11 | |
| V-model | 34M* | View A | All | 26.64 | 47.89 | 50.00 | |
| F-model | 45M | View A | Clean | 41.24 | 71.39 | 72.44 | |
| All-model | 45M | View A | Clean | 32.16 | 57.35 | 58.00 | here |
(* Does not include the pre-trained visual front-end parameters.)
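CER and WER are the standard character- and word-level edit-distance error rates (sWER is defined in the paper); below is a minimal, self-contained sketch of computing CER and WER, not taken from this repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (of characters or of words)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edit distance over the number of reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```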
- v1.0.0
  - release baseline
- v1.0.1
  - update result table
  - ICASSP 2024 accepted
The dataset itself is released under custom terms and conditions.
The OLKAVS scripts are released under the MIT license.
@INPROCEEDINGS{10446901,
author={Park, Jeongkyun and Hwang, Jung-Wook and Choi, Kwanghee and Lee, Seung-Hyeon and Ahn, Jun Hwan and Park, Rae-Hong and Park, Hyung-Min},
booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset},
year={2024},
volume={},
number={},
pages={6385-6389},
keywords={Training;Lips;Mouth;Speech recognition;Signal processing;Predictive models;Speaker recognition;Audio-visual speech datasets;multi-view datasets;lip reading;audio-visual speech recognition;deep learning},
doi={10.1109/ICASSP48485.2024.10446901}}
park32323@gmail.com
@Park323