ECCC: Khmer Speech Translation Corpus of the Extraordinary Chambers in the Courts of Cambodia

This repository is for sharing the dataset of Khmer speech translation corpus for the low-resource speech translation track at Workshop on Asian Translation (WAT) 2022

Background of ECCC

It is a spoken language translation (SLT) corpus of Khmer to English and French, namely ECCC, which is an international court dataset consisting of text and speech in Khmer, English, and French. However, this provided dataset is only for Khmer SLT, which has the speech in Khmer and translation text into English and French. This dataset has a wide range of speakers: witnesses, defendants, judges, clerks or officers, co-prosecutors, experts, defense counsels, civil parties, and interpreters.

Statistical data

We randomly selected 20% of the original ECCC of the Khmer SLT [Soky et al.,2021].

Dataset # utterances Duration
Training 11.563 29:07:31
Dev 624 1:39:53
Testing 626 1:38:27

Baseline Systems

The goal of this task is to translate from the Khmer speech to English/French.

Baseline setting

The baseline was trained on ESPnet toolkit using Transformer-based architecutre with the following setting:

- ASR

Module size
Encoder 6
Decoder 6
FFN units 1024
Attention head 4
Attention-dim 256
Epochs 60
Batch-size 64
BPE 3000

- MT

Module size
Encoder 6
Decoder 6
FFN units 1024
Attention head 4
Attention-dim 256
Epochs 100
Batch-size 96
BPE 3000 per language

- ST

Module size
Encoder 6
Decoder 6
FFN units 1024
Attention head 4
Attention-dim 256
Epochs 60
Batch-size 64
BPE 3000 per language
  • Note: The ASR encoder is also used to initialize the ST encoder, and MT decoder is initialized the ST decoder.

Results

System Task Performance Input output
ASR Khmer 21.5% (WER) Khmer speech Khmer text
MT Khmer-English 11.3 (BLEU) Khmer text English text
MT Khmer-French 8.7 (BLEU) Khmer text French text
ST Khmer-English 5.1 (BLEU) Khmer speech English text
ST Khmer-French 5.1 (BLEU) Khmer speech French text

Citation

@INPROCEEDINGS{soky-eccc-2021,
    author={Soky, Kak and Mimura, Masato and Kawahara, Tatsuya and Li, Sheng and Ding, Chenchen and Chu, Chenhui and Sam, Sethserey},
    booktitle={2021 24th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)},
    title={Khmer Speech Translation Corpus of the Extraordinary Chambers in the Courts of Cambodia (ECCC)},
    year={2021},
    pages={122-127},
    doi={10.1109/O-COCOSDA202152914.2021.9660421}}

License

All the resources are property of NIPTICT, NICT, and Kyoto University. This dataset will only allow to use in the WAT 2022.

Acknowledgement

We would like to thank the National Institute of Posts, Telecoms, and Information Communication Technology (NIPTICT), Phnom Penh, Cambodia for giving us the resources of ECCC corpus.