We will update the dataset and model checkpoints soon!

MDCC Dataset

This repository contains code and metadata for downloading the MDCC dataset, as described in the following paper:

Tiezheng Yu and Rita Frieske and Peng Xu and Samuel Cahyawijaya and Cheuk Tung Shadow Yiu and Holy Lovenia and Wenliang Dai and Elham J. Barezi and Qifeng Chen and Xiaojuan Ma and Bertram E. Shi and Pascale Fung. "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset" Link: https://arxiv.org/pdf/2201.02419.pdf

@misc{yu2022automatic,
      title={Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset}, 
      author={Tiezheng Yu and Rita Frieske and Peng Xu and Samuel Cahyawijaya and Cheuk Tung Shadow Yiu and Holy Lovenia and Wenliang Dai and Elham J. Barezi and Qifeng Chen and Xiaojuan Ma and Bertram E. Shi and Pascale Fung},
      year={2022},
      eprint={2201.02419},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Download

  1. Send an email with the title "Download MDCC Dataset" to chinatysonyu@gmail.com. We will reply with a LICENSE that you need to sign.
  2. Send an email with the title "Download MDCC Dataset SIGNED" to chinatysonyu@gmail.com and attach the PDF of the signed LICENSE.
  3. We will reply with an email containing the link to download the dataset.
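
Because the two requests are told apart by their exact titles, it may help to build them programmatically. The following is a minimal sketch using only Python's standard library; the helper name `build_request`, the body text, and the attachment filename are illustrative assumptions, not part of this repository.

```python
from email.message import EmailMessage
from typing import Optional

# Contact address taken from the download steps above.
MDCC_CONTACT = "chinatysonyu@gmail.com"


def build_request(sender: str, signed_license: Optional[bytes] = None) -> EmailMessage:
    """Build the step-1 request, or the step-2 message when a signed PDF is given.

    Hypothetical helper: only the titles and the recipient come from the README.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = MDCC_CONTACT
    if signed_license is None:
        # Step 1: ask for the LICENSE.
        msg["Subject"] = "Download MDCC Dataset"
        msg.set_content("Hello, I would like to request access to the MDCC dataset.")
    else:
        # Step 2: return the signed LICENSE as a PDF attachment.
        msg["Subject"] = "Download MDCC Dataset SIGNED"
        msg.set_content("Hello, please find the signed LICENSE attached.")
        msg.add_attachment(
            signed_license,
            maintype="application",
            subtype="pdf",
            filename="LICENSE_signed.pdf",  # assumed filename
        )
    return msg
```

The sketch only constructs the messages. Sending them is left to your own mail client or SMTP setup, since no mail server is specified here.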

Download checkpoints

Google Drive Link

How to run the code?

[TODO]