✅ Official data generation and data preparation scripts for CHiME-8 DASR.
We provide a more convenient standalone interface for downloading and preparing the core CHiME-8 DASR data.
This year we also support automatic downloading of CHiME-6.
For in-depth details about CHiME-8 DASR data and rules, refer to chimechallenge.org/current/task1/data.
You can install with:
pip install git+https://github.com/chimechallenge/chime-utils
This package brings a new module:
from chime_utils import dgen, dprep, scoring, text_norm
And new CLI commands:
- chime-utils dgen: generates and downloads CHiME-8 data.
- chime-utils lhotse-prep: prepares lhotse manifests for CHiME-8 data.
- chime-utils espnet-prep: prepares ESPnet/Kaldi-style manifests for CHiME-8 data.
- chime-utils score: scripts used for official scoring.
Hereafter we describe each command/function in detail.
You can generate all CHiME-8 DASR data in one go with:
chime-utils dgen dasr ./download /path/to/mixer6_root ./chime8_dasr --part train,dev --download
If unsure, you can always use --help.
This script will automatically download CHiME-6, DiPCo, and NOTSOFAR1 into ./download
Ensure you have at least 1 TB of free space there. You can remove the .tar.gz
archives after the full data preparation to save some space later.
Mixer 6 Speech, instead, has to be obtained through the LDC.
Refer to chimechallenge.org/current/task1/data for how to obtain Mixer 6 Speech.
The Mixer 6 root folder should look like this:
mixer6_root
├── data
│ └── pcm_flac
├── metadata
│ ├── iv_components_final.csv
│ ├── mx6_calls.csv
...
├── splits
│ ├── dev_a
│ ├── dev_a.list
...
└── train_and_dev_files
🔐 You can check if the data has been successfully prepared with:
chime-utils dgen checksum ./chime8_dasr
The resulting ./chime8_dasr should look like this:
chime8_dasr/
├── chime6/
├── dipco/
├── mixer6/
└── notsofar1/
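If you just want a quick structural sanity check from Python before the full checksum pass, a minimal standard-library sketch could look like the following (the folder names come from the layout above; the chime-utils dgen checksum command remains the authoritative validation):

```python
from pathlib import Path

# Quick structural check of the prepared CHiME-8 DASR root; the
# `chime-utils dgen checksum` command remains the authoritative validation.
root = Path("./chime8_dasr")
missing = [d for d in ("chime6", "dipco", "mixer6", "notsofar1")
           if not (root / d).is_dir()]
if missing:
    raise SystemExit(f"Missing scenario folders: {missing}")
print("All scenario folders are in place.")
```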
For more information about CHiME-8 data, see the website data page: chimechallenge.org/current/task1/data.
We provide scripts for obtaining each core dataset independently if needed.
Command basic usage: chime-utils dgen <DATASET> <DOWNLOAD_DIR> <OUTPUT_DIR> --download
- CHiME-6 (not CHiME-5!)
  - chime-utils dgen chime6 ./download/chime6 ./chime8_dasr/chime6 --part train,dev --download
  - If it is already in storage in /path/to/chime6_root, you can instead use: chime-utils dgen chime6 /path/to/chime6_root ./chime8_dasr/chime6 --part train,dev
- DiPCo
  - chime-utils dgen dipco ./download/dipco ./chime8_dasr/dipco --part train,dev --download
  - If it is already in storage in /path/to/dipco, you can instead use: chime-utils dgen dipco /path/to/dipco ./chime8_dasr/dipco --part train,dev
- Mixer 6 Speech
  - chime-utils dgen mixer6 /path/to/mixer6_root ./chime8_dasr/mixer6 --part train_call,train_intv,train,dev
  - It must be obtained via the LDC (again, see the CHiME-8 data page) and extracted manually.
- NOTSOFAR1
  - chime-utils dgen notsofar1 ./download/notsofar1 ./chime8_dasr/notsofar1 --part train,dev --download
  - If it is already in storage in /path/to/notsofar1, you can instead use: chime-utils dgen notsofar1 /path/to/notsofar1 ./chime8_dasr/notsofar1 --part train,dev

The NOTSOFAR1 download folder should look like this:
.
├── dev
│ └── dev_set
│ └── 240208.2_dev
└── train
└── train_set
└── 240208.2_train
This year's CHiME-8 DASR baseline is built directly upon the NVIDIA NeMo team's CHiME-7 DASR submission from last year [1].
It is available at chimechallenge/C8DASR-Baseline-NeMo/tree/main/scripts/chime8; we use our own fork of NeMo because it is easier to maintain during the challenge.
For convenience, we also offer here data preparation scripts for different toolkits (parsing the data into different manifest formats):
- Kaldi and K2/Icefall (with lhotse)
- ESPnet (with lhotse)
See DATAPREP.md.
Last but not least, we also provide scripts for scoring (the exact same scripts the organizers will use for ranking CHiME-8 DASR submissions).
To learn more about scoring and ranking in CHiME-8 DASR, please head over to the official CHiME-8 Challenge website.
Note that the following scripts expect the participants' predictions to be in the standard CHiME-style JSON format, also known as the SegLST (Segment-wise Long-form Speech Transcription) format (we adopt the MeetEval naming convention [2]).
Each SegLST is a JSON file containing a list of dicts (one per utterance) with the following keys:
{
"end_time": "43.82",
"start_time": "40.60",
"words": "chime style json format",
"speaker": "P05",
"session_id": "S02"
}
Please head over to the CHiME-8 DASR submission instructions to learn more about scoring and text normalization.
The scripts accept either a single SegLST JSON file or a folder containing multiple SegLST JSON files, e.g. one per scenario, as requested in the CHiME-8 DASR submission instructions.
For example, for the development set:
dev
├── chime6.json
├── dipco.json
├── mixer6.json
└── notsofar1.json
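As a rough illustration, assembling and validating such a predictions folder with only the standard library could look like the sketch below (validate_seglst and the placeholder segment are ours for illustration, not part of chime-utils):

```python
import json
from pathlib import Path

# Keys required by the SegLST format described above.
SEGLST_KEYS = {"session_id", "speaker", "start_time", "end_time", "words"}

def validate_seglst(path: Path) -> None:
    """Check that a file is a JSON list of dicts with the required SegLST keys."""
    segments = json.loads(path.read_text())
    assert isinstance(segments, list), f"{path} must contain a JSON list"
    for i, seg in enumerate(segments):
        missing = SEGLST_KEYS - seg.keys()
        assert not missing, f"{path}, segment {i}: missing keys {missing}"

# One SegLST JSON per scenario, as in the dev/ layout above.
dev_dir = Path("dev")
dev_dir.mkdir(exist_ok=True)
for scenario in ("chime6", "dipco", "mixer6", "notsofar1"):
    segments = [{  # placeholder: replace with your system's predictions
        "session_id": "S02",
        "speaker": "P05",
        "start_time": "40.60",
        "end_time": "43.82",
        "words": "chime style json format",
    }]
    out_file = dev_dir / f"{scenario}.json"
    out_file.write_text(json.dumps(segments, indent=4))
    validate_seglst(out_file)
```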
Text normalization is automatically applied to your predictions before scoring.
In CHiME-8 DASR we use a more complex text normalization, built on top of the Whisper text normalization but crucially different (less "aggressive" and made idempotent).
Examples are available here: ./tests/test_normalizer.py
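Concretely, idempotency means that normalizing already-normalized text is a no-op. A minimal sketch of such a check follows; note that the get_txt_norm entry point and the "chime8" identifier are our assumptions about the API, so check tests/test_normalizer.py for the actual usage:

```python
# Idempotency check sketch. NOTE: `get_txt_norm` and "chime8" are assumed
# names; see ./tests/test_normalizer.py for the actual chime-utils API.
from chime_utils.text_norm import get_txt_norm

normalizer = get_txt_norm("chime8")
text = "Mr. Smith owns 100% of CHiME-8!"
once = normalizer(text)
# An idempotent normalizer maps its own output to itself.
assert normalizer(once) == once
```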
Specifically, we provide scripts to compute common ASR metrics for long-form meeting scenarios. These scores are computed through the awesome MeetEval [2] toolkit.
Command basic usage: chime-utils score <SCORE_TYPE> -s <PREDICTIONS_DIR> <CHIME-8 ROOT> --dset-part <DEV or EVAL> --ignore-missing
The --ignore-missing flag excludes from scoring any scenario that lacks system predictions or ground truth for the given --dset-part (e.g. NOTSOFAR1 dev).
We expect the predictions folder to be in the format outlined previously (a dev folder containing one SegLST JSON per scenario).
Two metrics are supported:
- tcpWER [2]: chime-utils score tcpwer
- concatenated minimum-permutation word error rate (cpWER) [3]: chime-utils score cpwer
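Both commands delegate to MeetEval, so you can also call it from Python directly. The sketch below is hedged: we assume meeteval.wer.tcpwer accepts SegLST file paths and a collar in seconds and returns a per-session mapping of error rates; the exact signature may differ across MeetEval versions, and the file names are placeholders:

```python
import meeteval

# Assumed API: meeteval.wer.tcpwer taking SegLST paths and a collar (seconds)
# and returning a per-session mapping of error rates; check your MeetEval
# version's documentation, as the exact signature may differ.
per_session = meeteval.wer.tcpwer(
    reference="chime6_ref.json",  # placeholder ground-truth SegLST
    hypothesis="chime6.json",     # placeholder system output SegLST
    collar=5,
)
for session_id, error_rate in per_session.items():
    print(session_id, error_rate)
```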
You can also use chime-utils score segslt2ctm input-dir output-dir to automatically convert all SegLST JSON files in input-dir (and its subfolders) to .ctm files.
This makes it easy to also use other ASR metric tools such as NIST Asclite.
We also provide utilities to convert SegLST (aka CHiME-6 style) JSON annotations to other formats, such as .ctm and Audacity-compatible labels (.txt), so that system outputs can be analyzed in more depth (a minimal conversion sketch follows this list):
- Segment Time Marked (.stm) conversion: chime-utils score segslt2stm input-dir output-dir
- Conversation Time Mark (.ctm) conversion: chime-utils score segslt2ctm input-dir output-dir
- Rich Transcription Time Marked (.rttm) conversion: chime-utils score segslt2rttm input-dir output-dir
  - this allows the use of other diarization scoring tools such as dscore.
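To make the mappings concrete, here is a minimal standard-library sketch of the RTTM and CTM conversions (the field layouts follow the usual NIST conventions; the even word-time split in the CTM writer is our simplification, since SegLST carries no per-word timings, and not necessarily what chime-utils does):

```python
import json
from pathlib import Path

def seglst_to_rttm(seglst_path: str) -> str:
    """One RTTM line per segment:
    SPEAKER <session> 1 <onset> <duration> <NA> <NA> <speaker> <NA> <NA>"""
    lines = []
    for seg in json.loads(Path(seglst_path).read_text()):
        start = float(seg["start_time"])
        dur = float(seg["end_time"]) - start
        lines.append(
            f"SPEAKER {seg['session_id']} 1 {start:.2f} {dur:.2f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines)

def seglst_to_ctm(seglst_path: str) -> str:
    """One CTM line per word: <session> <channel> <start> <duration> <word>.
    SegLST has no per-word times, so segments are split evenly (a simplification)."""
    lines = []
    for seg in json.loads(Path(seglst_path).read_text()):
        words = seg["words"].split()
        start = float(seg["start_time"])
        dur = (float(seg["end_time"]) - start) / max(len(words), 1)
        for i, word in enumerate(words):
            lines.append(f"{seg['session_id']} 1 {start + i * dur:.2f} {dur:.2f} {word}")
    return "\n".join(lines)
```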
For ASR+diarization error analysis we recommend the super useful MeetEval visualization tool (to be presented at ICASSP 2024 in a show-and-tell session).
You can upload your predictions and references in any supported file format (SegLST is preferred) to the JupyterLite notebook running in the browser, run the visualization in a Jupyter notebook locally, or build a visualization from the command line using:
python -m meeteval.viz html -r ref.stm -h mysystem:hyp.stm -h myothersystem:hyp.stm --out=out --regex='.*'
which will generate HTML files that you can open with any browser.
If you wish to contribute, clone this repo:
git clone https://github.com/chimechallenge/chime-utils
cd chime-utils
and then install with:
pip install -e .
pip install pre-commit
pre-commit install --install-hooks
[1] Park, T. J., Huang, H., Jukic, A., Dhawan, K., Puvvada, K. C., Koluguri, N., ... & Ginsburg, B. (2023). The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System. arXiv preprint arXiv:2310.12378.
[2] von Neumann, T., Boeddeker, C., Delcroix, M., & Haeb-Umbach, R. (2023). MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems. arXiv preprint arXiv:2307.11394.
[3] Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., ... & Ryant, N. (2020). CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. arXiv preprint arXiv:2004.09249.
[4] Cornell, S., Wiesner, M., Watanabe, S., Raj, D., Chang, X., Garcia, P., ... & Khudanpur, S. (2023). The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios. arXiv preprint arXiv:2306.13734.