An example of the data directory structure required by ESPnet. The required format can be found in data
|-- data
|   |-- test
|   |   |-- spk2utt # Mapping a speaker-ID to a list of utterance-IDs
|   |   |-- text # Mapping an utterance-ID to a text
|   |   |-- utt2spk # Mapping an utterance-ID to a speaker-ID
|   |   `-- wav.scp # Mapping an utterance-ID to the path of an audio file
|   |-- train
|   |   |-- spk2utt
|   |   |-- text
|   |   |-- utt2spk
|   |   `-- wav.scp
|   `-- valid
|       |-- spk2utt
|       |-- text
|       |-- utt2spk
|       `-- wav.scp
ESPnet follows the data structure developed by Kaldi-asr: a data directory must contain several text files (wav.scp, text, etc.) that share a common format for describing the corpus. Each line consists of exactly two columns separated by a space. The first column is taken as the ID and the second as its value.
<ID> <value>
<ID> <value>
...
For example, each line of wav.scp has the form <Sample ID> <Wave file path>.
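As a rough illustration of the two-column format described above, the following Python sketch reads such a file into a dictionary. The helper name `read_two_column` is hypothetical and not part of ESPnet or Kaldi; it only assumes that the ID is everything before the first whitespace and that the value is the rest of the line.

```python
from pathlib import Path
from typing import Dict


def read_two_column(path: str) -> Dict[str, str]:
    """Read a file of "<ID> <value>" lines into a dict.

    Hypothetical helper for illustration; not an ESPnet or Kaldi API.
    """
    table: Dict[str, str] = {}
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only once: the value (e.g. a transcript) may contain spaces.
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"{path}:{line_no}: expected '<ID> <value>', got {line!r}")
        key, value = parts
        if key in table:
            raise ValueError(f"{path}:{line_no}: duplicate ID {key!r}")
        table[key] = value
    return table
```

Splitting only on the first whitespace matters for files such as text, where the value is a whole transcript rather than a single token.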
- The directory names under data are arbitrary: train, valid, and test can be tr, cv, and eval, for example.
- Each path in wav.scp can be either an absolute path or a path relative to the base directory (egs2/<corpus-name>/<task-name>, e.g. egs2/an4/asr1).
    uttidA /absolute/path/uttidA.wav
    uttidB ./relative/path/uttidB.wav
  - We generally assume monaural, 16-bit signed-integer PCM audio files. Any sampling rate is okay.
    - Please check your audio format in advance, using sox or ffmpeg for example.
        soxi youraudio.wav
  - (THIS IS AN EXTENSION BY ESPNET, NOT KALDI FORMAT) Not only wav but also flac can be used (the supported audio formats can be extended).
      uttidA a.flac
      uttidB b.flac
  - The wavs directory exists at the same level as data in this example, but this is not a requirement.
- If you don't have speaker information, the speaker ID in utt2spk can be dummy data, because most recipes don't actually use speaker information.
    uttidA dummy
    uttidB dummy
    uttidC dummy
    ...
- spk2utt can be generated from utt2spk using utils/utt2spk_to_spk2utt.pl, and utt2spk can be generated from spk2utt using utils/spk2utt_to_utt2spk.pl.
    utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
    utils/spk2utt_to_utt2spk.pl data/train/spk2utt > data/train/utt2spk
- To check that your directory satisfies the expected format, and to force it into that format, use utils/validate_data_dir.sh and utils/fix_data_dir.sh.
    # Check
    utils/validate_data_dir.sh --no-feats data/train
    # Force the format (this change is irreversible)
    utils/fix_data_dir.sh data/train
- It's okay for the data directory to contain other files (foo.txt in the following example). They are not referenced.
    data
    `-- test
        |-- foo.txt
        |-- spk2utt
        |-- text
        |-- utt2spk
        `-- wav.scp
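To make the relationship between utt2spk and spk2utt concrete, here is a minimal Python sketch that builds a dummy utt2spk mapping and inverts it into spk2utt. In a recipe you would normally use the Perl scripts under utils/; the function names below are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from typing import Dict, List


def make_dummy_utt2spk(utt_ids: List[str], speaker: str = "dummy") -> Dict[str, str]:
    """Assign the same placeholder speaker to every utterance (hypothetical helper)."""
    return {utt_id: speaker for utt_id in utt_ids}


def utt2spk_to_spk2utt(utt2spk: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert utterance->speaker into speaker->list of utterances."""
    spk2utt: Dict[str, List[str]] = defaultdict(list)
    # Iterate in sorted order so the output is deterministic.
    for utt_id in sorted(utt2spk):
        spk2utt[utt2spk[utt_id]].append(utt_id)
    return dict(spk2utt)


def format_spk2utt(spk2utt: Dict[str, List[str]]) -> str:
    """Render spk2utt as "<speaker-ID> <utt-ID> <utt-ID> ..." lines."""
    return "".join(
        f"{spk} {' '.join(utts)}\n" for spk, utts in sorted(spk2utt.items())
    )
```

For example, `format_spk2utt(utt2spk_to_spk2utt(make_dummy_utt2spk(["uttidA", "uttidB"])))` yields a single line that lists both utterances under the speaker `dummy`.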
|-- segments
|-- spk2utt
|-- text
|-- utt2spk
`-- wav.scp
If your audio data consists of long recordings and each audio file contains multiple utterances, you need to provide a segments file that specifies the start time and end time of each utterance. The format is <utterance_id> <wav_id> <start_time> <end_time> (times are in seconds).
sw02001-A_000098-001156 sw02001-A 0.98 11.56
...
Note that when segments is used, the first column of wav.scp is the <wav_id> referenced in segments, not the <utterance_id>.
sw02001-A /path/to/sw02001-A.wav
...
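The segments format can be sketched in Python as follows. This is only an illustration of how one line maps to a time range within a recording: the `Segment` type and both functions are hypothetical names, and the 16000 Hz sampling rate in the usage note is an assumption made for the example.

```python
from typing import NamedTuple, Tuple


class Segment(NamedTuple):
    utt_id: str   # utterance ID (key of text and utt2spk)
    wav_id: str   # recording ID (key of wav.scp)
    start: float  # start time in seconds
    end: float    # end time in seconds


def parse_segment(line: str) -> Segment:
    """Parse one "<utterance_id> <wav_id> <start_time> <end_time>" line."""
    utt_id, wav_id, start, end = line.split()
    return Segment(utt_id, wav_id, float(start), float(end))


def sample_range(seg: Segment, sample_rate: int) -> Tuple[int, int]:
    """Convert the segment's times into (first, last + 1) sample indices."""
    return round(seg.start * sample_rate), round(seg.end * sample_rate)
```

With the line from the example above, `parse_segment("sw02001-A_000098-001156 sw02001-A 0.98 11.56")` refers to the recording `sw02001-A`, and at an assumed rate of 16000 Hz `sample_range` returns the sample indices that delimit the utterance inside that recording.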
wav.scp can also describe audio that is converted on the fly by an arbitrary command, without actually saving the converted files. The usage is as follows:
foo_id some_command /path/to/foo.wav |
Note that in our recipes the environment variables for scripts are set by path.sh, so please check that the command can be found in ${PATH}.
A line ending with | uses this pipe mechanism: our Python script runs the command and reads its output through the pipe (we use https://github.com/nttcslab-sp/kaldiio for this).
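The pipe convention can be illustrated with a small standard-library sketch. This is not how kaldiio is implemented; it is a simplified stand-in that assumes the command writes a complete, well-formed WAV file to stdout, and the function name `load_wav_entry` is hypothetical.

```python
import io
import subprocess
import wave
from typing import Tuple


def load_wav_entry(value: str) -> Tuple[int, bytes]:
    """Return (sample_rate, raw PCM bytes) for the value column of a wav.scp line.

    Simplified illustration only; real recipes rely on kaldiio for this.
    """
    value = value.strip()
    if value.endswith("|"):
        # Pipe form: run the command and take the WAV data from its stdout.
        command = value[:-1].strip()
        wav_bytes = subprocess.run(
            command, shell=True, check=True, stdout=subprocess.PIPE
        ).stdout
    else:
        # Plain form: the value is a path to an audio file.
        with open(value, "rb") as f:
            wav_bytes = f.read()
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate(), wav.readframes(wav.getnframes())
```

Because the command is run through the shell, it must be resolvable from the environment of the calling process, which is why the ${PATH} check above matters.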
e.g. Change the sampling rate, encoding, bit depth, etc.
sox input.wav -b 16 -e signed-integer -r 16000 -t wav - |
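e.g. Channel selection (the remix effect of sox picks a single channel, whereas -c 1 mixes all channels down to mono)
sox stereo.wav -t wav - remix 1 |
sox stereo.wav -t wav - remix 2 |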
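When many files need the same on-the-fly conversion, the wav.scp lines can be generated rather than typed by hand. The sketch below builds one such line per utterance; the function name `sox_pipe_entry` is hypothetical, and the sox options (16-bit signed-integer output at a 16000 Hz default) are assumptions chosen to match the audio format expected earlier in this document.

```python
import shlex


def sox_pipe_entry(utt_id: str, wav_path: str, sample_rate: int = 16000) -> str:
    """Build one wav.scp line that converts wav_path on the fly with sox.

    Hypothetical helper; the sox options are an assumption for illustration.
    """
    # Quote the path so that spaces or shell metacharacters do not break the command.
    return (
        f"{utt_id} sox {shlex.quote(wav_path)} "
        f"-b 16 -e signed-integer -r {sample_rate} -t wav - |"
    )
```

Writing the returned strings, one per line, to data/train/wav.scp gives a file in which every entry uses the pipe mechanism.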