revdotcom/speech-datasets

GigaSpeech Training Set

Closed this issue · 2 comments

Thank you for the great open-source work!

The paper "UPDATED CORPORA AND BENCHMARKS FOR LONG-FORM SPEECH RECOGNITION" does not state which subset of the GigaSpeech data was used for training. It mentions the GigaSpeech M, 200h, and XL subsets, but it does not say which one was used in the experiments.

Could you confirm whether the training data for the GigaSpeech experiments was the Train (200h) subset?

Thanks again!

jdrex commented

Hi! Sorry for the delay! The training data for Table 2 and the "original" rows of Table 3 is the M subset (original segments). The training data for the "+ longform" rows of Table 3 is the M subset (original segments) mixed with the 200h subset (30s segments). Thank you for pointing out that this information is missing from the paper - we'll make sure to update it with the information necessary to replicate the experiments.
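For anyone trying to replicate the experiments, the mapping described above can be written down as a small lookup. This is only an illustrative sketch: the labels, the dictionary keys, and the `training_sources` helper are hypothetical names chosen here, not identifiers from the paper or from this repository, and the sketch says nothing about mixing ratios, which are not given in this thread.

```python
# Hypothetical labels for the two data sources named in the reply above.
M_ORIGINAL = "GigaSpeech M subset (original segments)"
LONGFORM_200H = "GigaSpeech 200h subset (30s segments)"

# Training data per result group, as stated in the reply.
# Keys are (table, row group); the key format is an assumption of this sketch.
TRAINING_DATA = {
    ("Table 2", "all"): [M_ORIGINAL],
    ("Table 3", "original"): [M_ORIGINAL],
    ("Table 3", "+ longform"): [M_ORIGINAL, LONGFORM_200H],
}


def training_sources(table: str, rows: str) -> list[str]:
    """Return a copy of the training sources for a given table and row group."""
    return list(TRAINING_DATA[(table, rows)])


if __name__ == "__main__":
    for key, sources in TRAINING_DATA.items():
        print(key, "->", " + ".join(sources))
```

Note that none of these configurations trains on the 200h subset alone: it appears only in the "+ longform" rows, mixed with the M subset.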

OK! Thank you very much!