GigaSpeech Training Set

Question

Closed this issue a year ago · 2 comments

Thank you for the great open-source work!

The paper "Updated Corpora and Benchmarks for Long-Form Speech Recognition" does not state which subset of the GigaSpeech data was used for training. It mentions several subsets (the GigaSpeech M subset, the GigaSpeech 200h subset, and the GigaSpeech XL subset) but does not say which one was used for the experiments.

Could you confirm whether the training data for the GigaSpeech experiments is the Train (200h) subset?

Thanks again!

Answer 1 · 2023-12-26T19:45:48.000Z

Hi! Sorry for the delay! The training data for Table 2 and the "original" rows of Table 3 is the M subset (original segments). The training data for the "+ longform" rows of Table 3 is the M subset (original segments) mixed with the 200h subset (30s segments). Thank you for pointing out that this information is missing from the paper; we'll make sure to update it with the details needed to replicate the experiments.
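
For readers trying to reproduce this, the two training configurations described above can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the thread does not say how the two sets are mixed (ratio, sampling, or ordering), so plain concatenation followed by a seeded shuffle is an assumption, and `mix_training_sets` and the placeholder records are hypothetical names made up for the example.

```python
import random


def mix_training_sets(original_segments, longform_segments, seed=0):
    """Concatenate two lists of utterance records and shuffle them.

    Assumption: a simple concatenation plus a seeded shuffle. The actual
    mixing strategy used in the paper is not stated in this thread.
    """
    mixed = list(original_segments) + list(longform_segments)
    random.Random(seed).shuffle(mixed)  # seeded so the order is reproducible
    return mixed


# Hypothetical placeholder records. Real manifests would be built from the
# GigaSpeech M subset (original segments) and the 200h subset (30s segments).
m_subset = [{"id": f"m_{i}", "source": "M (original segments)"} for i in range(4)]
longform_200h = [{"id": f"lf_{i}", "source": "200h (30s segments)"} for i in range(2)]

# Table 2 and the "original" rows of Table 3: M subset only.
original_train = m_subset

# "+ longform" rows of Table 3: M subset mixed with the 200h subset.
longform_train = mix_training_sets(m_subset, longform_200h)

print(len(original_train), len(longform_train))
```

With real data, the two placeholder lists would be replaced by whatever manifest format your training toolkit expects; only the choice of subsets per configuration comes from the answer above.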

Answer 2 · 2024-01-04T01:39:53.000Z

OK! Thank you very much!