Ground truth transcriptions contain no timestamps
snakers4 opened this issue · 3 comments
Hi,
Ground truth transcriptions contain no timestamps, e.g.:
Also, it is strange that the outputs of other systems (e.g. Google, Amazon) contain timestamps, whereas your system's outputs are in a different format.
Is all of this a bug, or a feature?
Can your dataset be just used as-is without pulling extra dependencies / tools?
Best,
Alex
Hi there,
With respect to the transcriptions, that's actually a choice we made. Our method of generating timestamps wouldn't be perfect, and we'd have to leave some tokens without timing information. As a result, we decided to provide the ground truth transcriptions as-is, without any timestamps.
With respect to our outputs, that's a great catch - I'll make a PR to put them into the same format as the other systems for ease of use.
The dataset can definitely be used as-is without extra tools! We recommend using our fstalign tool for the sake of reproducibility and for the features it provides for WER calculation. For example, it facilitates calculating WER by entity class. But feel free to use whichever tool works best for your use case.
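For anyone who wants to see what a WER calculation involves without any extra tools, here is a minimal sketch in plain Python. It is a generic word-level edit-distance computation, not fstalign's algorithm or interface; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and it does no text normalization or per-entity breakdown.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumes both inputs are already normalized and whitespace-tokenizable.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

Calling `word_error_rate(ground_truth_text, system_output_text)` on a pair of plain-text transcripts gives a single overall score. Results from a sketch like this may differ from those of a dedicated tool, since normalization and alignment choices affect the count.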
Best,
Miguel
We've updated the output directories of our models to include a directory with the nlp format to match the other system outputs.
I'll be closing this issue for the time being, but if you have any more questions, feel free to comment again or open a new issue!
Best,
Miguel
Many thanks