Wrong segmentation of data in the Italian dataset

giampierosalvi opened this issue 2 years ago · 0 comments

Hi,
I have been listening to some of the utterances in the test set of the Italian dataset and noticed that the audio segmentation is wrong most of the time. Either the transcription contains several words before and after what is actually in the audio, or, vice versa, the transcription is missing words that are spoken at the beginning or end of the audio. The problem is significant: most of the utterances I have listened to are affected, and the mismatch is usually 5-10 extra or missing words.

Here is the code snippet I use to load and listen to the dataset:

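import sounddevice as sd
from datasets import load_dataset

# Load the Italian test split of VoxPopuli from the Hugging Face Hub.
ds = load_dataset("facebook/voxpopuli", "it", split="test")

for item in ds:
    # Print the transcription, then play the corresponding audio segment.
    print(item["normalized_text"])
    sd.play(item["audio"]["array"], item["audio"]["sampling_rate"])
    input("Press Enter to continue...")
    sd.stop()  # stop playback before moving on to the next item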

I have also downloaded the data with the scripts in this repository, like this:

python -m voxpopuli.download_audios --root [...] --subset asr
python -m voxpopuli.get_asr_data --root [...] --lang it

However, the problem is the same as in the Hugging Face dataset.

Just to give you an example, 20170403-0900-PLENARY-17-it_20170403-20:18:14_4, which happens to be the first item in the Hugging Face test set, has the following transcription:
"questo è fondamentale se pensiamo a paesi come la malesia che hanno fatto investimenti anche importanti per differenziarsi da quelli che sono stati elementi di sfruttamento di mancato rispetto dell'ambiente di queste produzioni."

However, the audio starts from the end of the word "malesia" and ends with "queste produzioni e se da una parte abbiamo riscontrato che alcuni paesi si diff" (where the last word is probably "differenziano", but it was cut off in the recording).
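To make the size of the shift easier to report, here is a minimal sketch that counts how many words sit only in the transcription or only in the audio at each edge. It is not part of the VoxPopuli tooling: `boundary_mismatch` is a hypothetical helper name, and the `heard` string is what I typed by hand from the description above, so it is an assumption rather than dataset output. It only uses `difflib` from the standard library.

```python
from difflib import SequenceMatcher


def _words(text):
    """Lowercase, split on whitespace, and strip punctuation at word edges."""
    stripped = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return [w for w in stripped if w]


def boundary_mismatch(transcript, heard):
    """Count words at the start and end that appear on one side only.

    `transcript` is the dataset text; `heard` is a manual transcription of
    what is actually audible in the clip (hypothetical input, typed by hand).
    """
    ref, hyp = _words(transcript), _words(heard)
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, so drop it.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None  # the two texts share no words at all
    first, last = blocks[0], blocks[-1]
    return {
        "transcript_only_start": first.a,
        "audio_only_start": first.b,
        "transcript_only_end": len(ref) - (last.a + last.size),
        "audio_only_end": len(hyp) - (last.b + last.size),
    }


transcript = (
    "questo è fondamentale se pensiamo a paesi come la malesia che hanno "
    "fatto investimenti anche importanti per differenziarsi da quelli che "
    "sono stati elementi di sfruttamento di mancato rispetto dell'ambiente "
    "di queste produzioni."
)
# Typed by hand from listening to the clip; the final word is cut off.
heard = (
    "che hanno fatto investimenti anche importanti per differenziarsi da "
    "quelli che sono stati elementi di sfruttamento di mancato rispetto "
    "dell'ambiente di queste produzioni e se da una parte abbiamo "
    "riscontrato che alcuni paesi si diff"
)

result = boundary_mismatch(transcript, heard)
print(result)
```

For this utterance the sketch reports 10 words at the start that are only in the transcription and 12 words at the end that are only in the audio, which suggests the audio window is shifted later relative to the text rather than simply being too short or too long.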

Can you please verify the data?
Is this problem in the original data from the European Parliament, or is it related to some processing you have done when you prepared the data?
Can it be fixed?

Thank you
Giampiero