Problems in concatenate_dataset
George0828Zhang opened this issue · 0 comments
In concatenate_dataset():
distil-whisper/training/run_pseudo_labelling.py
Lines 644 to 671 in 66ac8dd
From my understanding, the logic in the for loop is:
- If either:
  - Adding the current utterance to audio_sample exceeds 30s
  - The current speaker is different from the previous one (prev_speaker)
- Then save the concatenation up to the previous utterance (audio_sample), excluding the current utterance.
Since the concatenated sample does not contain the current utterance, we have:
- The appended speaker should be prev_speaker rather than speaker.
- condition_on_prev signifies continuity at the start of the current utterance, so it should be shifted to the right by 1 (e.g. initialize as condition_on_prev = [0]).
Meanwhile, it seems that the very last accumulated sample in each batch does not get appended, i.e. when the for loop exits, there is still an (audio_sample, text_sample) pair of <= 30s that should have been appended but wasn't.
These may not seem significant, but when finetuning on a custom dataset with diverse speakers, where condition_on_prev is expected to be true a lot, they will produce incorrect training signals.
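To make the three points concrete, here is a minimal standalone sketch of what I would expect the loop to do. It is not the code from run_pseudo_labelling.py: the function name concatenate_utterances, the (audio, text, speaker) tuple input, the dict output and the max_len parameter (counted in samples) are all made up for illustration. I am also assuming that a split caused only by the length limit counts as a continuation, while a speaker change does not.

```python
def concatenate_utterances(utterances, max_len):
    """Greedily merge consecutive same-speaker utterances (hypothetical helper).

    utterances: list of (audio, text, speaker) tuples, audio being a list of samples.
    max_len: maximum number of samples per concatenated segment.
    """
    segments = []
    if not utterances:
        return segments

    first_audio, text_sample, prev_speaker = utterances[0]
    audio_sample = list(first_audio)
    # Flag for the segment currently being built. The very first segment
    # has nothing before it, so it starts at 0 (the "shift right by 1").
    next_condition = 0

    def save(audio, text, speaker, condition):
        segments.append(
            {
                "audio": audio,
                "text": text,
                "speaker": speaker,
                "condition_on_prev": condition,
            }
        )

    for audio, text, speaker in utterances[1:]:
        same_speaker = speaker == prev_speaker
        fits = len(audio_sample) + len(audio) <= max_len

        if same_speaker and fits:
            audio_sample.extend(audio)
            text_sample += " " + text
        else:
            # The saved segment ends at the previous utterance, so it is
            # labelled with prev_speaker, not the current speaker.
            save(audio_sample, text_sample, prev_speaker, next_condition)
            # Assumption: the next segment continues the previous one only
            # if the split was caused by length, not by a speaker change.
            next_condition = 1 if same_speaker else 0
            audio_sample, text_sample = list(audio), text

        prev_speaker = speaker

    # Flush the last accumulated segment, which the loop never saves.
    save(audio_sample, text_sample, prev_speaker, next_condition)
    return segments
```

With four utterances of 10, 10, 15 and 5 samples, the first three from one speaker and the last from another, and max_len=30, this gives three segments: the first two utterances merged, then the third on its own with condition_on_prev set, then the last utterance for the new speaker.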