Question: should the pseudo-labelling model and teacher model be the same?
guynich opened this issue · 2 comments
If I want to use the medium.en model as the teacher, would using a different model such as large-v3 for pseudo-labelling be suitable for the distil-whisper training methodology? Or should the same model always be used for both?
Ideally you would use the same model for both. Since the KL loss is computed over the sequence of generated token ids, we want the reference model in the KL loss (the teacher during training) to be the same model that generated the sequence of pseudo-labels (the model used during pseudo-labelling). Otherwise the KL term is evaluated against a distribution that doesn't match the labels, and we don't get the correct KL loss values.
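To make this concrete, here's a minimal sketch of the kind of KL term involved (PyTorch-style, with assumed tensor names and shapes; not the actual distil-whisper training code):

```python
import torch
import torch.nn.functional as F

def kl_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student distributions.

    Both logits tensors have shape (batch, seq_len, vocab_size) and are
    scored on the same pseudo-labelled token ids. If the teacher is not
    the model that generated those pseudo-labels, its distribution no
    longer corresponds to the label sequence, and this term is computed
    against the wrong reference.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2, as is conventional for distillation losses.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature**2
```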
Broadly speaking, it's always best to use the most performant model as the teacher in order to maximise the performance of your student model. That means using large-v3 for both pseudo-labelling and distillation: you get the highest-accuracy pseudo-labels, and thus maximise the accuracy of your student model.
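In practice that just means pointing both stages at the same checkpoint. A hypothetical sketch using the Hugging Face transformers API (variable names are illustrative):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

checkpoint = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(checkpoint)
teacher = WhisperForConditionalGeneration.from_pretrained(checkpoint)

# Pseudo-labelling: generate transcript ids with the teacher, e.g.
#   pred_ids = teacher.generate(input_features)
# then reuse this exact same `teacher` model as the KL reference
# when training the student.
```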
Thank you for the helpful comments. Makes sense. Closing.