perceptiveshawty/RankCSE

Training with a Different Student Model?

MohammadVahidiPro opened this issue · 3 comments

Hello,

I wanted to express my gratitude for your contributions to the community. Your work has been incredibly helpful.

I am working on a project involving the training of a DistilBERT base student model using your framework. I have a question regarding the choice of teachers. Do I need to select teachers that are also DistilBERT-based, or can I use the same teachers used in the original paper?

Do you believe there are any further adjustments needed when training a different base model? I would greatly appreciate your insights and advice on this matter.

Thank you sincerely for your help.

Best regards,

See *On the Efficacy of Knowledge Distillation*: a small-capacity model like DistilBERT will generally require a smaller teacher (or an under-trained large teacher) for distillation to work well. Using another DistilBERT-family model as the teacher will probably give better results.
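
For context, here is a minimal sketch (not the repo's actual teacher-loading code) of how one might compute teacher similarity scores from a DistilBERT-family checkpoint; the checkpoint name and the mean pooling below are illustrative assumptions, not something prescribed by this repo:

```python
# Hedged sketch: teacher similarity scores from a DistilBERT-family encoder,
# so the student and teacher stay in a similar capacity/architecture family.
import torch
from transformers import AutoModel, AutoTokenizer

teacher_name = "distilbert-base-uncased"  # hypothetical choice of teacher checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModel.from_pretrained(teacher_name).eval()

@torch.no_grad()
def teacher_similarities(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = teacher(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)             # mean pooling over tokens (assumption)
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return emb @ emb.T                                     # (B, B) cosine similarity matrix

sims = teacher_similarities(["a cat sits on a mat", "a dog runs in the park"])
```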

There are also some ideas about how the sharpness of the teacher's labels affects distillation, so it may be worth sweeping different temperature values (for both the student and the teacher) and seeing if that helps.
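
Roughly what I mean by the temperature sweep, as a hedged sketch (this is not the exact loss used in this repo; the KL form, the `tau` names, and the grid values are assumptions for illustration):

```python
# Sketch of sweeping separate student/teacher temperatures for a
# soft-label distillation term over similarity distributions.
import itertools
import torch
import torch.nn.functional as F

def distill_loss(student_sims, teacher_sims, tau_s, tau_t):
    # Row-wise KL(teacher || student) over temperature-scaled similarity rows.
    p_teacher = F.softmax(teacher_sims / tau_t, dim=-1)
    log_p_student = F.log_softmax(student_sims / tau_s, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Example grid; the specific values are just a starting point to tune from.
for tau_t, tau_s in itertools.product([0.025, 0.05, 0.1], [0.025, 0.05, 0.1]):
    loss = distill_loss(torch.randn(8, 8), torch.randn(8, 8), tau_s, tau_t)
    print(f"tau_t={tau_t}, tau_s={tau_s}, loss={loss.item():.4f}")
```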

Glad to hear that the repo has been useful to you :) and I'm interested to hear about your results!

Edit: another good discussion for context: https://openreview.net/forum?id=h-z_zqT2yJU&noteId=-FLNfm3Pd8q

Fascinating stuff!
Thank you for the fast reply. I greatly appreciate your guidance; I will certainly try out the ideas you mentioned and hopefully get back to you with promising results. Thanks again for this interesting repo.

Best regards