mashrurmorshed/Torch-KWT

where is the Knowledge Distillation in the code


Hello, Knowledge Distillation is mentioned in the paper, but I didn't see it in the code.

Hello. Yeah, Knowledge Distillation is indeed mentioned in the paper. They used a Multi-Headed Attention RNN (MHAtt-RNN) as the "teacher" model and the Keyword Transformer models as the students.

In the Acknowledgements section of the authors' official TF repository, they mention that their whole repository is built upon Google Research's KWS Streaming repository. So they could easily use the MH-Att-RNN model from KWS Streaming for Knowledge Distillation!

However, I'm working in PyTorch, and building from scratch. To reproduce the KD used in the paper, I would need to either a) find a pre-trained PyTorch version of MH-Att-RNN, or b) implement and train one myself! After that I could do KD. That's a lot of work (and the accuracy improvement is quite small), so I haven't attempted it yet :D
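
For reference, here is a minimal sketch of what a standard soft-target KD step could look like in PyTorch, assuming a frozen pre-trained teacher and a KWT student. The temperature, alpha, the `teacher`/`student` variables, and the `train_step` helper are all placeholders for illustration, not values or code from the paper or this repo:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Soft-target distillation loss: a weighted sum of KL divergence against
    the teacher's softened outputs and cross-entropy against the hard labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # The KL term is scaled by T^2 so its gradient magnitude stays comparable
    # to the cross-entropy term (as in Hinton et al.'s formulation).
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard

def train_step(student, teacher, batch, optimizer):
    """Hypothetical training step: `teacher` is a frozen pre-trained MH-Att-RNN,
    `student` is a KWT model; both take spectrogram batches and return logits."""
    specs, labels = batch
    with torch.no_grad():
        teacher_logits = teacher(specs)
    student_logits = student(specs)
    loss = kd_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The loss function itself is the easy part; the real work is point a) or b) above, i.e. getting a trained MH-Att-RNN teacher in PyTorch in the first place.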