Where is Distribution Guided Distillation (DGD)?
Opened this issue · 1 comment
pdh930105 commented
I think that DistillationLoss requires distribution-guided distillation (i.e., distilling every block's q, k pair).
However, I can't find the DGD function anywhere in the code.
Can this code reproduce the performance reported in the paper without the DGD function?
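To make the question concrete, here is a minimal sketch (my own guess, not the authors' implementation) of the kind of per-block q/k distillation term I expected to find, assuming the teacher and student each expose a list of `(q, k)` activation pairs with matching shapes:

```python
# Hypothetical sketch of a distribution-guided distillation term over
# per-block query/key activations; names and shapes are assumptions.
import torch
import torch.nn.functional as F


def qk_distillation_loss(student_qk, teacher_qk, eps=1e-6):
    """student_qk / teacher_qk: lists of (q, k) pairs, one per block.

    Each q/k tensor is assumed to have shape (batch, heads, tokens, dim).
    Activations are standardized along the last dim before the MSE,
    which is one plausible reading of "distribution guided" matching.
    """
    def standardize(x):
        return (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + eps)

    loss = 0.0
    for (q_s, k_s), (q_t, k_t) in zip(student_qk, teacher_qk):
        loss = loss + F.mse_loss(standardize(q_s), standardize(q_t))
        loss = loss + F.mse_loss(standardize(k_s), standardize(k_t))
    return loss / max(len(student_qk), 1)
```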
XA23i commented
It's the same as in BiBERT [1]. In fact, I think the whole paper is quite similar to BiBERT and IR-Net [2], just carried over to model quantization. 😬
[1] https://github.com/htqin/BiBERT/blob/91fd347eefc490a87275e66be68bfceb27837aee/transformer/modeling_quant.py#L155
[2] https://openaccess.thecvf.com/content_CVPR_2020/papers/Qin_Forward_and_Backward_Information_Retention_for_Accurate_Binary_Neural_Networks_CVPR_2020_paper.pdf
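For reference, this is roughly what the direction-matching-style loss around modeling_quant.py#L155 in [1] does, as I read it (a paraphrased sketch, not a verbatim copy of BiBERT's code): normalized similarity patterns of the student's and teacher's q/k activations are matched with an MSE.

```python
# Rough sketch of a BiBERT-style direction-matching distillation term
# for one block's query/key activations; shapes are assumptions.
import torch
import torch.nn.functional as F


def direction_matching_loss(q_s, k_s, q_t, k_t):
    """q_*, k_*: tensors of shape (batch, heads, tokens, dim) from one block."""
    def pattern(x):
        # Cosine-like similarity pattern: row-normalize, then x @ x^T.
        x = F.normalize(x, dim=-1)
        return x @ x.transpose(-1, -2)

    return F.mse_loss(pattern(q_s), pattern(q_t)) + F.mse_loss(pattern(k_s), pattern(k_t))
```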