thu-coai/DA-Transformer

model miniaturization

Opened this issue · 3 comments

Hi, I tried to train a miniaturized model with a 6-layer encoder, a 3-layer decoder, and a hidden dimension of 256, but found that the model's accuracy drops sharply. Do you have any suggestions for model miniaturization? Thanks.
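For context, the capacity gap to transformer-base is large. Here is a rough, plain-PyTorch parameter-count sketch (the small model's FFN width of 1024 and 4 attention heads are assumptions, and embeddings as well as DAT-specific link-prediction parameters are excluded):

```python
# Rough parameter-count comparison: transformer-base vs. the miniaturized
# 6-layer-encoder / 3-layer-decoder / 256-dim configuration described above.
import torch.nn as nn

def num_params(module):
    return sum(p.numel() for p in module.parameters())

def enc_dec_params(d_model, ffn_dim, nhead, enc_layers, dec_layers):
    enc = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead, ffn_dim), enc_layers)
    dec = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead, ffn_dim), dec_layers)
    return num_params(enc) + num_params(dec)

base = enc_dec_params(512, 2048, 8, 6, 6)    # transformer-base
small = enc_dec_params(256, 1024, 4, 6, 3)   # miniaturized config (FFN/heads assumed)
print(f"base: ~{base / 1e6:.1f}M params, small: ~{small / 1e6:.1f}M params")
```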

Thanks for your interest. Unfortunately, we have not tried architectures other than transformer-base.
My intuition is that both the encoder and decoder are important for capturing the information in the data. In particular, a large decoder helps glancing training, which is critical for the final performance. I think knowledge distillation may help when reducing the model size.
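In case it helps, a minimal sketch of sequence-level knowledge distillation (Kim & Rush, 2016), the usual recipe for NAT: decode the training sources with a strong autoregressive teacher and train the small student on those outputs instead of the original references. `teacher_translate` is a placeholder for whatever decoding interface your teacher provides:

```python
def build_distilled_corpus(src_path, out_path, teacher_translate, batch_size=64):
    """Replace reference targets with teacher translations of the training sources."""
    with open(src_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        batch = []
        for line in fin:
            batch.append(line.strip())
            if len(batch) == batch_size:
                fout.writelines(hyp + "\n" for hyp in teacher_translate(batch))
                batch = []
        if batch:  # flush the final partial batch
            fout.writelines(hyp + "\n" for hyp in teacher_translate(batch))
```

The student is then trained on the (source, distilled target) pairs exactly as before; the distilled targets are less multimodal, which usually matters most for small NAT models.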
Please feel free to discuss here; we would be very grateful if you could share your findings.

Thanks for your reply. The main problems with the model I trained are reduced translation fluency, multimodality issues, and under- and over-translation. Do you have any experience with these?

@JunchengYao The problems you mention are very common in NAT models. They stem from the parallel prediction and the conditional independence assumption. Many recent studies (including our DAT) are working hard to alleviate these problems, but no mature solution exists yet, especially for small models.
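As a toy (not DAT-specific) illustration of why the conditional independence assumption causes this: when two references are equally plausible, per-position argmax can interleave them into an output that matches neither, which shows up as disfluency and omitted or repeated content:

```python
import numpy as np

# Two equally good references for the same source sentence.
refs = [["thank", "you", "so", "much"],
        ["thanks", "a", "lot", "."]]
vocab = sorted({w for r in refs for w in r})
T = len(refs[0])

# A conditionally independent decoder only learns per-position marginals.
# Emulate marginals where different positions slightly favor different
# references, as typically happens after training on multimodal data.
weights = [[0.55, 0.45],   # position 0 leans toward reference 0
           [0.45, 0.55],   # positions 1-3 lean toward reference 1
           [0.45, 0.55],
           [0.40, 0.60]]
probs = np.zeros((T, len(vocab)))
for t in range(T):
    for r, ref in enumerate(refs):
        probs[t, vocab.index(ref[t])] += weights[t][r]

# Independent argmax per position interleaves the two modes.
print([vocab[probs[t].argmax()] for t in range(T)])  # -> ['thank', 'a', 'lot', '.']
```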