Question on the comparison between GPT and GPT2
mrzjy opened this issue · 4 comments
Hi, thanks for sharing the models! There's a detail that I'm curious about: is there a reason why CDialGPT2LCCC performs worse than CDialGPTLCCC? I understand that, compared with GPT, GPT2 uses pre-LayerNorm and also adds an additional LayerNorm after the final attention block, but I would not expect such a difference to result in much worse performance for CDialGPT2LCCC.
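For clarity, here is a minimal sketch of the two block layouts I mean, in PyTorch-style code (module names and sizes are illustrative, not the actual CDial-GPT implementation; the causal attention mask is omitted for brevity):

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    # GPT-style (post-LN): LayerNorm is applied *after* each residual sum.
    def __init__(self, d_model=768, n_head=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.ln2(x + self.mlp(x))

class PreLNBlock(nn.Module):
    # GPT2-style (pre-LN): LayerNorm is applied *before* each sublayer.
    # GPT2 also applies one extra LayerNorm (ln_f) after the last block,
    # a parameter that has no counterpart in GPT.
    def __init__(self, d_model=768, n_head=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))
```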
Besides, in the paper you mentioned that both CDialGPTLCCC and CDialGPT2LCCC are first pretrained on your Chinese novel dataset. Does this imply that there is also a GPT2_novel model that you did not release (based on which CDialGPT2LCCC is post-trained)?
Since we do not have a GPT2_novel model trained on the Chinese novel corpus, CDialGPT2 is initialized from GPT_novel.
The parameters in GPT2 that do not exist in GPT are initialized from scratch.
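In PyTorch terms, this amounts to non-strict state-dict loading: every parameter whose name and shape match is copied from the checkpoint, and the rest keep their random initialization. A minimal sketch with toy stand-in modules (illustrative only, not the actual CDial-GPT code):

```python
import torch.nn as nn

# Toy stand-ins for the two architectures, for illustration only.
gpt = nn.ModuleDict({
    "wte": nn.Embedding(100, 8),  # parameters shared by both models
    "block": nn.Linear(8, 8),
})
gpt2 = nn.ModuleDict({
    "wte": nn.Embedding(100, 8),
    "block": nn.Linear(8, 8),
    "ln_f": nn.LayerNorm(8),      # GPT2-only: the extra final LayerNorm
})

# strict=False copies every matching parameter; GPT2-only parameters
# keep their fresh random initialization ("from scratch").
missing, unexpected = gpt2.load_state_dict(gpt.state_dict(), strict=False)
print("initialized from scratch:", missing)    # ['ln_f.weight', 'ln_f.bias']
print("ignored from checkpoint:", unexpected)  # []
```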
Thanks for the reply, but how does it make sense to initialize GPT2 with a GPT checkpoint? Could this be part of the reason why your CDialGPT2LCCC performs worse than CDialGPTLCCC?
Q1: "but how does it make sense to initialize GPT2 with GPT checkpoint ?"
A: It is just a try. We just try to figure out would it degrade the performance.
Q2: " Could this be part of the reason why your CDialGPT2LCCC perform worse than CDialGPTLCCC"
A: Yes.
Okay~
Again, thanks for your work~