self-supervised training architecture
Closed this issue · 1 comment
tonyyunyang commented
Hi,
I have been looking into the neural network architecture used for pre-training in detail. However, I find it a bit puzzling, because the figure in the paper and the code in models.py do not match.
class Transformer(nn.Module):
    """ Transformer with Self-Attentive Blocks """
    def __init__(self, cfg):
        super().__init__()
        self.embed = Embeddings(cfg)
        # The original BERT does not use a parameter-sharing strategy
        # self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layers)])
        # To use the parameter-sharing strategy
        self.n_layers = cfg.n_layers
        self.attn = MultiHeadedSelfAttention(cfg)
        self.proj = nn.Linear(cfg.hidden, cfg.hidden)
        self.norm1 = LayerNorm(cfg)
        self.pwff = PositionWiseFeedForward(cfg)
        self.norm2 = LayerNorm(cfg)
        # self.drop = nn.Dropout(cfg.p_drop_hidden)

    def forward(self, x):
        h = self.embed(x)
        for _ in range(self.n_layers):
            # h = block(h, mask)
            h = self.attn(h)
            h = self.norm1(h + self.proj(h))
            h = self.norm2(h + self.pwff(h))
        return h
The code above does not match the figure shown in the paper. Could you please tell me which one I should follow to obtain the ideal result?
dapowan commented
Hi,
Thanks for pointing out our oversight. There should be an Add & Norm after the attention layer, i.e. h = self.norm1(self.attn(h) + h). The experiments in our paper were done with the old code, so you can update it accordingly. Based on my experience, though, this change does not make a large difference to the end performance. Many thanks again for your interest!
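Below is a minimal sketch of how the suggested fix could look when folded back into the forward method of the Transformer class quoted above. Two details are assumptions on my part, not spelled out in the reply: self.norm is taken to mean the existing self.norm1, and the existing self.proj is kept inside the attention residual branch, as in the standard post-norm Transformer block. Treat it as an illustration rather than the official fix.

    def forward(self, x):
        h = self.embed(x)
        for _ in range(self.n_layers):
            # Add & Norm around the attention sub-layer:
            # the residual adds the sub-layer input h, not the attention output
            # (keeping self.proj on the attention output is an assumption)
            h = self.norm1(h + self.proj(self.attn(h)))
            # Add & Norm around the position-wise feed-forward sub-layer
            h = self.norm2(h + self.pwff(h))
        return h

With this change, each sub-layer follows the usual x + sublayer(x) pattern, so the skip connection carries the pre-attention activations forward instead of adding the attention output to a projection of itself.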