self-supervised training architecture
Closed this issue · 1 comment
tonyyunyang commented
Hi,
I have been looking into the neural network architecture used for pre-training in detail. However, I find it a bit puzzling, because the figure in the paper and the code in models.py do not match.
class Transformer(nn.Module):
    """ Transformer with Self-Attentive Blocks """
    def __init__(self, cfg):
        super().__init__()
        self.embed = Embeddings(cfg)
        # The original BERT does not use a parameter-sharing strategy
        # self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layers)])
        # To use the parameter-sharing strategy
        self.n_layers = cfg.n_layers
        self.attn = MultiHeadedSelfAttention(cfg)
        self.proj = nn.Linear(cfg.hidden, cfg.hidden)
        self.norm1 = LayerNorm(cfg)
        self.pwff = PositionWiseFeedForward(cfg)
        self.norm2 = LayerNorm(cfg)
        # self.drop = nn.Dropout(cfg.p_drop_hidden)

    def forward(self, x):
        h = self.embed(x)
        for _ in range(self.n_layers):
            # h = block(h, mask)
            h = self.attn(h)
            h = self.norm1(h + self.proj(h))
            h = self.norm2(h + self.pwff(h))
        return h
The code above does not match the figure shown in the paper. Could you please tell me which one I should follow to obtain the ideal result?
dapowan commented
Hi,
Thanks for pointing out our oversight. There should be an Add & Norm after the attention layer, i.e. h = self.norm1(self.attn(h) + h). The experiments in our paper were done with the old code, so you can update it accordingly. Based on my experience, though, this change does not make a large difference to the end performance. Many thanks again for your interest!
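Below is a minimal sketch of how the suggested fix could look when folded back into the forward method of the Transformer class quoted above. Two details are assumptions on my part, not spelled out in the reply: self.norm is taken to mean the existing self.norm1, and the existing self.proj is kept inside the attention residual branch, as in the standard post-norm Transformer block. Treat it as an illustration rather than the official fix.

    def forward(self, x):
        h = self.embed(x)
        for _ in range(self.n_layers):
            # Add & Norm around the attention sub-layer:
            # the residual adds the sub-layer input h, not the attention output
            # (keeping self.proj on the attention output is an assumption)
            h = self.norm1(h + self.proj(self.attn(h)))
            # Add & Norm around the position-wise feed-forward sub-layer
            h = self.norm2(h + self.pwff(h))
        return h

With this change, each sub-layer follows the usual x + sublayer(x) pattern, so the skip connection carries the pre-attention activations forward instead of adding the attention output to a projection of itself.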