MarcoMeter/episodic-transformer-memory-ppo

ReLU in residual connections?

ibagur opened this issue · 4 comments

Hi,

I am using part of your code for a particular implementation of a transformer architecture that I need for my master's thesis research in RL. I noticed in the original paper (Parisotto et al., 2019) that the authors re-order the LayerNorms so that they are placed at the input of both the multi-head attention and the feed-forward sub-modules. I saw that you also implement this in your code via the config["layer_norm"] setting. However, the paper also mentions, I quote: "Because the layer norm reordering causes a path where two linear layers are applied in sequence, we apply a ReLU activation to each sub-module output before the residual connection (see Appendix C for equations)." In those equations, a ReLU is indeed applied to the output of both the multi-head attention and the feed-forward sub-modules before the residual connection. I did not see that specific step in your code (just the standard residual connection), so I wonder whether there is a particular reason for that, or maybe I am missing something (I'm still quite new to these implementations). In any case, congratulations on your great work; it is helping me a lot to understand the inner workings of such architectures. Thanks!
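
For illustration, this is how I understand one reordered sub-module would look with the extra ReLU (a minimal sketch of my own, not taken from your code; the module and parameter names such as PreLNAttentionBlock, d_model and n_heads are just placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PreLNAttentionBlock(nn.Module):
    """Illustrative pre-LN attention sub-module with ReLU before the residual."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # layer norm moved to the sub-module input
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)                   # pre layer norm
        y, _ = self.attn(y, y, y)          # multi-head attention
        y = F.relu(y)                      # ReLU before the residual, as in Appendix C of Parisotto et al.
        return x + y                       # residual connection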

Thanks for bringing up this issue. GTrXL is still a work in progress. We will investigate this ReLU detail.

I took some time to train TrXL with pre-layer norm plus the missing ReLU activations on Mortar Mayhem Grid and Mystery Path Grid. I did not observe any performance gains.

# GRU gate or skip connection after the multi-head attention sub-module
if self.use_gtrxl:
    # Forward GRU gating
    h = self.gate1(query, attention)
else:
    if self.layer_norm == "pre":
        # Apply ReLU to the attention output before the residual connection
        attention = F.relu(attention)
    # Skip connection
    h = attention + query

# GRU gate or skip connection after the position-wise feed-forward sub-module
if self.use_gtrxl:
    # Forward GRU gating
    out = self.gate2(h, forward)
else:
    if self.layer_norm == "pre":
        # Apply ReLU to the feed-forward output before the residual connection
        forward = F.relu(forward)
    # Skip connection
    out = forward + h
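
For context, gate1 and gate2 above refer to GRU-style gating layers as described in the GTrXL paper. A minimal sketch of such a gate, following the equations in Parisotto et al. (not the exact implementation in this repository; the parameter names and the bias initialization value are assumptions), could look like this:

import torch
import torch.nn as nn

class GRUGate(nn.Module):
    """Illustrative GRU-style gating layer as described in the GTrXL paper."""
    def __init__(self, d_model: int, bias: float = 2.0):
        super().__init__()
        self.Wr = nn.Linear(d_model, d_model, bias=False)
        self.Ur = nn.Linear(d_model, d_model, bias=False)
        self.Wz = nn.Linear(d_model, d_model, bias=False)
        self.Uz = nn.Linear(d_model, d_model, bias=False)
        self.Wg = nn.Linear(d_model, d_model, bias=False)
        self.Ug = nn.Linear(d_model, d_model, bias=False)
        # A positive bias b_g pushes the gate towards the identity map at initialization
        self.bg = nn.Parameter(torch.full((d_model,), bias))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: residual stream input (e.g. query), y: sub-module output (e.g. attention)
        r = torch.sigmoid(self.Wr(y) + self.Ur(x))
        z = torch.sigmoid(self.Wz(y) + self.Uz(x) - self.bg)
        h = torch.tanh(self.Wg(y) + self.Ug(r * x))
        return (1.0 - z) * x + z * h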

Here is a rough plot of the training performance in Mystery Path Grid. Each experiment was run with 4 different seeds.
[Plot: mpg_trxl_gtrxl – training performance in Mystery Path Grid]

I'm closing this issue for now. We decided not to add the ReLU activations, and we don't have the time to investigate this further right now.

Hi,

Many thanks for your feedback and the very interesting results. Very much appreciated. I might take a look at this in my own research, but from what I can see, it does not seem to have much effect (apart from adding a bit more computation to the model).