glassroom/heinsen_routing_2022_paper

[Bug & Reproducibility] Padding tokens are not accounted for

alexdremov opened this issue · 3 comments

The provided code for text classification does not take padding positions into account. Even though the routing algorithm has a mask argument to address this, it is not used:

def forward(self, x):
    x = self.normalize(x)    # [..., n_pos, d_depth, d_inp]
    x = x * self.W + self.B  # [..., n_pos, d_depth, d_inp]
    x = x.flatten(-3, -2)    # [..., n_inp, d_inp]
    x = self.route(x)        # [..., n_out, d_out]
    return x.squeeze(-1)     # if d_out is 1, remove it

As a result, predictions depend on the number of padding tokens in the sequence, which makes results non-reproducible and almost random. The model can produce different predictions for the same sample under a different batch-shuffling random seed.
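For illustration, here is a minimal sketch of how this shows up (the `head` instance and all tensor sizes below are placeholders, not code from this repository): the same hidden states, zero-padded to two different lengths, generally yield different outputs from the classification head.

import torch

# Hypothetical check: `head` is an instance of the classification module whose
# forward() is shown above; the sizes here are arbitrary placeholders.
n_pos, d_depth, d_inp = 12, 4, 64
x = torch.randn(1, n_pos, d_depth, d_inp)  # states for the real tokens

def pad_to(t, n_total):
    # Zero-pad along the position dimension, mimicking padding tokens.
    n_pad = n_total - t.shape[1]
    return torch.cat([t, torch.zeros(1, n_pad, d_depth, d_inp)], dim=1)

with torch.no_grad():
    y16 = head(pad_to(x, 16))  # sample padded to 16 positions
    y64 = head(pad_to(x, 64))  # same sample padded to 64 positions

# Without a mask, the padding positions enter the routing,
# so the two outputs generally differ.
print(torch.allclose(y16, y64, atol=1e-5))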

Thank you. I did that on purpose to induce the routing algorithm to learn to ignore (varying numbers of) padding tokens. Remarkably, the algorithm learns to do exactly that. The downside is that if you train the classification head on a small dataset for only a few epochs, you might see a slight variation in test-set accuracy when you shuffle test samples. This is important, so I will add a direct link to your comment on the README. Thank you for pointing it out!

I added a link to this issue on the README. I will leave the code as is, for the record. Thank you again for pointing this out!

Thanks for the clarification! Indeed, my problem domain consists mostly of short texts. I believe that also contributes to the result, as there can be quite a lot of padding positions.

I created this issue as I noticed such artifacts in my model's predictions.

In fact, explicit masking improved the overall score as well as guaranteeing reproducibility.
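For reference, the change on my side looked roughly like the sketch below. The mask plumbing is illustrative only: the exact argument name, shape, and semantics expected by self.route may differ from this repository's routing module, and self.d_depth is assumed to hold the depth dimension.

def forward(self, x, mask=None):
    # x:    [..., n_pos, d_depth, d_inp]
    # mask: [..., n_pos], 1.0 for real tokens, 0.0 for padding (optional)
    x = self.normalize(x)         # [..., n_pos, d_depth, d_inp]
    x = x * self.W + self.B       # [..., n_pos, d_depth, d_inp]
    x = x.flatten(-3, -2)         # [..., n_inp, d_inp]
    if mask is not None:
        # Repeat each position's mask d_depth times so it lines up with the
        # flattened n_inp = n_pos * d_depth axis (position-major order).
        mask = mask.repeat_interleave(self.d_depth, dim=-1)  # [..., n_inp]
    x = self.route(x, mask=mask)  # [..., n_out, d_out]
    return x.squeeze(-1)          # if d_out is 1, remove it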