LIL Implementation

Line 119 in ec914ca

phrase_level_logits = self.phrase_logits(phrase_level_activations)

The implementation of LIL differs from what is described in the paper, and I am a bit confused about that. If we follow this implementation, the mean we are taking does not actually divide by the len(nt) matrix.

Yes, that is correct. We have been trying different variations (sum vs. average of hidden states) to check whether the changes affect performance. This is a new change we tested; it did not affect the results and made the code cleaner, so we decided to keep it. If you find that average vs. sum does make a difference in your experiments, please feel free to reopen the issue and/or send a PR.
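
For anyone who wants to compare the two variants, here is a minimal, framework-free sketch of sum vs. average pooling over phrases. It is not the repository's code: `pool_phrases`, `hidden`, and `nt_mask` are hypothetical names, and the sketch assumes each phrase is described by a 0/1 row marking which tokens it covers, so that the row sum plays the role of len(nt).

```python
def pool_phrases(hidden, nt_mask, average=False):
    """Pool token vectors into one vector per phrase.

    hidden  : list of token vectors (lists of floats), one per token
    nt_mask : list of 0/1 rows, one per phrase; row[i] == 1 means
              token i belongs to that phrase (assumed layout)
    average : if True, divide each phrase sum by the phrase length
    """
    dim = len(hidden[0])
    pooled = []
    for row in nt_mask:
        # Sum the vectors of all tokens covered by this phrase.
        total = [0.0] * dim
        for weight, vec in zip(row, hidden):
            for j in range(dim):
                total[j] += weight * vec[j]
        if average:
            # Divide by the number of tokens in the phrase; guard
            # against an empty (all-zero) row.
            n = max(sum(row), 1)
            total = [x / n for x in total]
        pooled.append(total)
    return pooled


# Three tokens with 2-dimensional hidden states, three phrases.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
nt_mask = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]

summed = pool_phrases(hidden, nt_mask)                  # sum variant
averaged = pool_phrases(hidden, nt_mask, average=True)  # mean variant
```

The two outputs differ only by a per-phrase scale factor, so longer phrases get larger activations under the sum variant. Whether that matters after the linear layer is an empirical question, which is what the comparison above is meant to make easy to test.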