llStringll/Transformer-model-encoder

(In unidirectional attention) Works for a single example in the dataset; as the dataset size goes above 1, the network starts memorizing


The sequence size is 10, so if the corpus size is also 10, there is only one sequence to be learnt and predicted, and that works. But as the corpus grows, increasing the number of sequences (corpus_size/10), the model doesn't converge at all; instead, it diverges.
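For reference, here is a minimal sketch of the setup being described (the variable names are hypothetical, not the project's own): a flat corpus is cut into non-overlapping length-10 sequences, and each sequence is a separate next-entry-prediction example.

```python
import numpy as np

SEQ_LEN = 10
corpus = np.arange(50)                       # toy corpus: 50 tokens -> 5 sequences
num_seqs = len(corpus) // SEQ_LEN            # corpus_size / 10, as above
sequences = corpus[: num_seqs * SEQ_LEN].reshape(num_seqs, SEQ_LEN)

# Next-entry targets: inputs are positions 0..8, targets are positions 1..9.
inputs, targets = sequences[:, :-1], sequences[:, 1:]
print(inputs.shape, targets.shape)           # (5, 9) (5, 9)
```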

Finally found the reason, and it's interesting.
Since the model predicts the next entries based on the previous entries, through attention, the entries at the end of the sequence get predicted correctly even in the multiple-example case. But for the entries at the beginning, there are fewer and fewer prior entries to depend on and attend to as we move towards the start of the sequence, so it can't predict those properly.
For a single sequence, it just "MEMORIZES" the sequence as it is, without paying "attention". But when it comes to multiple sequences, it has to do actual "learning" to predict entries, and it can't "attend" for the entries at the beginning. So it basically tries to "memorize" the initial entries of all the sequences and ends up jumbling those "memorized" entries together, because it has no prior context to "attend" to.
So, clearly, it memorizes the initial entries, and as it moves towards the end of the sequence it starts actual "learning", because there it can, by "attending" to the prior ones.
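To make the context imbalance concrete, here is a minimal sketch (not this project's own code) of a standard causal/unidirectional attention mask: position i may only attend to positions 0..i, so the earliest positions have almost nothing to "attend" to.

```python
import numpy as np

SEQ_LEN = 10
# Lower-triangular causal mask: row i marks the positions that
# position i is allowed to attend to (itself and everything before it).
causal_mask = np.tril(np.ones((SEQ_LEN, SEQ_LEN), dtype=bool))

for i in range(SEQ_LEN):
    print(f"position {i} can attend to {causal_mask[i].sum()} entries")
# position 0 can attend to 1 entry (only itself); position 9 can attend to all 10.
```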
Hence, for an attention-based system (like the human mind), memorization is preferred when prior input is scarce, but actual "learning" by attending to prior inputs (or by understanding the core concepts, in the human-mind case) is preferred when prior inputs are present (or when knowledge of the core concepts is present, in the human-mind case).
Even though memorization is harder than learning by building logic (both in a NN and in the human brain), it is preferred when conceptual knowledge is missing.
In a NN, more depth is required to memorize. That is clearly seen in this case: giving the attention encoder more layers increases accuracy on multiple examples, even though attention to prior entries doesn't improve much, which clearly points to memorization.
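A hedged sketch of that depth experiment, using PyTorch's built-in encoder rather than this project's own implementation: the only thing varied between runs is the number of stacked layers.

```python
import torch.nn as nn

def make_encoder(num_layers: int, d_model: int = 64, nhead: int = 4) -> nn.TransformerEncoder:
    # One self-attention encoder layer, stacked num_layers times.
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

shallow = make_encoder(num_layers=2)  # struggles once there are multiple sequences
deep = make_encoder(num_layers=6)     # accuracy improves with depth, even though
                                      # attention to prior entries barely changes,
                                      # pointing to memorization rather than "learning"
```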

This issue was filed against a very old version of this project, and the current version is a long way past it, but the idea described in the issue still holds true.