Add a sequence classification head to ProGen2
franzigeiger opened this issue · 2 comments
I'm trying to put a sequence prediction head (i.e. sequence classification with a single output) on top of the ProGen2 model and am running into some problems.
For this, one usually pools the last_hidden_state output into a single vector and adds a simple MLP on top. When I do this with ProGen2, the predictions end up exactly the same no matter the input sequence. This happens because the pooler just takes the hidden state corresponding to the first token, which in this case is always a 1. The first hidden state is not influenced by any subsequent tokens, so the predictions are always identical. For some reason the attention for the first token depends solely on the first token itself, which is not the case for other transformer models (the pooler works fine in those cases).
Is it a conscious design decision that each hidden state depends solely on the current token and the tokens that come before it? If so, I assume pooling the hidden state of the last token should work for a value prediction head?
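To illustrate the behaviour described above, here is a minimal NumPy sketch of a single causally masked self-attention head. It is not ProGen2 code: the weights, dimensions, and the function name causal_self_attention are made up for illustration. It only shows that, under a causal mask, the output at the first position is the same for two sequences that share a first token, while the output at the last position differs.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal (lower-triangular) mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    seq_len = x.shape[0]
    # Position i may only attend to positions j <= i, so mask out the future.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; masked entries become exactly zero.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8  # toy hidden size (assumption)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Two toy sequences that share the first token embedding but differ afterwards.
first_token = rng.normal(size=(1, d))
seq_a = np.vstack([first_token, rng.normal(size=(4, d))])
seq_b = np.vstack([first_token, rng.normal(size=(4, d))])

out_a = causal_self_attention(seq_a, w_q, w_k, w_v)
out_b = causal_self_attention(seq_b, w_q, w_k, w_v)

first_equal = np.allclose(out_a[0], out_b[0])    # first position: identical
last_equal = np.allclose(out_a[-1], out_b[-1])   # last position: input-dependent
print(first_equal, last_equal)
```

In a bidirectional encoder there is no such mask, which would explain why first-token pooling works in those models but not in a left-to-right language model.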
I think I figured it out myself:
ProGen2 is based on GPT-J. For GPT-J a sequence-classification implementation exists that handles the problem I'm describing here.
It does indeed use the hidden state of the last token in the sequence (and has to be careful to skip padding tokens if padding was used). I tried it and it seems to work fine that way.
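For reference, the last-token pooling idea can be sketched roughly as below. This is a NumPy illustration under stated assumptions, not the code of the GPT-J implementation: it assumes right padding, an attention mask with 1 for real tokens and 0 for padding, and at least one real token per sequence. The names pool_last_token, w, and b are hypothetical.

```python
import numpy as np

def pool_last_token(hidden_states, attention_mask):
    """Pick the hidden state of the last non-padding token of each sequence.

    hidden_states:  array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    Assumes right padding and at least one real token per sequence.
    """
    last_idx = attention_mask.sum(axis=1) - 1        # index of last real token
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, last_idx]        # shape (batch, hidden)

# Toy batch: 2 sequences, 4 positions, hidden size 3.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1],    # no padding
                 [1, 1, 0, 0]])   # two padding positions on the right
pooled = pool_last_token(hidden, mask)

# Hypothetical single-output linear head on top of the pooled vector.
w = np.ones(3)
b = 0.0
preds = pooled @ w + b
print(pooled)
print(preds)
```

With left padding the last real token would simply sit at index -1 for every sequence, so the index computation would have to change accordingly.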