Question about extracting Q and P representations
Borororo opened this issue · 1 comment
Hi, thanks for sharing the code!
While reading through it, I noticed that
```python
# Skip [CLS]; then the next max_ques_len tokens are question tokens
encoded_question = bert_out[:, 1 : self.max_ques_len + 1, :]
question_mask = (pad_mask[:, 1 : self.max_ques_len + 1]).float()
# Skip [CLS] Q_tokens [SEP]
encoded_passage = bert_out[:, 1 + self.max_ques_len + 1 :, :]
passage_mask = (pad_mask[:, 1 + self.max_ques_len + 1 :]).float()
passage_token_idxs = question_passage_tokens[:, 1 + self.max_ques_len + 1 :]
```
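To make sure I read the indexing right, here is a tiny self-contained sketch of the layout this slicing assumes (the tensor names and sizes below are mine, just for illustration, not from the repo):

```python
import torch

# Illustrative sizes only
batch, max_ques_len, seq_len, hidden = 2, 4, 12, 8
bert_out = torch.randn(batch, seq_len, hidden)  # stand-in for the BERT output

# Assumed layout per instance:
#   [CLS] q1 ... q_max_ques_len [SEP] p1 p2 ...
encoded_question = bert_out[:, 1 : max_ques_len + 1, :]    # fixed-size question block (incl. question padding)
encoded_passage = bert_out[:, 1 + max_ques_len + 1 :, :]   # everything after the question's [SEP]

assert encoded_question.shape == (batch, max_ques_len, hidden)
assert encoded_passage.shape == (batch, seq_len - max_ques_len - 2, hidden)
```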
I know it is much more convenient to separate the Q and P representations for all instances if we pad every question to a fixed length (max_ques_len).
However, could we instead separate Q and P dynamically, based on the actual question length of each instance? Is there any limitation here?
Ideally, yes.
While creating instances you could just keep [CLS] question [SEP] paragraph [SEP] and keep track of the question length. In the forward function of the model you could then extract the question representations based on those lengths. This would still require you to pad questions to the longest question in the batch.
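A minimal sketch of what that could look like, assuming the per-instance question lengths are passed into the forward pass (all names below are illustrative, not from this repo):

```python
import torch

def split_out_question(bert_out, question_lengths):
    """Dynamically pull question vectors out of [CLS] question [SEP] passage [SEP] encodings.

    bert_out:         (batch, seq_len, hidden) encoder output
    question_lengths: (batch,) number of real question tokens per instance
    Returns the question representations padded to the longest question in
    the batch, together with the corresponding mask.
    """
    batch, seq_len, hidden = bert_out.shape
    max_q = int(question_lengths.max().item())

    encoded_question = bert_out.new_zeros(batch, max_q, hidden)
    question_mask = bert_out.new_zeros(batch, max_q)

    for i, q_len in enumerate(question_lengths.tolist()):
        # Question tokens sit immediately after [CLS] (position 0)
        encoded_question[i, :q_len] = bert_out[i, 1 : 1 + q_len]
        question_mask[i, :q_len] = 1.0

    return encoded_question, question_mask
```

The passage side would need the same per-instance handling (variable start offsets), and the loop could be vectorized, which is exactly the extra complexity mentioned below.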
In my opinion, this increases code complexity without a significant improvement in runtime, and it also opens up possibilities for bugs. Hence I didn't try doing this.