Question about extracting Q and P representations
Borororo opened this issue · 1 comment
Hi, thanks for sharing the code!
While reading through it, I noticed that
```python
# Skip [CLS]; then the next max_ques_len tokens are question tokens
encoded_question = bert_out[:, 1 : self.max_ques_len + 1, :]
question_mask = (pad_mask[:, 1 : self.max_ques_len + 1]).float()
# Skip [CLS] Q_tokens [SEP]
encoded_passage = bert_out[:, 1 + self.max_ques_len + 1 :, :]
passage_mask = (pad_mask[:, 1 + self.max_ques_len + 1 :]).float()
passage_token_idxs = question_passage_tokens[:, 1 + self.max_ques_len + 1 :]
```
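To make sure I read the indexing right, here is a tiny self-contained sketch of the layout this slicing assumes (the tensor names and sizes below are mine, just for illustration, not from the repo):

```python
import torch

# Illustrative sizes only
batch, max_ques_len, seq_len, hidden = 2, 4, 12, 8
bert_out = torch.randn(batch, seq_len, hidden)  # stand-in for the BERT output

# Assumed layout per instance:
#   [CLS] q1 ... q_max_ques_len [SEP] p1 p2 ...
encoded_question = bert_out[:, 1 : max_ques_len + 1, :]    # fixed-size question block (incl. question padding)
encoded_passage = bert_out[:, 1 + max_ques_len + 1 :, :]   # everything after the question's [SEP]

assert encoded_question.shape == (batch, max_ques_len, hidden)
assert encoded_passage.shape == (batch, seq_len - max_ques_len - 2, hidden)
```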
I know it is much more convenient to separate the Q and P representations for all instances if we pad every question to a fixed length (max_ques_len).
However, could we instead separate Q and P dynamically, based on the actual question length of each instance? Is there any limitation here?
Ideally, yes.
While creating instances you could just keep [CLS] question [SEP] paragraph [SEP] and keep track of the question length. In the forward function of the model you could then extract the question representations based on those lengths. This would still require you to pad questions to the longest question in the batch.
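A minimal sketch of what that could look like, assuming the per-instance question lengths are passed into the forward pass (all names below are illustrative, not from this repo):

```python
import torch

def split_out_question(bert_out, question_lengths):
    """Dynamically pull question vectors out of [CLS] question [SEP] passage [SEP] encodings.

    bert_out:         (batch, seq_len, hidden) encoder output
    question_lengths: (batch,) number of real question tokens per instance
    Returns the question representations padded to the longest question in
    the batch, together with the corresponding mask.
    """
    batch, seq_len, hidden = bert_out.shape
    max_q = int(question_lengths.max().item())

    encoded_question = bert_out.new_zeros(batch, max_q, hidden)
    question_mask = bert_out.new_zeros(batch, max_q)

    for i, q_len in enumerate(question_lengths.tolist()):
        # Question tokens sit immediately after [CLS] (position 0)
        encoded_question[i, :q_len] = bert_out[i, 1 : 1 + q_len]
        question_mask[i, :q_len] = 1.0

    return encoded_question, question_mask
```

The passage side would need the same per-instance handling (variable start offsets), and the loop could be vectorized, which is exactly the extra complexity mentioned below.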
In my opinion, this increases code complexity without a significant improvement in runtime, and it also opens up possibilities for bugs. Hence I didn't try doing this.