mlvlab/Flipped-VQA

Concerns and Clarifications Regarding MCQ to Generation Task Conversion

Closed this issue · 3 comments

Hi there!
I have been working on converting the task from the MCQ setting to a generation setting. To that end, I modified the data loader to remove the choices from the input and to make the target output the full answer directly.

Here is a summary of my additions to support the generation task (a rough sketch of these steps follows the list):

  • Extracted the most likely token sequence from vqa_output using torch.argmax and reshaped it to match batch and sequence length dimensions.
  • Created a mask (vqa_placeholder_mask) to identify the answer part in the sequence.
  • Implemented logic to extract answers from each choice in the batch, considering start and end tokens.
  • Encoded extracted answers to tensors, padded them for uniform length, and converted them into embeddings.
  • Aggregated the answer embeddings and reshaped them to match the batch and embedding size dimensions.
  • Filtered the output tokens based on the placeholder mask to identify relevant answer parts.
  • Processed each set of output tokens, identifying the end of answers using 'eos_id', and embedded the tokens.
  • Aggregated these embeddings by computing the mean along the sequence dimension.
  • Calculated the cosine similarity between the generated answer and each option for every instance in the batch.
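For concreteness, here is a rough PyTorch sketch of the steps above (padding details omitted). Everything in it is illustrative of my additions rather than the repository's code; names such as score_options_by_similarity, option_token_ids, and embed_tokens are placeholders I introduce here.

```python
import torch
import torch.nn.functional as F

def score_options_by_similarity(vqa_output, vqa_placeholder_mask,
                                option_token_ids, embed_tokens, eos_id):
    """Illustrative helper, not the repository's code.

    vqa_output:           (N, seq_len, vocab) logits, N = batch * num_options
    vqa_placeholder_mask: (N, seq_len) bool mask marking the answer positions
    option_token_ids:     list of token-id lists, one per candidate option
    embed_tokens:         the model's token-embedding layer (nn.Embedding)
    """
    device = embed_tokens.weight.device

    # 1. Most likely token at every position.
    pred_tokens = torch.argmax(vqa_output, dim=-1)            # (N, seq_len)

    # 2. Keep only the answer span of each sequence, truncated at the first EOS.
    generated = []
    for i in range(pred_tokens.size(0)):
        ans_tokens = pred_tokens[i][vqa_placeholder_mask[i]]
        eos_pos = (ans_tokens == eos_id).nonzero(as_tuple=True)[0]
        if eos_pos.numel() > 0:
            ans_tokens = ans_tokens[: eos_pos[0]]
        generated.append(ans_tokens)

    # 3. Embed the generated answers and the candidate options,
    #    mean-pooling over the sequence-length dimension.
    def mean_embed(token_ids):
        if token_ids.numel() == 0:                            # guard against empty answers
            return torch.zeros(embed_tokens.embedding_dim, device=device)
        return embed_tokens(token_ids.to(device)).mean(dim=0)

    gen_embs = torch.stack([mean_embed(t) for t in generated])                           # (N, D)
    opt_embs = torch.stack([mean_embed(torch.tensor(ids)) for ids in option_token_ids])  # (O, D)

    # 4. Cosine similarity between every generated answer and every option.
    return F.cosine_similarity(gen_embs.unsqueeze(1), opt_embs.unsqueeze(0), dim=-1)     # (N, O)
```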

However, when training the model, I noticed that the answers the model outputs are not meaningful, and thus the similarity computation does not work as expected.
I can see in the code that during inference the model is given all of the choices, so I interpreted this as: "the model is given the same input multiple times, since the choice part is not included in the loss computation anyway". Based on this assumption, I extracted the first output from every batch and used it as the output for the similarity computations. However, debugging the results showed that my assumption was incorrect, and I'm not sure which approach I should take now to get the similarity right.

Here, I'm summarizing my questions:

  • Is the assumption that the output should be the same across the 5 options due to input masking correct, or is it more appropriate to input only one option and compare the generated answer?
  • Are there any potential issues or improvements in the way answers are extracted, encoded, and embedded?
  • Is the current method of calculating cosine similarity between the generated answer and the other options the optimal approach for this task?

First, what is the difference between vqa_placeholder_mask and vqa_label? I think you can use vqa_label as it is, simply changing the answer part of the input text (e.g., (A) -> playing soccer).
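In code, the change could be as simple as something like the following (a hypothetical helper for illustration, not the actual dataloader code):

```python
# Hypothetical helper, not the actual dataloader code: keep vqa_label as it is
# and only change how the answer is rendered in the input text.
def render_answer(options, answer_idx, generation=True):
    if generation:
        return options[answer_idx]                    # e.g., "playing soccer"
    return f"({chr(ord('A') + answer_idx)})"          # e.g., "(A)" in the MCQ setting
```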

Also, if I understand correctly, mean pooling is applied to the embeddings across the sequence length of the answer. However, the loss is generally computed for each individual token (embedding) of the answer sequence; you may refer to qav_loss to see how this works. In that case, you may not need the similarity calculation process at all.
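Schematically, that per-token objective looks like the following (a simplified sketch, not the exact qav_loss implementation; the function name and the -100 ignore index are just common conventions):

```python
import torch.nn.functional as F

# Simplified per-token answer loss, in the spirit of qav_loss (not the exact code).
# logits: (batch, seq_len, vocab); labels: (batch, seq_len) with ignore_index
# at every position that is not part of the answer.
def answer_token_loss(logits, labels, ignore_index=-100):
    # Shift so that position t predicts token t + 1, as in causal LM training.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    # Cross-entropy is averaged over the answer tokens only.
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index)
```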

Finally, if you convert the MCQ setting to a generation task that still involves multiple options, I conjecture that including the options in the input text is more appropriate. However, in my own preliminary experiments, the MCQ setting shows slightly better performance than the generation task when handling multiple options, though this might improve with hyperparameter tuning.

Thank you for the reply,

First, what is the difference between vqa_placeholder_mask and vqa_label? I think you can use vqa_label as it is, simply changing the answer part of the input text (e.g., (A) -> playing soccer).

Yes, you are right. The end result of this operation is still similar to what you mentioned.

Also, if I understand correctly, mean pooling is applied to the embeddings across the sequence length of the answer. However, the loss is generally computed for each individual token (embedding) of the answer sequence; you may refer to qav_loss to see how this works. In that case, you may not need the similarity calculation process at all.

I haven't removed the original loss computation. However, I still want to add the similarity computation to provide a more meaningful analysis in my work.

Finally, if you convert the MCQ setting to a generation task that still involves multiple options, I conjecture that including the options in the input text is more appropriate. However, in my own preliminary experiments, the MCQ setting shows slightly better performance than the generation task when handling multiple options, though this might improve with hyperparameter tuning.

I agree that providing the options would allow the model to perform better, but I would like to see how the model performs in a pure generation task, without the options being given at all.

I am still very concerned about the issue I'm encountering, which causes the model to produce answers that are not meaningful.

I would appreciate it if you could provide the inference script as well.

Thanks!

You may simply remove the call to train_one_epoch() in train.py and add --resume ./your/own/checkpoint.pth to your run command to perform inference.