mbzuai-oryx/groundingLMM

GLaMM-FullScope model generates only a single mask

preddy5 opened this issue · 2 comments

Hi @hanoonaR
Congrats on the CVPR acceptance. Great work, thank you for sharing the code and the model weights.

I have a couple of questions.

--------------------------------------------- Q1 --------------------------------------------------------
I was trying to reproduce the results on the balloon.jpg image available in the repo, using the prompt "Describe the image. Please output interleaved segmentation mask." However, the network does not seem to generate multiple masks, despite the generated text being "The image shows a <p> hot air balloon </p> [SEG] flying over a <p> river </p> [SEG] . The <p> sky </p> [SEG] is visible over the river."

To check whether the issue is on my side, I went a step further and inspected the raw generated_output_ids:

[  319, 13563,  1546,   263, 12758,  5199,   322,   385, 23116, 21082,
         20255, 29889,   450, 20255,  4076,  8444, 29892, 13173, 29892,   322,
          1248,   568,  6089,   304,   278,  5199, 29915, 29879,  5155, 29889,
          3148,  1001, 29901,   450, 32000,  -200, 29871, 32001, 16123,  2247,
           385,   975,  1493,   310,   278,  7623, 29889,    13,  4002, 29581,
           278,  1967, 29889,  3529,  1962,  1006,   280, 10511, 10768,   362,
         11105, 29889,   319,  1799,  9047, 13566, 29901,   450,  1967,  3697,
           263, 32005,  7375,  4799,  6411,   417,   265, 32006, 32004, 22764,
           975,   263, 32005,  8580, 32006, 32004,   869,   450, 32005, 14744,
         32006, 32004,   338,  7962,   975,   278,  8580, 29889,     2]

As you can see, id 29871 (seg_token_idx) is generated only once. I am not sure whether I am missing something in my attempt to reproduce the results, and I would appreciate your educated guess as to what I might be doing wrong.

--------------------------------------------- Q2 --------------------------------------------------------
Another interesting property I observed: when I run tokenizer("[SEG]").input_ids, the output indices are [1, 29871, 32004], whereas tokenizer("a [SEG]").input_ids returns [1, 263, 32004]. As you can see, the tokenizer outputs id 29871 (seg_token_idx) only in the first case. Is this expected? I am curious to understand the intuition behind it.

Thank you, I appreciate any time you can spend to help with my questions.

Regards,
Pradyumna.

Hi @preddy5,

Thank you for your interest in our work.

Regarding your first question, it seems there's been a mix-up with the segmentation token index. In the model checkpoints we've provided, the correct seg_token_idx is actually 32004, not 29871. You can verify this by checking the value of args.seg_token_idx in train.py. Based on your generated_output_ids, 32004 does indeed appear three times, which aligns with the expected behavior for generating multiple masks. The issue might lie within the mask decoder's processing. Could you double-check how the masks are being decoded and ensure that it correctly interprets each occurrence of 32004? This should resolve the issue with mask generation.
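To sanity-check this, here is a minimal, self-contained sketch (plain Python, using the tail of the generated_output_ids you posted, i.e. the assistant's answer portion) that counts how often each candidate seg_token_idx actually occurs:

```python
# Tail of the generated_output_ids reported in the question
# (the part after "ASSISTANT:", where the answer text lives).
generated_output_ids = [
    450, 1967, 3697, 263, 32005, 7375, 4799, 6411, 417, 265,
    32006, 32004, 22764, 975, 263, 32005, 8580, 32006, 32004, 869,
    450, 32005, 14744, 32006, 32004, 338, 7962, 975, 278, 8580,
    29889, 2,
]

# The checkpoint's seg_token_idx is 32004, not 29871.
print(generated_output_ids.count(32004))  # 3 -> three [SEG] tokens, three masks
print(generated_output_ids.count(29871))  # 0 in this portion of the output
```

With seg_token_idx set to 32004, the mask decoder should be triggered three times for this output, matching the three [SEG] markers in the generated caption.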

For your second question, the extra id is expected, but note that it does not come from "[SEG]" itself: "[SEG]" maps to its special-token id 32004 in both cases. The difference comes from how the LLaMA SentencePiece tokenizer handles the prefix space it prepends to the input. When "[SEG]" is the very first piece, that space has nothing to merge with, so it is emitted as the standalone whitespace token 29871. In "a [SEG]", the space attaches to the surrounding text ("a" becomes 263), so no standalone 29871 appears. In other words, 29871 is simply the prefix-space token in the vocabulary, which is also why it must not be used as seg_token_idx.
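To make this concrete, here is a small sketch using the ids from your experiment (taken verbatim from the thread; 29871 is assumed to be the standalone prefix-space token "▁" in the LLaMA vocabulary, and 32004 the added [SEG] special token, as in the released checkpoints):

```python
# Token ids reported in the question (BOS = 1).
seg_alone = [1, 29871, 32004]    # tokenizer("[SEG]").input_ids
seg_after_a = [1, 263, 32004]    # tokenizer("a [SEG]").input_ids

SEG_TOKEN_IDX = 32004     # id of the added special token [SEG]
PREFIX_SPACE_IDX = 29871  # standalone prefix-space token in the LLaMA vocab

# Both sequences contain the [SEG] id exactly once;
# only the leading-space token differs.
assert seg_alone.count(SEG_TOKEN_IDX) == 1
assert seg_after_a.count(SEG_TOKEN_IDX) == 1
assert PREFIX_SPACE_IDX in seg_alone
assert PREFIX_SPACE_IDX not in seg_after_a
```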

I hope this clarifies your queries. Apologies for the delayed response, and I'll make sure to be more prompt in the future.

Hey @hanoonaR
Thank you for the response.
The model works like a charm, generating the expected results after changing seg_token_idx to 32004.
Thank you again for the clarification. I appreciate you helping with my query.

Regards,
Pradyumna.