Evaluation on Semantic and Open-Vocabulary Segmentation
Closed this issue · 5 comments
Thank you for your outstanding work! The model seems to be more tailored towards Referring Image Segmentation, and I'm still somewhat confused about testing for Semantic Segmentation (SS) and Open-Vocabulary Segmentation (OVS). Although the paper mentions that "Semantic segmentation and open-vocabulary segmentation can be reformulated as language-guided paradigm by replacing output layers with computing the similarity between visual and linguistic embeddings," the process still appears unclear to me.
From what I understand, the model seems to output a mask by calculating the similarity between the activated visual features and content-aware linguistic embedding. However, I'm unsure how this is evaluated in SS or OVS. Here's my guess:
For example, in Open-Vocabulary Segmentation, for a given image, we need to identify which categories are present (say, M categories). Then, for each category, the similarity calculation is performed between the activated visual features and content-aware linguistic embedding, ultimately outputting M masks. These masks are then merged to create the final semantic segmentation map.
Could you please confirm if this understanding is correct? If not, could you provide more details on how the model operates for these tasks?
Thank you for your assistance!
Thanks for your interest in our work. Your understanding is generally correct, with a few minor errors. For evaluation on the SS and OVS benchmarks, we directly take all categories in the benchmark as input, without pre-identifying which ones are present in the image. The model generates output logits (a similarity map) for each category, and an argmax operation is applied across categories to combine these outputs into the final segmentation map. If you want to use the model to perform SS on everyday images (not for metric calculation), pre-identifying the categories present is recommended.
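As a rough illustration of the argmax step described above (not the repository's actual evaluation code), the sketch below assumes the per-category similarity maps have already been stacked into a NumPy array of shape (C, H, W). The function name `merge_category_logits` and the toy values are hypothetical.

```python
import numpy as np


def merge_category_logits(logits: np.ndarray) -> np.ndarray:
    """Combine per-category similarity maps (C, H, W) into an (H, W) label map."""
    if logits.ndim != 3:
        raise ValueError("expected logits of shape (C, H, W)")
    # Each pixel is assigned the index of the category with the highest similarity.
    return np.argmax(logits, axis=0)


# Toy example: 3 categories on a 2x2 image (values are made up for illustration).
logits = np.array([
    [[0.90, 0.10], [0.20, 0.30]],  # similarity map for category 0
    [[0.05, 0.80], [0.10, 0.30]],  # similarity map for category 1
    [[0.05, 0.10], [0.70, 0.40]],  # similarity map for category 2
])
label_map = merge_category_logits(logits)
print(label_map)
```

Because every benchmark category competes at every pixel, a category that is absent from the image can still win some pixels, which is presumably why pre-identifying categories is recommended for everyday images.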
OK, I understand. Does each category in the benchmark require its own forward pass? Would this approach be time-consuming? It would be great if you could provide some relevant code as well. Thank you!
To obtain multiple masks for tasks such as semantic segmentation, is it necessary to input multiple prompts, with each prompt corresponding to a single category? For instance, using separate prompts like "all person," "all dogs," "all horses," etc., instead of a single combined prompt like "all person, dogs, horses..."? I'm not sure if my understanding is correct. Could you please clarify?
Thank you so much!
From Figure 2, it appears that prompts may be formatted as "all person, dogs, horses..." For such a prompt, how can multiple masks be obtained? Because from the Visual-Linguistic Decoding section, it seems that a single prompt can only generate one mask. Could you please explain how multiple masks are derived from a single prompt? Thank you!
Sorry for the late reply. Your understanding is correct. The current model is designed to understand a specific prompt and generate one mask for its target.
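Since one prompt yields one mask, a multi-category result would need one prompt per category. The sketch below shows one way that loop might look. It is only an assumption-laden illustration: `predict_logits` stands in for the model's forward pass (its real interface is not given in this thread), `dummy_predict_logits` is a fake used so the example runs, and the "all {}" template is borrowed from the prompts quoted in the question.

```python
from typing import Callable, Sequence

import numpy as np

CATEGORIES = ["person", "dogs", "horses"]


def segment_with_prompts(
    image: np.ndarray,
    categories: Sequence[str],
    predict_logits: Callable[[np.ndarray, str], np.ndarray],
    template: str = "all {}",
) -> np.ndarray:
    """Run one prompt per category and merge the (H, W) maps with argmax."""
    # One call (forward pass) per category; each call returns a single (H, W) map.
    maps = [predict_logits(image, template.format(name)) for name in categories]
    stacked = np.stack(maps, axis=0)  # shape (C, H, W)
    return np.argmax(stacked, axis=0)  # shape (H, W), values index into categories


def dummy_predict_logits(image: np.ndarray, prompt: str) -> np.ndarray:
    """Hypothetical stand-in for the model: treats the 'image' as an (H, W) array of category ids."""
    name = prompt.split(" ", 1)[1]
    return (image == CATEGORIES.index(name)).astype(np.float32)


toy_image = np.array([[0, 1], [2, 1]])
label_map = segment_with_prompts(toy_image, CATEGORIES, dummy_predict_logits)
print(label_map)
```

With C categories this makes C calls to the model, so the cost grows linearly with the size of the vocabulary. Whether the authors batch the prompts in their own evaluation is not stated in this thread.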