Question about "S"
BlingHe opened this issue · 1 comment
Hi,
Thanks for sharing this great work.
I have some questions regarding "S". In section 3.2, you mentioned that "S" contains all category names associated with the task, and in section 3.4, you indicated that "S" varies across different tasks.
As I understand it, for tasks such as referring image segmentation and depth estimation, |S| = 1, representing the given text or a specific category. For semantic segmentation, however, I am unsure what "S" contains when it is said to hold all category names relevant to the task. Does it refer to all category names present in the current image? If so, how is the loss function designed to link the textual information to the image content?
Hi,
For semantic segmentation, we construct S using all the categories in the dataset. In our experiments, we use the ADE20K dataset, which contains 150 categories, and thus |S| = 150.
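
To make the distinction concrete, here is a minimal Python sketch of how S could be assembled per task. It is only an illustration under stated assumptions, not the authors' code: `build_category_set` and the task strings are hypothetical names, the ADE20K category names are replaced by placeholders, and the |S| = 1 case for referring segmentation follows the reading given in the question rather than anything confirmed in the reply.

```python
from typing import List, Optional, Sequence


def build_category_set(
    task: str,
    dataset_categories: Optional[Sequence[str]] = None,
    text_query: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: return the text set S for a given task."""
    if task == "semantic_segmentation":
        # Per the reply: S holds every category in the dataset,
        # not only the categories that appear in the current image.
        if not dataset_categories:
            raise ValueError("semantic segmentation needs the dataset category list")
        return list(dataset_categories)
    if task == "referring_segmentation":
        # Assumption taken from the question: S is just the given text, so |S| = 1.
        if text_query is None:
            raise ValueError("referring segmentation needs a text query")
        return [text_query]
    raise ValueError(f"unknown task: {task}")


# Placeholder names standing in for the 150 ADE20K categories (not the real names).
ade20k_placeholder = [f"category_{i:03d}" for i in range(150)]

S_semseg = build_category_set("semantic_segmentation", dataset_categories=ade20k_placeholder)
S_refer = build_category_set("referring_segmentation", text_query="the dog on the left")

print(len(S_semseg))  # size of S for semantic segmentation
print(len(S_refer))   # size of S for referring segmentation
```

In this sketch S for semantic segmentation is fixed by the dataset, so it is the same for every image, and a ground-truth label can be treated as an index into S.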