microsoft/X-Decoder

The loss of referring segmentation

jshilong opened this issue · 3 comments

Thanks for the great work,

In section 4.1, you mention that the model was pre-trained on "panoptic segmentation, image-text pairs (itp), and referring segmentation". I can't find the details of how you use `Referring Segmentation` data in section 3.4. Would you mind providing more details about the `Referring Segmentation` data loss in the pre-training phase? Or did I miss it?

Thanks

It seems to be essentially a binary classification problem.

Thanks for your interest in our work, and for pointing out that we do not give details on referring segmentation.

  1. Data preparation: We use all the seg-text pairs from refcoco(g/+) dataset and exclude the validation set. In addition, those images that do not have referring seg ground truth, we use instance segmentation as labels (e.g. person -> all person instance).
  2. Loss Function: For each image with ground truth, we do Hungarian matching between predictions and ground truth. Only the text-to-image loss is applied on referring segmentation. For each text, we train the highest-scoring mask proposal toward the ground truth.
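
A minimal sketch of the "highest-scoring proposal" step described in point 2, not the repository's implementation. It assumes that each text and each mask proposal has an embedding, that the score is their dot product, and that the selected mask is trained with binary cross-entropy plus Dice loss; none of these choices is confirmed in this thread. The Hungarian matching step is omitted, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def select_proposal(text_emb, proposal_embs):
    """Index of the proposal with the highest dot-product score (assumed scoring)."""
    scores = [sum(t * p for t, p in zip(text_emb, emb)) for emb in proposal_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def bce_loss(mask_logits, gt_mask):
    """Mean per-pixel binary cross-entropy on flattened masks."""
    total = 0.0
    for z, y in zip(mask_logits, gt_mask):
        # Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(gt_mask)


def dice_loss(mask_logits, gt_mask, eps=1.0):
    """Soft Dice loss on flattened masks (assumed; not stated in the thread)."""
    probs = [sigmoid(z) for z in mask_logits]
    inter = sum(p * y for p, y in zip(probs, gt_mask))
    denom = sum(probs) + sum(gt_mask)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def referring_loss(text_emb, proposal_embs, proposal_mask_logits, gt_mask):
    """Pick the top-scoring proposal for one text and score its mask against the ground truth."""
    k = select_proposal(text_emb, proposal_embs)
    logits = proposal_mask_logits[k]
    return k, bce_loss(logits, gt_mask) + dice_loss(logits, gt_mask)


# Toy example: 3 proposals, 2-d embeddings, 4-pixel flattened masks.
text_emb = [1.0, 0.0]
proposal_embs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.3]]
proposal_mask_logits = [
    [-4.0, -4.0, 4.0, 4.0],
    [4.0, 4.0, -4.0, -4.0],
    [0.0, 0.0, 0.0, 0.0],
]
gt_mask = [1, 1, 0, 0]

k, loss = referring_loss(text_emb, proposal_embs, proposal_mask_logits, gt_mask)
print(k, loss)
```

Under these assumptions only the selected proposal receives a mask loss for a given text, which is consistent with the "binary classification" reading suggested earlier in the thread: each pixel of that mask is classified as belonging to the referred region or not.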
yxchng commented

@MaureenZOU
1. Just to clarify: does "refcoco(g/+)" mean only refcoco+ and refcocog, with refcoco excluded?
2. What does this mean: "In addition, those images that do not have referring seg ground truth, we use instance segmentation as labels (e.g. person -> all person instance)"?