facebookresearch/segment-anything

Text prompt?

asetsuna opened this issue · 9 comments

Amazing work! However, I didn't find text prompt support. Is there any plan to release it?

From their FAQ.

[Screenshot of the SAM FAQ answer]

This is correct, the ability to take text prompts as input is not currently released.


So, is there a plan to release the text prompt capability?

Rocsg commented

The paper states that text can (theoretically) be used as a prompt, and briefly describes the procedure (page 22). It seems to involve retraining SAM with (image, text) embedding pairs computed by a CLIP model, and then, at inference, using the CLIP text encoder to produce the prompt directly, since CLIP aligns text embeddings with image embeddings. I'm not sure there is an easy way to unlock this functionality, especially if it involves retraining (I guess the provided .pth does not include that CLIP-aligned training).
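
For context, a rough sketch of how those pieces could wire together at inference, assuming a learned text-to-prompt projection that the released checkpoints do not contain (the `text_to_prompt` layer below is untrained and purely illustrative; the `segment_anything` and CLIP calls are the public ones):

```python
# Hedged sketch only: the released SAM checkpoints were NOT trained with text
# prompts, so `text_to_prompt` below is an untrained stand-in for the mapping
# the paper describes; its output masks would be meaningless without that
# training. The segment_anything and CLIP calls themselves are real.
import numpy as np
import torch
import clip
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)
clip_model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical: map a 512-d CLIP text embedding to SAM's 256-d prompt space.
text_to_prompt = torch.nn.Linear(512, 256).to(device)

image = np.zeros((480, 640, 3), dtype=np.uint8)    # replace with a real RGB image
predictor.set_image(image)
image_emb = predictor.features                      # (1, 256, 64, 64)

with torch.no_grad():
    tokens = clip.tokenize(["a cat"]).to(device)
    text_emb = clip_model.encode_text(tokens).float()   # (1, 512)
    sparse = text_to_prompt(text_emb).unsqueeze(1)       # (1, 1, 256) prompt token

    # "No mask" dense embedding, as SAM's prompt encoder would produce it.
    dense = sam.prompt_encoder.no_mask_embed.weight.reshape(1, -1, 1, 1).expand(
        1, -1, image_emb.shape[-2], image_emb.shape[-1]
    )
    low_res_masks, iou_pred = sam.mask_decoder(
        image_embeddings=image_emb,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=True,
    )
```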

botcs commented

@Rocsg I can imagine that, using only a few examples and a linear projection layer on both spaces, one could check whether the SAM feature space is aligned with the CLIP feature space or not. If there is no alignment, the matching error would still be high.
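
For what it's worth, a minimal sketch of such a probe, assuming mean-pooled SAM image embeddings and standard CLIP image embeddings for the same set of images (the pooling choice and the least-squares fit are just one way to set it up):

```python
# Rough probe sketch (not from the paper): fit a linear map from pooled SAM
# image embeddings to CLIP image embeddings on a handful of images and check
# how well it transfers to held-out images. High held-out error would suggest
# the two spaces are not trivially aligned.
import torch

def pooled_sam_embedding(predictor, image):
    """Mean-pool SAM's (1, 256, 64, 64) image embedding to a 256-d vector."""
    predictor.set_image(image)                              # image: HxWx3 uint8 RGB
    return predictor.features.mean(dim=(2, 3)).squeeze(0)   # (256,)

def clip_image_embedding(clip_model, preprocess, pil_image, device):
    """Standard CLIP image embedding (512-d for ViT-B/32)."""
    with torch.no_grad():
        x = preprocess(pil_image).unsqueeze(0).to(device)
        return clip_model.encode_image(x).float().squeeze(0)

def alignment_error(sam_feats, clip_feats, train_frac=0.8):
    """sam_feats: (N, 256), clip_feats: (N, 512) for the same N images."""
    n_train = int(train_frac * len(sam_feats))
    # Least-squares linear projection fitted on the first part of the data.
    W = torch.linalg.lstsq(sam_feats[:n_train], clip_feats[:n_train]).solution
    pred = sam_feats[n_train:] @ W
    cos = torch.nn.functional.cosine_similarity(pred, clip_feats[n_train:], dim=-1)
    return 1.0 - cos.mean().item()   # lower = better linear alignment
```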

We believe directly using the text features from CLIP is not a good idea, because CLIP's explainability is poor: text features that get matched to regions with the opposite semantics lead to wrong results.

This is our work about CLIP's explainability: https://github.com/xmed-lab/CLIP_Surgery

We can see that CLIP's self-attention links irrelevant regions, with seriously noisy activations across labels.
[Figures: CLIP self-attention maps showing noisy activations on irrelevant regions]

We suggest using the corrected heatmap to generate points that replace the manual input points (a sketch of the idea follows below).
These are our similarity maps from CLIP's raw predictions and the corresponding results on SAM.
[Figures: CLIP similarity maps and the corresponding SAM segmentation results]
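
A minimal sketch of that heatmap-to-points idea, assuming the per-pixel text-image similarity map has already been produced (e.g. by CLIP Surgery) and resized to the image resolution; the point-picking heuristic here is illustrative, not the repo's actual code:

```python
# Sketch of the heatmap -> points -> SAM idea (not the CLIP Surgery code
# itself): given a per-pixel text-image similarity map for the target class,
# take its strongest peaks as foreground points and feed them to SAM in place
# of manually clicked points.
import numpy as np
from segment_anything import SamPredictor

def points_from_heatmap(heatmap, num_points=3):
    """heatmap: (H, W) array aligned with the image; returns (N, 2) xy points."""
    flat = np.argsort(heatmap.ravel())[::-1][:num_points]
    ys, xs = np.unravel_index(flat, heatmap.shape)
    return np.stack([xs, ys], axis=1)   # SAM expects (x, y) order

def segment_with_text_heatmap(predictor: SamPredictor, image, heatmap):
    predictor.set_image(image)                        # image: HxWx3 uint8 RGB
    points = points_from_heatmap(heatmap)
    labels = np.ones(len(points), dtype=int)          # all points are foreground
    masks, scores, _ = predictor.predict(
        point_coords=points.astype(np.float32),
        point_labels=labels,
        multimask_output=True,
    )
    return masks[np.argmax(scores)]                   # keep the highest-scoring mask
```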

Besides, it's very simple: just use the original CLIP without any fine-tuning or extra supervision. It's also an alternative to the text->box->mask route, and it requires the least training and supervision cost.

Maybe they were waiting for DINOv2?

It looks like Grounded-SAM implements this => https://github.com/IDEA-Research/Grounded-Segment-Anything.

As pointed out by @jashvira, it is based on Grounding DINO, which itself builds on DINO.
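
A rough sketch of that text -> box -> mask route; `detect_boxes_from_text` is a hypothetical stand-in for Grounding DINO (see the Grounded-Segment-Anything repo for the real pipeline), while the SAM half uses the public SamPredictor API:

```python
# Sketch of the text -> box -> mask route that Grounded-SAM takes.
# `detect_boxes_from_text` is a hypothetical placeholder for a text-grounded
# detector such as Grounding DINO; the SAM calls below are the real API.
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

def detect_boxes_from_text(image, text):
    """Hypothetical: return (N, 4) xyxy pixel boxes for objects matching `text`."""
    raise NotImplementedError("plug in Grounding DINO or another open-set detector")

def text_to_masks(image, text, checkpoint="sam_vit_h_4b8939.pth"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint).to(device)
    predictor = SamPredictor(sam)
    predictor.set_image(image)                          # image: HxWx3 uint8 RGB

    boxes = detect_boxes_from_text(image, text)         # (N, 4) xyxy, pixels
    boxes_t = torch.as_tensor(boxes, dtype=torch.float, device=device)
    boxes_t = predictor.transform.apply_boxes_torch(boxes_t, image.shape[:2])

    masks, scores, _ = predictor.predict_torch(
        point_coords=None,
        point_labels=None,
        boxes=boxes_t,
        multimask_output=False,
    )
    return masks    # (N, 1, H, W) boolean masks, one per detected box
```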