Text prompt?
asetsuna opened this issue · 9 comments
Amazing work! However, I didn't find text prompt support. Is there any plan to release it?
This is correct, the ability to take text prompts as input is not currently released.
So, is there a plan to release the text prompt ability?
The paper states that text can (theoretically) be used as a prompt, and briefly describes the procedure (page 22). It seems to involve retraining SAM on pairs of (image, text) embeddings computed by a CLIP model, and then, at inference time, using the CLIP model to create the prompt for SAM directly, since CLIP aligns text embeddings with image embeddings. I'm not sure there is an easy way to unlock this functionality, especially if it involves retraining (I guess the provided .pth does not include the CLIP training).
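For illustration, here is a hedged sketch of what that inference path might look like. The projection layer `project` is an assumption: it would have to come from the retraining step the paper describes, and the released checkpoint does not include it, so this cannot work out of the box.

```python
# Hypothetical sketch of the paper's inference path: encode text with CLIP,
# project it into SAM's 256-d prompt space, and feed it to the mask decoder
# as an extra sparse prompt token. The `project` layer is assumed to have
# been learned during the retraining described in the paper; the released
# .pth does not contain it, so this is illustrative only.
import torch
import clip
from segment_anything import sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)

# Assumed: a linear map from CLIP's 512-d text space to SAM's 256-d prompt
# space, learned on (image, text) embedding pairs as the paper outlines.
project = torch.nn.Linear(512, 256).to(device)

with torch.no_grad():
    text_emb = clip_model.encode_text(clip.tokenize(["a dog"]).to(device)).float()
    sparse, dense = sam.prompt_encoder(points=None, boxes=None, masks=None)
    sparse = torch.cat([sparse, project(text_emb).unsqueeze(1)], dim=1)  # add a text token

    # `image_tensor` is assumed to be a preprocessed (1, 3, 1024, 1024) input.
    image_emb = sam.image_encoder(image_tensor)
    low_res_masks, iou_pred = sam.mask_decoder(
        image_embeddings=image_emb,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=False,
    )
```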
@Rocsg I can imagine that with only a few examples and a linear projection layer between the two spaces, one could check whether the SAM feature space is aligned with the CLIP feature space. If there is no alignment, the matching error would remain high.
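A minimal sketch of that probe, assuming `sam_feats` (N, 256) and `clip_feats` (N, 512) have already been extracted for the same N images (e.g. spatially averaged SAM image embeddings and CLIP image-encoder outputs). A high residual cosine distance after fitting would suggest the two spaces are not linearly aligned:

```python
# Fit a linear map from SAM features to CLIP features on a few paired
# examples and report the residual cosine distance.
import torch

def alignment_error(sam_feats, clip_feats, steps=500):
    proj = torch.nn.Linear(sam_feats.shape[1], clip_feats.shape[1])
    opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
    target = torch.nn.functional.normalize(clip_feats, dim=-1)
    for _ in range(steps):
        pred = torch.nn.functional.normalize(proj(sam_feats), dim=-1)
        loss = (1.0 - (pred * target).sum(dim=-1)).mean()  # mean cosine distance
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()  # high value => poor linear alignment
```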
We believe directly using the text features from CLIP is not a good idea, because CLIP's explainability is poor: text features that match opposite semantic regions lead to wrong results.
This is our work about CLIP's explainability: https://github.com/xmed-lab/CLIP_Surgery
And we can see that CLIP's self-attention links irrelevant regions, with serious noisy activations across labels.
We suggest using the corrected heatmap to generate points, replacing the manually input points.
These are our similarity maps from CLIP's raw predictions, together with the resulting outputs on SAM.
Besides, it's very simple: just use the original CLIP without any fine-tuning or extra supervision. It's also another solution besides text->box->mask, and it requires the least training and supervision cost.
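A sketch of that pipeline, assuming `image` (H, W, 3 uint8) and a text-image similarity `heatmap` (H, W, values in [0, 1], e.g. from CLIP Surgery, resized to the image resolution) are already computed. Top-similarity pixels become positive point prompts for SAM; bottom ones become negatives:

```python
# Turn a text-image similarity heatmap into point prompts for SAM,
# replacing manual clicks.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

flat = heatmap.ravel()
top = np.argsort(flat)[-5:]      # 5 most similar pixels -> positive points
bottom = np.argsort(flat)[:5]    # 5 least similar pixels -> negative points

def to_xy(flat_idx):
    ys, xs = np.unravel_index(flat_idx, heatmap.shape)
    return np.stack([xs, ys], axis=1)  # SAM expects (x, y) coordinates

points = np.concatenate([to_xy(top), to_xy(bottom)])
labels = np.array([1] * 5 + [0] * 5)
masks, scores, _ = predictor.predict(
    point_coords=points, point_labels=labels, multimask_output=False)
```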
https://github.com/luca-medeiros/lang-segment-anything
This project looks interesting.
Maybe they were waiting for DINOv2?
It looks like Grounded-SAM implements this => https://github.com/IDEA-Research/Grounded-Segment-Anything.
As pointed out by @jashvira, it is based on Grounding DINO, which itself builds on DINO.