SysCV/sam-hq

Question: Is the SAM-HQ model applicable for predicting segmentation masks for input images without boxes, points, or labels?

mzg0108 opened this issue · 2 comments

If I understand it correctly, both SAM and SAM-HQ take boxes, input points, and point labels (foreground/background) as input along with the input image.
What about input images for which none of this information is available?

If we want to take the human completely out of the loop and have the model take only the image as input and predict the masks, what changes do we need to make to the model?

You can use the everything mode as demonstrated here, which feeds uniformly sampled points on the image as prompts.
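For anyone landing here, this is roughly what everything mode looks like in code — a minimal sketch using `SamAutomaticMaskGenerator` from the `segment_anything` package, assuming the checkpoint path and image file are placeholders for your own:

```python
# Sketch of "everything mode": no human-provided prompts, just an image.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_hq_vit_h.pth")  # hypothetical path
sam.to("cuda")

mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # density of the uniform point grid
    pred_iou_thresh=0.88,          # keep masks the model itself scores highly
    stability_score_thresh=0.95,   # drop masks that are unstable to thresholding
)

image = cv2.imread("input.jpg")    # hypothetical image
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

masks = mask_generator.generate(image)  # list of dicts, one per predicted mask
print(len(masks), masks[0]["segmentation"].shape, masks[0]["predicted_iou"])
```

Each entry in `masks` carries the binary `segmentation` array plus metadata such as `predicted_iou` and the `point_coords` that prompted it, so no manual prompts are needed.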

If I'm not mistaken, you still need the prompt encoder to turn prompts into embeddings before the model can produce a mask. "Automatic mask generator" is actually a bit misleading, as it just places a point prompt every 20 pixels or so across the image. For each point prompt, the model predicts candidate masks together with an IoU score, and the most probable mask (or the top 3) is kept as the output. A sketch of this under-the-hood loop follows below.
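A rough sketch of that internal loop, done by hand with `SamPredictor`: prompt the model with a uniform grid of single foreground points and keep only masks whose predicted IoU clears a threshold. The checkpoint path, image file, step size, and threshold are illustrative assumptions, not the generator's exact internals (which also apply stability filtering and non-maximum suppression):

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_hq_vit_h.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # image embedding is computed once and reused

h, w = image.shape[:2]
step = 20   # one point prompt every ~20 pixels, as described above
kept = []
for y in range(step // 2, h, step):
    for x in range(step // 2, w, step):
        masks, scores, _ = predictor.predict(
            point_coords=np.array([[x, y]]),
            point_labels=np.array([1]),   # 1 = foreground point
            multimask_output=True,        # up to 3 candidate masks per point
        )
        best = scores.argmax()            # pick the most probable candidate
        if scores[best] > 0.88:           # filter by the model's predicted IoU
            kept.append(masks[best])
print(f"kept {len(kept)} masks before deduplication")
```

So the model never runs "promptlessly": automation just means the prompts are generated mechanically instead of by a human.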