Seemingly missing reshaping operation in point prompt encoding in PromptEncoder

Question

Opened this issue 2 months ago · 0 comments

https://github.com/facebookresearch/segment-anything/blob/main/segment_anything/modeling/prompt_encoder.py#L81-L85

If I read it correctly, the points input has shape [B, N, 2], where B is the batch size and N is the number of points per image. My understanding is that the padding is meant to make a point prompt carry the 2D coordinates of two points, so that it is compatible with the box prompt. But without a reshaping operation before the torch.cat call, wouldn't the shape simply become [B, N + 1, 2] after the padding? This doesn't feel right. Since this PromptEncoder is also used in SAM2, it seems to affect both models.
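
To make the shape question concrete, here is a minimal sketch of the arithmetic I have in mind. It assumes the padding is a single [B, 1, 2] point concatenated along dim=1, which is my reading of the linked lines and may be wrong. The helper `cat_shape` is hypothetical and not part of the repository; it only reproduces the shape rule of torch.cat (all dimensions except the concatenation dimension must match, and sizes along that dimension are summed) in plain Python.

```python
from typing import Sequence, Tuple


def cat_shape(shapes: Sequence[Tuple[int, ...]], dim: int) -> Tuple[int, ...]:
    """Shape that results from concatenating tensors of `shapes` along `dim`.

    Hypothetical helper: mirrors the shape rule of torch.cat without
    needing torch installed.
    """
    first = shapes[0]
    for shape in shapes[1:]:
        if len(shape) != len(first):
            raise ValueError("all shapes must have the same number of dimensions")
        for axis, (a, b) in enumerate(zip(first, shape)):
            # Every axis except the concatenation axis has to agree.
            if axis != dim and a != b:
                raise ValueError(f"size mismatch on axis {axis}: {a} vs {b}")
    out = list(first)
    out[dim] = sum(shape[dim] for shape in shapes)
    return tuple(out)


B, N = 4, 3                # arbitrary example values
points_shape = (B, N, 2)   # point prompts, as I read the input
padding_shape = (B, 1, 2)  # ASSUMED shape of the padding point
padded_shape = cat_shape([points_shape, padding_shape], dim=1)
print(padded_shape)        # (4, 4, 2), i.e. [B, N + 1, 2]
```

Under these assumptions the padded points keep their last dimension of 2 and only gain one extra entry along the point dimension, which is the [B, N + 1, 2] shape I am asking about.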

Please correct me if I have misunderstood any part of this.

Thank you!