🍇 [Read our arXiv Paper] 🍎 [Project Page]
🔥 We present Set-of-Mark (SoM), which simply overlays a set of spatial, speakable marks on an image to unleash the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V.
Users can select the granularity of the masks to generate and choose between the automatic (top) and interactive (bottom) modes. A higher alpha-blending value (0.4) is used for better visualization.
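The alpha-blended overlay can be sketched as follows. This is a minimal NumPy sketch, not the repository's actual rendering code; the function name and array conventions are assumptions for illustration:

```python
import numpy as np

def overlay_mask(image, mask, color, alpha=0.4):
    """Alpha-blend a colored mask region onto an RGB image.

    image: (H, W, 3) float array with values in [0, 1]
    mask:  (H, W) boolean array marking the region to highlight
    color: length-3 RGB color in [0, 1]
    alpha: weight given to the mask color (0.4, as in the figures above)
    """
    out = image.copy()
    # Blend only the masked pixels: higher alpha makes the mark more visible.
    out[mask] = (1 - alpha) * image[mask] + alpha * np.asarray(color)
    return out
```

A mark index would then be drawn on top of each blended region (e.g., at the region's center) so the model can refer to it by number.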
SoM enables interleaved prompts that mix textual and visual content; the visual content can be referenced by its region indices.
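An interleaved prompt might be assembled like this. This is a hypothetical sketch: the helper name and the chat-completions payload shape are assumptions for illustration, not code from this repository:

```python
import base64

def build_som_prompt(marked_image_bytes, question):
    """Assemble an interleaved text+image message in the OpenAI
    chat-completions content-parts format. The question refers to
    regions by the mark indices overlaid on the image, e.g. 'region 3'.
    """
    b64 = base64.b64encode(marked_image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]

# The marks make "region 3" a grounded, speakable reference.
messages = build_som_prompt(b"...", "What is the object at region 3 made of?")
```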
![Screenshot 2023-10-18 at 10 12 18](https://private-user-images.githubusercontent.com/34880758/276056435-f5e0c0b0-58de-4b60-bf01-4906dbcb229e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQwMDI1OTEsIm5iZiI6MTcxNDAwMjI5MSwicGF0aCI6Ii8zNDg4MDc1OC8yNzYwNTY0MzUtZjVlMGMwYjAtNThkZS00YjYwLWJmMDEtNDkwNmRiY2IyMjllLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDI0VDIzNDQ1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZjMGM3NWMzOGM2ODhjYzg4ODlmMTZhMjQ0MmFlOTdkZDNhNDAzMDlhNWM3YmU1YjRhNGMwN2EzYmY4YzQ2ZGMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.pfPl_d--Bl09AD4h2asVXzKPRimo3y1bIlNpSOIUP54)
![Screenshot 2023-10-18 at 10 10 41](https://private-user-images.githubusercontent.com/34880758/276055888-033cd16c-876c-4c03-961e-590a4189bc9e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQwMDI1OTEsIm5iZiI6MTcxNDAwMjI5MSwicGF0aCI6Ii8zNDg4MDc1OC8yNzYwNTU4ODgtMDMzY2QxNmMtODc2Yy00YzAzLTk2MWUtNTkwYTQxODliYzllLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDI0VDIzNDQ1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTk1MGMyN2ZlNDUyZGRjYmVhMjM2M2RlY2JiMzFkZmFiNjJkMjNlNzY0ODE2ZWMwZjUyOGY3ZWRjOWY3Yjk1MTcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.VK1HIWfOqgjyDsGovNDNnjAcL670z9vBOeRxSAhRcpE)
In comparison to GPT-4V without SoM, adding marks enables GPT-4V to ground its reasoning in the detailed contents of the image (left). Clear cross-image object references are observed on the right.
![Screenshot 2023-10-18 at 10 18 03](https://private-user-images.githubusercontent.com/34880758/276057292-8b112126-d164-47d7-b18c-b4b51b903d57.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQwMDI1OTEsIm5iZiI6MTcxNDAwMjI5MSwicGF0aCI6Ii8zNDg4MDc1OC8yNzYwNTcyOTItOGIxMTIxMjYtZDE2NC00N2Q3LWIxOGMtYjRiNTFiOTAzZDU3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDI0VDIzNDQ1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWViZmE2NTJiMzA3MDkzMTNkYzNkNTc5MWMxZWUwMjRmZTg0ODgzMTFlM2MwYjExODE0NDg4MDE5ZmQ3M2Q2YmMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.NzHahmGnPvPVTcizgJhhcc3O2-4Ij-r2UCsZz6TewcY)
Case study on solving a CAPTCHA. GPT-4V gives a wrong answer with the wrong number of squares, while after SoM prompting it finds the correct squares with their corresponding marks.
![Screenshot 2023-10-18 at 10 18 44](https://private-user-images.githubusercontent.com/34880758/276057392-dc753c3f-ada8-47a4-83f1-1576bcfb146a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQwMDI1OTEsIm5iZiI6MTcxNDAwMjI5MSwicGF0aCI6Ii8zNDg4MDc1OC8yNzYwNTczOTItZGM3NTNjM2YtYWRhOC00N2E0LTgzZjEtMTU3NmJjZmIxNDZhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDI0VDIzNDQ1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVkMjk0ODFlNDkwZWY0Nzc4MGU1Y2JhMDVlNWMyNWJlOTM3MTI5MDVlNTZjYjY2OWZlY2U2YmYyYzVjZWI4ZTEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.O1-iaM343bBvP7p-wVtcE2bbaSu5mR9STar2jDRyv8I)
Case study on an image of a dish for GPT-4V. GPT-4V does not produce a grounded answer with the original image. With SoM prompting, GPT-4V not only names the ingredients but also grounds each one to its region.
![Screenshot 2023-10-18 at 10 19 12](https://private-user-images.githubusercontent.com/34880758/276057456-88188c90-84f2-49c6-812e-44770b0c2ca5.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQwMDI1OTEsIm5iZiI6MTcxNDAwMjI5MSwicGF0aCI6Ii8zNDg4MDc1OC8yNzYwNTc0NTYtODgxODhjOTAtODRmMi00OWM2LTgxMmUtNDQ3NzBiMGMyY2E1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDI0VDIzNDQ1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg1ZTM3ZDRmZDhlNDk0YjE5MGVkODFiMjU2NGRiNjQ5ZjAyNDljZDllYzJlZjZlZWI4MmVlYmYzYzdkZGRmYzEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0._BAQE6D6vuCMUlXKAhvoWRj0VV8k9sGZ2khKN2h5CxE)
SoM-prompted GPT-4V gives very precise suggestions, while the original fails and even hallucinates foods, e.g., soft drinks.
![Screenshot 2023-10-18 at 10 19 39](https://private-user-images.githubusercontent.com/34880758/276057520-9b35b143-96af-41bd-ad83-9c1f1e0f322f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQwMDI1OTEsIm5iZiI6MTcxNDAwMjI5MSwicGF0aCI6Ii8zNDg4MDc1OC8yNzYwNTc1MjAtOWIzNWIxNDMtOTZhZi00MWJkLWFkODMtOWMxZjFlMGYzMjJmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDI0VDIzNDQ1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI0MzEyNDQwYjI3MDZlNWNhMjg4MGQ3YTc3YWYyN2VkNGQwMGU0MGE0MGEwZTA3YTFiOTgxN2M1YjkyZjZlY2ImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.dSplb_nQAYeaoNL9adC5O-FK4HHFZskgNnsbzu7EDfw)
![Screenshot 2023-10-18 at 10 20 03](https://private-user-images.githubusercontent.com/34880758/276057600-0bc86109-5512-4dee-aac9-bab0ef96ed4c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQwMDI1OTEsIm5iZiI6MTcxNDAwMjI5MSwicGF0aCI6Ii8zNDg4MDc1OC8yNzYwNTc2MDAtMGJjODYxMDktNTUxMi00ZGVlLWFhYzktYmFiMGVmOTZlZDRjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDI0VDIzNDQ1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWE1ODBhNjkzOTUwODZkMzM3OTAwZmNkODA5NDJhZGMzYWZhM2UyMTdjOTZmY2ZiZmIwYmM0YTQ1ZGM1YmUyZmMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.NcUxcZagoT5e7YXAgWJ7jP0lHzhHrMjmKrc-OE4fbN4)
GPT-4V with SoM gives a reasonable suggestion on how to achieve a goal in a gaming scenario.
![Screenshot 2023-10-18 at 10 21 24](https://private-user-images.githubusercontent.com/34880758/276057806-7f139250-5350-4790-a35c-444ec2ec883b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTQwMDI1OTEsIm5iZiI6MTcxNDAwMjI5MSwicGF0aCI6Ii8zNDg4MDc1OC8yNzYwNTc4MDYtN2YxMzkyNTAtNTM1MC00NzkwLWEzNWMtNDQ0ZWMyZWM4ODNiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MjQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDI0VDIzNDQ1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ2Mjc1MTU5ZjI2NmMwZmIyNmMzNDBlMTViZWVjZWYwZTA0YjA0YzE1ZDZiZDRmOGZiOGVhOGZlMDFjZTA3NzQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.m7ncgRbPbPJytol9q6K3EjAdQac1n7aIV0rrL56tLYs)
We conduct experiments on various vision tasks to verify the effectiveness of SoM. Results show that GPT-4V+SoM outperforms specialist models on most vision tasks and is comparable to MaskDINO on COCO panoptic segmentation.
Our method uses the following models to propose masks:
- Mask DINO
- SEEM
- Semantic-SAM
- Segment Anything for the SA-1B data.
We also thank GPT-4V for providing a strong foundational model!
If you find our work helpful for your research, please consider citing the following BibTeX entry:
```bibtex
@article{yang2023setofmark,
  title={Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V},
  author={Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chunyuan Li and Jianfeng Gao},
  journal={arXiv preprint arXiv:2310.11441},
  year={2023}
}
```