Empowering Multimodal LLMs with Set-of-Mark Prompting and Improved Visual Reasoning Ability.
Primary Language: Python