This paper has been accepted to IEEE UEMCON 2024.
This source code is inspired by Koh et al.: https://github.com/kohjingyu/fromage
@inproceedings{koh2023grounding,
  title={Grounding Language Models to Images for Multimodal Inputs and Outputs},
  author={Koh, Jing Yu and Salakhutdinov, Ruslan and Fried, Daniel},
  booktitle={ICML},
  year={2023}
}