Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

Project Page | Paper

We propose Cantor, a multimodal chain-of-thought (CoT) framework with a perceptual decision architecture that integrates visual context and logical reasoning to solve visual reasoning tasks.

[Figure: overview of the Cantor framework]

Getting Started

1. Installation

Clone the repository and install the Gemini Python SDK:

git clone https://github.com/ggg0919/cantor
cd cantor
pip install -q -U google-generativeai

2. Run Cantor Demo

python3 demo.py --query "Which month is the hottest on average in Detroit?" --image_path ./images/image.png --api_key "YOUR_GEMINI_API_KEY"

--query: Question to ask about the image
--image_path: Path to the input image
--api_key: Your Gemini API key
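
For readers who want to wrap or script the demo, the three flags above can be mirrored with a small argument parser. This is a minimal sketch and not the actual contents of demo.py: the function names `build_parser` and `parse_cli` are hypothetical, and treating all three flags as required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three documented demo flags."""
    parser = argparse.ArgumentParser(
        description="Cantor demo arguments (illustrative sketch, not demo.py itself)"
    )
    # Assumption: every flag is required; the real script may define defaults.
    parser.add_argument("--query", required=True, help="Question to ask about the image")
    parser.add_argument("--image_path", required=True, help="Path to the input image")
    parser.add_argument("--api_key", required=True, help="Your Gemini API key")
    return parser


def parse_cli(argv=None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:]) into a namespace with the three fields."""
    return build_parser().parse_args(argv)
```

Calling `parse_cli(["--query", "...", "--image_path", "./images/image.png", "--api_key", "..."])` returns a namespace with `query`, `image_path`, and `api_key` attributes. Omitting any of the flags makes argparse exit with a usage message.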

ToDo

  • Release the data and evaluation code on ScienceQA.
  • Release the data and evaluation code on MathVista.

Cases

[Figure: example cases]

Citation

@article{gao2024cantor,
  title={Cantor: Inspiring Multimodal Chain-of-Thought of MLLM},
  author={Gao, Timin and Chen, Peixian and Zhang, Mengdan and Fu, Chaoyou and Shen, Yunhang and Zhang, Yan and Zhang, Shengchuan and Zheng, Xiawu and Sun, Xing and Cao, Liujuan and Ji, Rongrong},
  journal={arXiv preprint arXiv:2404.16033},
  year={2024}
}