
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

[Paper] [Conference]

Wenhao Wu1,2, Huanjin Yao2,3, Mengxi Zhang2,4, Yuxin Song2, Wanli Ouyang5, Jingdong Wang2

1The University of Sydney, 2Baidu, 3Tsinghua University, 4Tianjin University, 5The Chinese University of Hong Kong


This work examines an essential, must-know baseline in light of the latest advances in Generative Artificial Intelligence (GenAI): using GPT-4 for visual understanding. We focus on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. To ensure a comprehensive evaluation, we conduct experiments across three modalities (images, videos, and point clouds), spanning a total of 16 popular academic benchmarks.

📣 I also have other cross-modal projects that may interest you ✨.

Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
Accepted by AAAI 2023 & IJCV 2023 | [Text4Vis Code]
Wenhao Wu, Zhun Sun, Yuxin Song, Jingdong Wang, Wanli Ouyang

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Accepted by CVPR 2023 | [BIKE Code]
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Accepted by CVPR 2023 as 🌟Highlight🌟 | [Cap4Video Code]
Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang

News

  • [Nov 28, 2023] We released our report on arXiv.
  • [Nov 27, 2023] Our prompts have been released. Thanks for your star 😝.

Overview

An overview of the 16 popular benchmark datasets evaluated, comprising images, videos, and point clouds.

Zero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.

Generated Descriptions from GPT-4

  • We have pre-generated descriptive sentences for all the categories across the datasets, which you can find in the GPT_generated_prompts folder. Enjoy exploring!

  • We also provide an example script, generate_prompt.py, to help you generate descriptions with GPT-4. Happy coding! For detailed information on all datasets used in our project, please refer to the config folder.

  • Execute the following command to generate descriptions with GPT-4 (a minimal sketch of the underlying API call is shown after the command below).

    # To run the script for a specific dataset, update the following line
    # with the name of the dataset you're working with:
    # dataset_name = ["Dataset Name Here"]   # e.g., dtd
    python generate_prompt.py
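
For orientation, here is a minimal sketch of what such a description-generation call can look like with the official openai Python package (v1.x). The model name, prompt wording, class list, and output path are illustrative assumptions, not the exact choices in generate_prompt.py; consult that script and the config folder for the real setup.

    # Hedged sketch: generating per-class descriptions with GPT-4 (assumes openai>=1.0).
    # The prompt template, classnames, and output path below are illustrative only.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    dataset_name = "dtd"                           # e.g., Describable Textures Dataset
    classnames = ["banded", "blotchy", "braided"]  # normally loaded from the config folder

    descriptions = {}
    for classname in classnames:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Describe what a '{classname}' texture looks like in one sentence.",
            }],
        )
        descriptions[classname] = response.choices[0].message.content

    # Save in a simple JSON layout (illustrative; the repo's own files live
    # in the GPT_generated_prompts folder).
    with open(f"{dataset_name}_descriptions.json", "w") as f:
        json.dump(descriptions, f, indent=2)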

GPT-4V(ision) for Visual Recognition

  • We share an example script that demonstrates how to use the GPT-4V API for zero-shot predictions on the DTD dataset. Please refer to the GPT4V_ZS.py file for a step-by-step guide. We hope it helps you get started with ease! A minimal sketch of the core API call appears after the commands below.

    # GPT4V zero-shot recognition script. 
    # dataset_name = ["Dataset Name Here"]   # e.g., dtd
    python GPT4V_ZS.py
    
    # We also provide a script that batches multiple samples into each request
    # (larger batch sizes may lead to instability).
    python GPT4V_ZS_batch.py
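
Conceptually, the per-image GPT-4V call reduces to three steps: encode the image as base64, send it together with the candidate category names, and read the predicted label from the reply. The sketch below illustrates this pattern with the openai v1.x package; the prompt wording, image path, and class list are assumptions for illustration, and GPT4V_ZS.py remains the reference implementation.

    # Hedged sketch: GPT-4V zero-shot prediction for a single image (assumes openai>=1.0).
    # Prompt text, image path, and classnames are illustrative placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI()
    classnames = ["banded", "blotchy", "braided"]  # DTD categories, normally from config

    # Encode the query image as a base64 data URL for the vision API.
    with open("example.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Which texture category best matches this image? "
                         "Answer with exactly one of: " + ", ".join(classnames)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=20,
    )
    print(response.choices[0].message.content)  # e.g., "banded"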

Requirements

For guidance on setting up and running the GPT-4 API, we recommend the official OpenAI Quickstart Guide.
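
Once the quickstart steps are done (install the openai package and export an API key), a short sanity check like the one below, which simply assumes openai>=1.0 and an OPENAI_API_KEY environment variable, can confirm the API is reachable before launching the full scripts.

    # Hedged sanity check for API access (assumes `pip install openai` and
    # an exported OPENAI_API_KEY; not part of the repository's scripts).
    from openai import OpenAI

    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Reply with the single word: ready"}],
        max_tokens=5,
    )
    print(reply.choices[0].message.content)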

📌 BibTeX & Citation

If you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.

@article{GPT4Vis,
  title={GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?},
  author={Wu, Wenhao and Yao, Huanjin and Zhang, Mengxi and Song, Yuxin and Ouyang, Wanli and Wang, Jingdong},
  journal={arXiv preprint arXiv:2311.15732},
  year={2023}
}

🎗️ Acknowledgement

This evaluation builds on these excellent works:

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision
  • GPT-4
  • Text4Vis: Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective

We extend our sincere gratitude to these contributors.

👫 Contact

For any questions, please feel free to file an issue.