
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?


If you like our project, please give us a star ⭐ on GitHub for the latest updates.


Wenhao Wu1,2, Huanjin Yao2,3, Mengxi Zhang2,4, Yuxin Song2, Wanli Ouyang5, Jingdong Wang2

1The University of Sydney, 2Baidu, 3Tsinghua University, 4Tianjin University, 5The Chinese University of Hong Kong


This work delves into an essential, must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the use of GPT-4 for visual understanding. We focus on evaluating GPT-4's linguistic and visual capabilities on zero-shot visual recognition tasks. To ensure a comprehensive evaluation, we conduct experiments across three modalities (images, videos, and point clouds), spanning a total of 16 popular academic benchmarks.

📣 I also have other cross-modal projects that may interest you ✨.

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Wenhao Wu, Zhun Sun, Wanli Ouyang
Conference Journal github

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
Conference github

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
Accepted by CVPR 2023 as 🌟Highlight🌟 | Conference github

News

  • [Mar 7, 2024] Due to the recent removal of RPD (requests per day) limits on the GPT-4V API, we've updated our predictions for all datasets using standard single-sample testing (one sample per request). Check out the GPT4V Results, Ground Truth and Datasets we've shared for you! As a heads-up, 😭running all tests once costs around 💰$4000+💰.
  • [Nov 28, 2023] We released our report on arXiv.
  • [Nov 27, 2023] Our prompts have been released. Thanks for your star 😝.

Overview

An overview of the 16 popular benchmark datasets evaluated, spanning images, videos, and point clouds.

Zero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.

Generated Descriptions from GPT-4

  • We have pre-generated descriptive sentences for all the categories across the datasets, which you can find in the GPT_generated_prompts folder. Enjoy exploring!

  • We've also provided an example script to help you generate descriptions using GPT-4; see the generate_prompt.py file for guidance. Detailed settings for all datasets used in our project are in the config folder. Happy coding!

  • Execute the following command to generate descriptions with GPT-4.

    # To run the script for a specific dataset, update the following line with the name of the dataset you're working with:
    # dataset_name = ["Dataset Name Here"]   # e.g., dtd
    python generate_prompt.py
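
  • For illustration, here is a minimal sketch of what such a description-generation step can look like. It is not the exact generate_prompt.py logic: it assumes the openai Python SDK (v1.x), an OPENAI_API_KEY in your environment, and a hypothetical three-category subset of DTD; the real category lists and prompt wording live in generate_prompt.py and the config folder.

    # Minimal sketch of GPT-4 description generation (not the exact generate_prompt.py logic).
    # Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    dataset_name = "dtd"                              # e.g., dtd
    categories = ["banded", "blotchy", "braided"]     # hypothetical subset; real lists are in the config folder

    descriptions = {}
    for category in categories:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Describe the visual appearance of the texture '{category}' in one sentence.",
            }],
        )
        descriptions[category] = response.choices[0].message.content

    print(descriptions)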

GPT-4V(ision) for Visual Recognition

  • We share an example script that demonstrates how to use the GPT-4V API for zero-shot predictions on the DTD dataset. Please refer to the GPT4V_ZS.py file for a step-by-step guide; an illustrative sketch of a single request is also shown at the end of this section. We hope it helps you get started with ease!

    # GPT4V zero-shot recognition script. 
    # dataset_name = ["Dataset Name Here"]   # e.g., dtd
    python GPT4V_ZS.py
  • All results are available in the GPT4V_ZS_Results folder! In addition, we've provided the Datasets link along with the corresponding ground truths (annotations folder) to help readers replicate the results. Note: for certain datasets, we may have removed prefixes from the sample IDs; for instance, in the case of ImageNet, "ILSVRC2012_val_00031094.JPEG" was modified to "00031094.JPEG".

Dataset          DTD    EuroSAT  SUN397  RAF-DB  Caltech101  ImageNet-1K  FGVC-Aircraft  Flower102
Top-1 Acc. (%)   57.7   46.8     59.2    68.7    93.7        63.1         56.6           69.1
Ground Truth     Label  Label    Label   Label   Label       Label        Label          Label

Dataset          Stanford Cars  Food101  Oxford Pets  UCF-101  HMDB-51  Kinetics-400  ModelNet-10
Top-1 Acc. (%)   62.7           86.2     90.8         83.7     58.8     58.8          66.9
Ground Truth     Label          Label    Label        Label    Label    Label         Label
  • With the provided prediction and annotation files, you can reproduce our top-1/top-5 accuracy results with the calculate_acc.py script.

    # pred_json_path = 'GPT4V_ZS_Results/imagenet.json'
    # gt_json_path = 'annotations/imagenet_gt.json'
    python calculate_acc.py
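
  • For readers who want a feel for the mechanics before diving into GPT4V_ZS.py, here is the illustrative sketch of a single GPT-4V zero-shot request mentioned above. It is not the exact script: the image path and category subset are hypothetical placeholders, and it assumes the openai Python SDK (v1.x) with OPENAI_API_KEY set; adjust the model name to whatever vision-capable model your account exposes.

    # Minimal sketch of one GPT-4V zero-shot request (not the exact GPT4V_ZS.py logic).
    # Assumes the openai Python SDK (v1.x), OPENAI_API_KEY in the environment, and a local test image.
    import base64
    from openai import OpenAI

    client = OpenAI()

    image_path = "images/banded/banded_0002.jpg"      # hypothetical sample path
    categories = ["banded", "blotchy", "braided"]      # hypothetical subset of DTD classes

    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",                   # GPT-4V preview model name; may differ for your account
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Which of the following texture categories best matches this image? "
                         "Answer with the category name only: " + ", ".join(categories)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=20,
    )
    print(response.choices[0].message.content)          # predicted category name

  • Likewise, below is a rough sketch of a top-1 accuracy computation. It assumes both JSON files simply map sample IDs to category names, which may differ from the released format; calculate_acc.py is the authoritative reference, including the top-5 computation.

    # Rough sketch of top-1 accuracy from prediction and ground-truth JSON files.
    # Assumes both files map sample IDs to category names (see calculate_acc.py for the exact format).
    import json

    pred_json_path = "GPT4V_ZS_Results/imagenet.json"
    gt_json_path = "annotations/imagenet_gt.json"

    with open(pred_json_path) as f:
        preds = json.load(f)
    with open(gt_json_path) as f:
        gts = json.load(f)

    correct = sum(
        str(preds.get(sample_id, "")).strip().lower() == str(gt_label).strip().lower()
        for sample_id, gt_label in gts.items()
    )
    print(f"Top-1 accuracy: {100.0 * correct / len(gts):.1f}%")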

Requirement

For guidance on setting up and running the GPT-4 API, we recommend checking out the official OpenAI Quickstart documentation available at: OpenAI Quickstart Guide.
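
As a quick sanity check before running the scripts, the snippet below (a minimal sketch, assuming the openai Python SDK is installed via pip and OPENAI_API_KEY is exported as described in the Quickstart) verifies that your key is picked up.

    # Quick environment check (a sketch; assumes `pip install openai` and an exported OPENAI_API_KEY).
    import os
    from openai import OpenAI

    assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY as described in the OpenAI Quickstart."
    client = OpenAI()
    print(client.models.list().data[0].id)   # prints an available model name if the key works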

📌 BibTeX & Citation

If you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.

@article{GPT4Vis,
  title={GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?},
  author={Wu, Wenhao and Yao, Huanjin and Zhang, Mengxi and Song, Yuxin and Ouyang, Wanli and Wang, Jingdong},
  journal={arXiv preprint arXiv:2311.15732},
  year={2023}
}

🎗️ Acknowledgement

This evaluation builds on the following excellent works:

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision
  • GPT-4
  • Text4Vis: Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective

We extend our sincere gratitude to these contributors.

👫 Contact

For any questions, please feel free to file an issue.