MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems

Dataset Description

MMCode is a multi-modal code generation dataset designed to evaluate the problem-solving skills of code language models in visually rich contexts. It contains 3,548 questions paired with 6,622 images, derived from real-world programming challenges on 10 code competition websites, with Python solutions and tests provided. The problems are characterized by a heavy demand for reasoning, tightly interwoven textual and visual content, and questions that contain multiple images.

For a more detailed introduction to the data, please see the 🤗 Hugging Face dataset page.

Getting Started

Set Up

Before you begin, make sure the following environment variables are set:

OPENAI_API_KEY: Your OpenAI API key.
GOOGLE_API_KEY: Your Google API key.
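
A quick pre-flight check can catch a missing key before a long generation run starts. The sketch below is only an illustration: `missing_env_vars` is a hypothetical helper, not part of this repository, and it assumes both keys are wanted, although a given model may only need one of them.

```python
import os

# Variable names taken from the Set Up list above.
REQUIRED_VARS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    # Hypothetical helper: defaults to the real process environment.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All listed environment variables are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.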

Inference

To generate solutions with GPT-4V, run:

python generate.py \
    --model gpt4v \
    --problems_root <path_to_the_test_set> \
    --save_path "results/gpt4v-mmcode_test.jsonl"
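
Because the results are saved as a `.jsonl` file, it can help to sanity-check the output before evaluating. The following is a minimal sketch, assuming the file follows the usual JSON Lines layout of one JSON value per line. The record fields are not documented here, so the code only counts records and lists whatever keys it finds; `load_jsonl` is a hypothetical helper, not part of this repository.

```python
import json
import os


def load_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the generation command above.
    results_path = "results/gpt4v-mmcode_test.jsonl"
    if os.path.exists(results_path):
        records = load_jsonl(results_path)
        print(f"{len(records)} records")
        if records and isinstance(records[0], dict):
            # Field names are an unknown here, so just report what is present.
            print("Fields in first record:", sorted(records[0].keys()))
    else:
        print(f"No results file found at {results_path}")
```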

Evaluation

To evaluate the results generated by GPT-4V, run:

python eval.py \
    --problems_root <path_to_the_test_set> \
    --generation_file "results/gpt4v-mmcode_test.jsonl"
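
The two steps share the test-set path and the results file, so a small driver can keep them in sync. This is a sketch only: `build_commands` is a hypothetical helper, `path/to/test_set` is a placeholder, and the only flags used are the ones shown in the commands above.

```python
import shlex
import sys


def build_commands(problems_root, results_file, model="gpt4v"):
    """Return argv lists for the generation and evaluation steps."""
    generate = [
        sys.executable, "generate.py",
        "--model", model,
        "--problems_root", problems_root,
        "--save_path", results_file,
    ]
    evaluate = [
        sys.executable, "eval.py",
        "--problems_root", problems_root,
        "--generation_file", results_file,
    ]
    return generate, evaluate


if __name__ == "__main__":
    # Placeholder path: replace with the real location of the test set.
    commands = build_commands("path/to/test_set", "results/gpt4v-mmcode_test.jsonl")
    for command in commands:
        # Print only; to execute, pass each list to subprocess.run(command, check=True).
        print(shlex.join(command))
```

Building the commands as argument lists rather than one shell string avoids quoting problems when a path contains spaces.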

Citation

If you find our work useful, please consider citing our paper:

@misc{li2024mmcode,
      title={MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems}, 
      author={Kaixin Li and Yuchen Tian and Qisheng Hu and Ziyang Luo and Jing Ma},
      year={2024},
      eprint={2404.09486},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}