How Do LMMs Perform on Video Quality Understanding?
We introduce Q-Bench-Video, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality.
A. To ensure the diversity of video sources, Q-Bench-Video encompasses videos from natural scenes, computer graphics (CG), and AI-generated content (AIGC).
B. Building on the traditional multiple-choice question format (the Yes-or-No and What-How categories), we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate video pair quality comparison questions to enhance comprehensiveness.
C. Beyond the traditional Technical, Aesthetic, and Temporal distortions, we expand our evaluation to the dimension of AIGC distortions, which addresses the increasing demand for video generation.
Finally, we collect a total of 2,378 question-answer pairs and evaluate 12 open-source and 5 proprietary LMMs on them. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable gap relative to human-level performance.
- [2024/10/2] 🔥 Release the technical report for Q-Bench-Video.
- [2024/9/24] 🔥 Release the sample script for testing on Q-Bench-Video.
- [2024/9/20] 🔥 GitHub repo for Q-Bench-Video is online. Do you want to find out how your LMM performs on video quality understanding? Come and test it on Q-Bench-Video!

Dataset Download
In this benchmark, each data item follows the meta-structure tuple (V, Q, A, C): the video object V (a single video or a pair of videos), the video quality query Q, the set of possible answers A, and the correct answer C. The subcategories are organized as follows (a hypothetical example item is sketched after this list):
- Question Types: Yes-or-No Questions, What-How Questions, Open-ended Questions
- Quality Concerns: Technical, Aesthetic, Temporal, and AIGC
- Single Videos & Video Pairs: Single-Global, Single-Referring, Pair-Compare-Coarse, Pair-Compare-Fine
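For reference, a single data item can be pictured as the sketch below. The field names are illustrative assumptions chosen for readability, not the official JSON schema of the released files.

```python
# Hypothetical Q-Bench-Video item following the (V, Q, A, C) meta-structure.
# Field names are illustrative assumptions, not the official schema.
example_item = {
    "video": ["videos/example_001.mp4"],     # V: one path, or two paths for Pair-Compare items
    "question": "How would you rate the overall sharpness of this video?",  # Q
    "candidates": ["Good", "Fair", "Poor"],  # A: candidate answers; empty for Open-ended questions
    "answer": "Fair",                        # C: the correct answer
    "question_type": "What-How",             # Yes-or-No / What-How / Open-ended
    "concern": "Technical",                  # Technical / Aesthetic / Temporal / AIGC
    "scope": "Single-Global",                # Single-Global / Single-Referring / Pair-Compare-Coarse / Pair-Compare-Fine
}
```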
Among open-source models, mPLUG-Owl3 ranks first; among proprietary models, GPT-4o ranks first.
A Quick Look at the Q-Bench-Video Outcomes.
Model | Yes-or-No | What-How | Open-ended | Technical | Aesthetic | Temporal | AIGC | Overall |
---|---|---|---|---|---|---|---|---|
Random guess w/o Open-ended | 50.00% | 25.00% | / | 37.10% | 37.31% | 37.25% | 37.22% | 37.79% |
Human | 86.57% | 81.00% | 77.11% | 79.22% | 80.23% | 82.72% | 86.21% | 81.56% |
Open-source Image LMMs | ||||||||
LLaVA-Next (Mistral-7B) | 62.83% | 45.14% | 33.69% | 46.38% | 57.86% | 47.84% | 48.46% | 47.52% |
LLaVA-v1.5 (Vicuna-v1.5-13B) | 52.98% | 46.44% | 37.01% | 45.77% | 58.12% | 45.30% | 46.48% | 45.64% |
mPLUG-Owl2 (LLaMA2-7B) | 59.19% | 39.07% | 31.19% | 42.07% | 52.38% | 41.71% | 39.37% | 43.43% |
Open-source Video LMMs | ||||||||
mPLUG-Owl3 (Qwen2-7B) | 60.48% | 56.39% | 39.48% | 52.68% | 58.31% | 52.05% | 43.49% | 52.39% |
LLaVA-OneVision (Qwen2-7B) | 61.34% | 53.88% | 39.15% | 49.35% | 64.15% | 50.68% | 44.30% | 51.70% |
InternVL-Chat (Vicuna-7B) | 66.02% | 52.13% | 33.93% | 48.42% | 52.73% | 50.59% | 53.12% | 51.11% |
VILA1.5 (LLaMA3-8B) | 61.95% | 46.00% | 39.60% | 47.85% | 57.85% | 45.65% | 42.57% | 49.41% |
PLLaVA (Mistral-7B) | 65.63% | 52.33% | 32.23% | 49.69% | 61.32% | 50.96% | 53.64% | 50.39% |
LLaVA-Next-Video (Mistral-7B) | 61.34% | 45.95% | 38.10% | 49.03% | 60.94% | 46.97% | 49.40% | 48.69% |
ST-LLM (Vicuna-v1.1-7B) | 44.63% | 28.50% | 32.78% | 34.99% | 46.11% | 34.28% | 34.02% | 35.42% |
Video-LLaVA (Vicuna-v1.5-7B) | 64.67% | 40.79% | 29.11% | 43.25% | 54.04% | 42.38% | 42.76% | 43.49% |
VideoChat2 (Mistral-7B) | 56.09% | 29.98% | 34.99% | 39.26% | 50.02% | 38.25% | 35.88% | 40.56% |
Proprietary LMMs | ||||||||
Gemini 1.5 Flash | 65.48% | 56.79% | 47.51% | 54.11% | 66.58% | 53.51% | 50.22% | 56.78% |
Gemini 1.5 Pro | 65.42% | 62.35% | 47.57% | 56.80% | 69.61% | 53.38% | 53.26% | 58.63% |
GPT-4o mini | 62.95% | 50.93% | 42.10% | 49.38% | 60.90% | 48.43% | 41.71% | 52.20% |
GPT-4o | 67.48% | 58.79% | 49.25% | 56.01% | 58.57% | 65.39% | 52.22% | 58.70% |
GPT-4 Turbo | 66.93% | 58.33% | 40.15% | 54.23% | 66.23% | 54.00% | 52.04% | 56.36% |
We report the performance of LMMs against humans. Several conclusions can be drawn:
- General performance: Human > Proprietary LMMs > Open-source LMMs > Random guess.
- Open-ended questions are more challenging for LMMs.
- LMMs exhibit unbalanced performance across different types of distortions.
Test on Q-Bench-Video
Assume that you have downloaded Q-Bench-Video.
We provide a sample script, GPT_test.py, for testing models served through the GPT API format on Q-Bench-Video.
Use the following command, or adapt it as needed, to quickly test GPT or your own LMM:
python GPT_test.py --json_file path/to/Q_Bench_Video_dev.json --video_dir path/to/video/directory --output_file path/to/Q_Bench_Video_dev_response.json --api_key your_openai_api_key
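If you want to plug in your own LMM rather than the GPT API, the testing loop conceptually reduces to the minimal sketch below. Here `query_my_lmm` and the JSON field names (`video`, `question`, `candidates`) are placeholders for illustration; match them to the fields in the released Q_Bench_Video_dev.json and to your model's inference interface.

```python
import json

def run_benchmark(json_file, video_dir, output_file, query_my_lmm):
    """Minimal testing-loop sketch for a custom LMM.

    `query_my_lmm(video_paths, prompt)` is a placeholder for your model's
    inference call; the JSON field names below are illustrative only.
    """
    with open(json_file) as f:
        items = json.load(f)

    for item in items:
        video_paths = [f"{video_dir}/{v}" for v in item["video"]]
        prompt = item["question"]
        if item.get("candidates"):
            # Multiple-choice (Yes-or-No / What-How): append the options to the prompt.
            prompt += "\nOptions: " + ", ".join(item["candidates"])
        item["response"] = query_my_lmm(video_paths, prompt)

    with open(output_file, "w") as f:
        json.dump(items, f, indent=2)
```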
- Evaluating open-ended responses:
We provide a sample function to rate open-ended responses: `result` is the list of raw scores from repeated GPT gradings, while `score` is the standardized score.
from openai import OpenAI
import timeout_decorator

client = OpenAI(api_key="your_api_key")

def judge_open(question, answer, correct_ans):
    # Grade an open-ended response with GPT-4o five times and return both the
    # raw score list and the standardized score (averaged and mapped to [0, 1]).
    result = []
    msg = f'''Given the question [{question}], evaluate whether the response [{answer}] completely matches the correct answer [{correct_ans}].
First, check the response and please rate score 0 if the response is not a valid answer.
Please rate score 2 if the response completely or almost completely matches the correct answer on completeness, accuracy, and relevance.
Please rate score 1 if the response partly matches the correct answer on completeness, accuracy, and relevance.
Please rate score 0 if the response doesn't match the correct answer on completeness, accuracy, and relevance at all.
Please only provide the result in the following format: Score:'''
    print(msg)

    @timeout_decorator.timeout(5)
    def get_completion(msg):
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that grades answers related to visual video quality. There are a lot of special terms or keywords related to video processing and photography. You will pay attention to the context of 'quality evaluation' when grading."},
                {"role": "user", "content": msg},
            ],
        )
        return completion.choices[0].message.content

    # Query GPT-4o five times; record "N/A" for timeouts or API errors.
    for _ in range(5):
        try:
            response = get_completion(msg)
            print(response)
            result.append(response.split(": ")[-1].strip())
        except timeout_decorator.TimeoutError:
            result.append("N/A")
        except Exception:
            result.append("N/A")

    # Average the valid gradings and map them from {0, 1, 2} to [0, 1].
    cnt, score = 0, 0.0
    for r in result:
        if r in ['0', '1', '2']:
            cnt = cnt + 1
            score = score + float(r) / 2
    score = score / cnt if cnt > 0 else 0.0
    return result, score
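As a usage example, the function can be called on a single open-ended item as follows; the question and answers here are made up for illustration. Since each of the five gradings contributes at most 2 points and is divided by 2 before averaging, the standardized `score` lies in [0, 1].

```python
# Illustrative call with made-up strings; real items come from the benchmark JSON.
result, score = judge_open(
    question="Describe the most noticeable quality issue in this video.",
    answer="The video is heavily compressed, with obvious blocking artifacts.",
    correct_ans="Severe compression artifacts (blockiness) throughout the video.",
)
print(result)  # e.g. ['2', '2', '1', '2', '2'] -- raw scores from the 5 GPT gradings
print(score)   # standardized score in [0, 1]
```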
Please contact any of the first authors of this paper for queries.
- Zicheng Zhang, zzc1998@sjtu.edu.cn, @zzc-1998