
[NeurIPS2024 D&B Spotlight] GAIA: Rethinking Action Quality Assessment for AI-Generated Videos

License: Apache-2.0


Assessing action quality is both imperative and challenging due to its significant impact on the quality of AI-generated videos

Zijian Chen1, Wei Sun1, Yuan Tian1, Jun Jia1, Zicheng Zhang1,
Jiarui Wang1, Ru Huang2, Xiongkuo Min1, Guangtao Zhai1*, Wenjun Zhang1
1Shanghai Jiao Tong University,   2East China University of Science and Technology
*Corresponding author

Chinese quick-read version: Zhihu

Motivation: 1. Action quality has a significant impact on the overall quality of AI-generated videos. 2. Current action quality assessment (AQA) studies predominantly focus on domain-specific actions from real videos and collect coarse-grained, expert-only human ratings on limited dimensions.

Release

  • [2024/9/26] 🔥🔥🔥 GAIA is accepted to the NeurIPS 2024 D&B track as a Spotlight paper. We will update the arXiv version soon.
  • [2024/6/18] 🔥 The proposed GAIA dataset is online! Download it via OneDrive or Baidu Netdisk (code: ks51).
  • [2024/6/17] 🔥 We uploaded the action prompts used (prompts_all.csv) along with their corresponding categories (action_label.xlsx).
  • [2024/6/11] We are preparing the GAIA data and meta information.

  • [2024/6/6] GitHub repo for GAIA is online.

Info of GAIA Dataset

Download GAIA (9,180 videos) from the released links (OneDrive or Baidu Netdisk, code: ks51).

Video naming rules: (model name)_(action keyword).mp4

(action keyword) also serves as the key for looking up the corresponding action prompt in prompts_all.csv

GAIA
|
|--videos
|  |-- Anmidiff_Abseiling.mp4
|  |-- Anmidiff_Admiration.mp4
|  |-- ...
|  |-- zeroScope_Zumba.mp4
|
|-- MOS.csv
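
The naming rule above means each filename can be split into its model name and action keyword, which in turn indexes the prompt file. A minimal sketch in Python (the `action keyword` and `prompt` column names in prompts_all.csv are assumptions, not confirmed by this README):

```python
import csv
from pathlib import Path

def parse_video_name(path):
    """Split '(model name)_(action keyword).mp4' into its two parts."""
    stem = Path(path).stem
    model, _, keyword = stem.partition("_")
    return model, keyword

def find_prompt(keyword, csv_path="prompts_all.csv"):
    """Look up the action prompt for a keyword in prompts_all.csv.

    Column names ("action keyword", "prompt") are assumed here;
    check the actual CSV header before use.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("action keyword") == keyword:
                return row.get("prompt")
    return None

print(parse_video_name("Anmidiff_Abseiling.mp4"))  # ('Anmidiff', 'Abseiling')
```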

Info of MOS.csv

| filename | final action subject | final action completeness | final action interaction |
| --- | --- | --- | --- |
| Anmidiff_Abseiling.mp4 | 49.0098 | 46.9289 | 52.1406 |
| ... | ... | ... | ... |
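
With the column names shown above, MOS.csv can be read into a dictionary keyed by filename. A minimal sketch using only the standard library:

```python
import csv

def load_mos(csv_path="MOS.csv"):
    """Read MOS.csv into {filename: {dimension: score}}.

    Column names follow the table above.
    """
    scores = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            scores[row["filename"]] = {
                "subject": float(row["final action subject"]),
                "completeness": float(row["final action completeness"]),
                "interaction": float(row["final action interaction"]),
            }
    return scores
```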

Dataset Construction

In this work, we opt to collect annotations from a novel causal reasoning, syllogism-based perspective. We decompose an action process into three parts: 1) action subject as major premise, 2) action completeness as minor premise, and 3) action-scene interaction as conclusion. The rationales for this strategy are as follows: (a) The action subject carries the visually salient information in action-oriented videos, so its rendering quality profoundly affects the visibility of the action, and humans excel at perceiving such generated artifacts. (b) Moreover, unlike parallel-form feedback, the order of these three parts in the action syllogism inherently aligns with the human reasoning process.

As a result, a total of 971,244 ratings among 9,180 video-action pairs were collected.

Glance at the Performance of T2V Models in Action Generation

We evaluate 18 popular text-to-video (T2V) models on their ability to generate visually rational actions, revealing their pros and cons on different categories of actions.

Model-wise Comparison

Among open-source lab models, VideoCrafter2 takes first place. Among large-scale commercial applications, Morph Studio and Stable Video rank first and second.

Class-wise Comparison

Existing T2V models struggle to render actions with drastic motion changes, which more easily involve atypical body postures. Additionally, in the local hand-action categories, actions containing subtle movements receive significantly lower MOSs than others, showing an inferior capacity for generating fine-grained actions.

Performance Benchmark on GAIA

All-Combined indicates that we sum the MOS of the three dimensions and rescale it to $[0, 100]$ as the overall action quality score. $\spadesuit$, $\clubsuit$, $\diamondsuit$, and $\heartsuit$ denote the evaluated conventional AQA methods, action-related metrics, VQA methods, and video-text alignment metrics, respectively.
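
The All-Combined score described above can be sketched as follows. Note that min-max rescaling across the dataset is an assumption here; the README only states that the summed score is rescaled to $[0, 100]$, not how:

```python
def all_combined(scores):
    """Sum the three dimension MOSs per video and rescale to [0, 100].

    `scores` maps filename -> (subject, completeness, interaction).
    Min-max rescaling over the dataset is an assumed choice.
    """
    totals = {name: sum(dims) for name, dims in scores.items()}
    lo, hi = min(totals.values()), max(totals.values())
    span = (hi - lo) or 1.0  # avoid division by zero for a constant set
    return {name: 100.0 * (t - lo) / span for name, t in totals.items()}
```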

Quick Access of T2V Models

| Model | Code/Project Link |
| --- | --- |
| Text2Video-Zero | https://github.com/Picsart-AI-Research/Text2Video-Zero |
| ModelScope | https://modelscope.cn/models/iic/text-to-video-synthesis/summary |
| ZeroScope | https://huggingface.co/cerspense/zeroscope_v2_576w |
| LaVie | https://github.com/Vchitect/LaVie |
| Show-1 | https://github.com/showlab/Show-1 |
| Hotshot-XL | https://github.com/hotshotco/Hotshot-XL |
| AnimateDiff | https://github.com/guoyww/AnimateDiff |
| VideoCrafter1-512 / VideoCrafter1-1024 / VideoCrafter2 | https://github.com/AILab-CVC/VideoCrafter |
| Mora | https://github.com/lichao-sun/Mora |
| Gen-2 | https://research.runwayml.com/gen2 |
| Genmo | https://www.genmo.ai |
| Pika | https://pika.art/home |
| NeverEnds | https://neverends.life |
| MoonValley | https://moonvalley.ai |
| Morph Studio | https://www.morphstudio.com |
| Stable Video | https://www.stablevideo.com/welcome |

Contact

Please contact the first author of this paper for queries.

  • Zijian Chen, zijian.chen@sjtu.edu.cn

Citation

If you find our work interesting, please feel free to cite our paper:

@article{chen2024gaia,
  title={GAIA: Rethinking Action Quality Assessment for AI-Generated Videos},
  author={Chen, Zijian and Sun, Wei and Tian, Yuan and Jia, Jun and Zhang, Zicheng and Wang, Jiarui and Huang, Ru and Min, Xiongkuo and Zhai, Guangtao and Zhang, Wenjun},
  journal={arXiv preprint arXiv:2406.06087},
  year={2024}
}