This repository contains the data and code for the paper "Models See Hallucinations: Evaluating the Factuality in Video Captioning".
FactVC-main/
├── data/
│ ├── activitynet/
│ │ ├── videos/ # sampled ActivityNet videos
│ │ ├── frames/ # extracted video frames
│ │ ├── captions/ # ground-truth and model-generated captions
│ │ ├── vids.txt # video ids
│ │ └── factuality_annotation.json # human factuality annotation
│ ├── youcook2/
│ │ ├── videos/ # sampled YouCook2 videos
│ │ ├── frames/ # extracted video frames
│ │ ├── captions/ # ground-truth and model-generated captions
│ │ ├── vids.txt # video ids
│ │ └── factuality_annotation.json # human factuality annotation
│ └── extract_frames.py
├── metric/
│ ├── clip/
│ ├── emscore/
│ └── factvc_corr.py # code to compute FactVC score and correlation
└── pretrained_models/
└── factvc_video.pth # our pretrained metric model
First, download the sampled ActivityNet and YouCook2 videos and unzip them into the corresponding videos/ folders. Then download the pretrained FactVC metric model and place it in the pretrained_models/ folder.
Then, extract video frames at 1 fps (these frames are used to compute the FactVC metric scores):
cd data/
python extract_frames.py --dataset activitynet
python extract_frames.py --dataset youcook2
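For reference, 1 fps frame extraction is commonly done by invoking ffmpeg with an `fps` filter. The sketch below only builds such a command; the actual extract_frames.py may use a different tool, flags, or output layout, so treat every name and path here as an assumption:

```python
import subprocess  # would be used to run the command, e.g. subprocess.run(cmd)
from pathlib import Path

def ffmpeg_frame_command(video_path, out_dir, fps=1):
    """Build an ffmpeg command that samples frames at the given rate.

    Illustrative only: the repository's extract_frames.py may differ.
    The output filename pattern frame_%04d.jpg is an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # ensure the frames/ folder exists
    return [
        "ffmpeg", "-i", str(video_path),
        "-vf", f"fps={fps}",               # sample one frame per second by default
        str(out_dir / "frame_%04d.jpg"),   # numbered JPEG frames
    ]
```

A script would then run the returned command once per video id listed in vids.txt.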
Now you can compute the FactVC scores and their correlation with the human factuality annotations:
cd metric/
python factvc_corr.py --dataset activitynet
python factvc_corr.py --dataset youcook2
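Conceptually, the correlation step compares the metric's per-caption scores against human factuality ratings. A plain-Python sketch of one such comparison (Pearson correlation) is shown below; factvc_corr.py may compute additional statistics such as Spearman or Kendall correlation, and all scores here are made up for illustration:

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values: metric scores vs. human ratings for five captions
metric_scores = [0.81, 0.42, 0.65, 0.30, 0.90]
human_ratings = [0.90, 0.50, 0.60, 0.20, 0.95]
correlation = pearson_corr(metric_scores, human_ratings)
```

A higher correlation indicates that the metric's ranking of captions agrees more closely with human factuality judgments.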
We acknowledge the EMScore project, on which our work is based.