
VSTAR

This is the official implementation of the ACL 2023 paper "VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions"

Dataset

Schedule

  • Release dialogues
  • Release features (ResNet, RCNN)
  • Release test data (2024.07.17)
  • Release metadata (genres, keywords, storyline, characters: name, avatar)
  • Release frames

Dialogues

Supported languages: English, Simplified Chinese (简体中文)

  • Downloads

Storage: Train (196 MB); Valid (11.6 MB); Test (24 MB)

Links: BaiduNetDisk or GoogleDrive

  • Statistics
        Clips     Dialogues   Scenes/clip   Topics/clip
Train   172,041   4,319,381   2.42          3.68
Val     9,753     250,311     2.64          4.29
Test    9,779     250,436     2.56          4.12
  • Format
{
	"dialogs":[
		{
			"clip_id": "Friends_S01E01_clip_000",
			"dialog": ["hi", ...],
			"scene": [1, 1, 1, 1, 1, 1, 2, 2, ...],
			"session": [1, 1, 1, 2, 2, 2, 3, 3, ...]
		},
		...
	]
}
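
A quick sanity check of the released dialogue files (the file name below is an assumption; use whichever split you downloaded):

import json

with open("train.json", "r", encoding="utf-8") as f:  # or valid.json / test.json
    data = json.load(f)

for clip in data["dialogs"][:3]:
    scenes = clip["scene"]      # per-utterance scene id
    sessions = clip["session"]  # per-utterance topic (session) id
    # A transition is a position where the id changes between consecutive utterances.
    scene_transitions = sum(a != b for a, b in zip(scenes, scenes[1:]))
    topic_transitions = sum(a != b for a, b in zip(sessions, sessions[1:]))
    print(clip["clip_id"], len(clip["dialog"]), "utterances,",
          scene_transitions, "scene transitions,", topic_transitions, "topic transitions")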

Feature

  • Downloads

Storage: RCNN (246.2 GB), ResNet (109 GB)

Links: BaiduNetDisk

  • Format

File Structure:

# [name of TV show]_S[season]E[episode]_clip_[clip id].npy
├── Friends_S01E01
   └── Friends_S01E01_clip_000.npy
   └── Friends_S01E01_clip_001.npy
   └── ...
├── ...

ResNet:

# numpy.load("Friends_S01E01_clip_000.npy")
(num_of_frames * 1000) # one 1000-d ResNet feature vector per frame

RCNN:

# numpy.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
{
	"feature": (9 * num_of_frames * 2048) # array(float32), feature top 9 objects
	"size": (num_of_frames * 2) # list(int), size of original frame
	"box": (9 * num_of_frames * 4) # array(float32), bbox
	"obj_id": (9 * num_of_frames) # list(int), object id
	"obj_conf": (9 * num_of_frames) # array(float32), object conference 
	"obj_num": (num_of_frames) # list(int), number of objects/frame
}
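
A minimal loading sketch for the feature files (field names follow the format above; the directory paths are assumptions based on the file structure shown):

import numpy as np

clip = "Friends_S01E01_clip_000.npy"  # example clip from the file structure above

# ResNet features: one 1000-d vector per frame.
resnet_feat = np.load("RESNET/Friends_S01E01/" + clip)  # path layout is an assumption
print("ResNet:", resnet_feat.shape)

# RCNN features: a pickled dict with object-level fields per frame.
rcnn = np.load("RCNN/Friends_S01E01/" + clip, allow_pickle=True).item()
print("RCNN keys:", sorted(rcnn.keys()))
print("feature:", np.asarray(rcnn["feature"]).shape)    # top-9 object features
print("box:", np.asarray(rcnn["box"]).shape)            # bounding boxes
print("obj_conf:", np.asarray(rcnn["obj_conf"]).shape)  # detection confidences
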
  • Feature Extraction Tools

Please refer to OpenViDial_extract_features.

Installation

pip install -r requirements.txt

Scene Segmentation

  • Preprocess

Move train.json, valid.json, and test.json into the inputs/full directory.

Then run the following script to convert the original annotations into the binary format used by our baseline models (see the paper for details):

cd inputs/full
python preprocess.py
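
For reference, the sketch below illustrates what the binary format amounts to: per-utterance scene/session ids become 0/1 boundary labels, with a 1 wherever the id changes from the previous utterance. The actual conversion is done by preprocess.py; this snippet is only an illustration of the target labels.

def to_boundary_labels(ids):
    """Turn per-utterance ids such as [1, 1, 1, 2, 2, 3] into binary
    boundary labels [0, 0, 0, 1, 0, 1]."""
    return [0] + [int(curr != prev) for prev, curr in zip(ids, ids[1:])]

# Using the scene labels from the dialogue format example above.
print(to_boundary_labels([1, 1, 1, 1, 1, 1, 2, 2]))  # [0, 0, 0, 0, 0, 0, 1, 0]
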
  • Train
python train_seg.py \
	--video 1 \
	--exp_set EXP_LOG \
	--train_batch_size 4
  • Infer
python generate_seg.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 1

Topic Segmentation

  • Train
python train_seg.py \
	--video 0 \
	--exp_set EXP_LOG \
	--train_batch_size 4
  • Infer
python generate_seg.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 0

Dialogue Generation

To use coco_caption for evaluation, run the following script to generate the reference file:

cd inputs/full
python coco_caption_reformat.py

For evaluation details, please refer to: https://github.com/tylin/coco-caption
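
Once the reference and hypothesis files are in that format, the metrics can be computed with the pycocoevalcap modules from the coco-caption repository. A minimal sketch (the ids and sentences below are placeholders, not the exact files produced by coco_caption_reformat.py):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both dicts map an example id to a list of tokenized sentences.
gts = {"0": ["how are you doing today"]}  # reference(s) per id
res = {"0": ["how are you today"]}        # one generated hypothesis per id

bleu_scores, _ = Bleu(4).compute_score(gts, res)  # BLEU-1 through BLEU-4
cider_score, _ = Cider().compute_score(gts, res)
print("BLEU-4:", bleu_scores[3], "CIDEr:", cider_score)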

  • Train
python train_gen.py \
	--train_batch_size 4 \
	--model bart \
	--exp_set EXP_LOG \
	--video 1 \
	--fea_type resnet

  • Infer
python generate.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 1 \
	--sess 1 \
	--batch_size 4

Citation

@misc{wang2023vstar,
    title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions},
    author={Yuxuan Wang and Zilong Zheng and Xueliang Zhao and Jinpeng Li and Yueqian Wang and Dongyan Zhao},
    year={2023},
    eprint={2305.18756},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}