RenShuhuai-Andy/TimeChat

For different video datasets, is the frame density always drawn at intervals of 1 second?

Closed this issue · 5 comments


Hi, thanks for your interest.

As mentioned in Section 4.1 of our paper, we sample 96 frames uniformly for each dataset.

@RenShuhuai-Andy Is there a way to increase the number of sampled frames via the eval configs if I want to use the same inference checkpoint on custom videos?

Hi, @rahulkrprajapati

If you want to use more frames in TimeChat, you can change max_frame_pos, n_frms, and num_frm from 96 to any larger number, then run inference directly. However, this will consume more GPU memory, and I'm not sure how performance holds up in this setting. We will explore a stronger TimeChat with long-context capability in the future.
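For reference, here is a hypothetical sketch of where those fields might live in an eval config. The key names (max_frame_pos, n_frms) come from the suggestion above, but the exact nesting depends on your checkout, so treat this as illustrative only (num_frm is typically set where the video frames are actually loaded):

```yaml
# hypothetical layout; match it to your actual eval YAML
model:
  max_frame_pos: 192   # raised from 96
datasets:
  webvid:
    vis_processor:
      train:
        n_frms: 192    # raised from 96
```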

Hey @RenShuhuai-Andy, thank you for the quick response. After I make these changes:

# imports below assume the TimeChat repo layout; adjust if your checkout differs
from timechat.common.config import Config
from timechat.common.registry import registry
# parse_args is defined in the accompanying demo/eval script

print('Initializing Chat')
args = parse_args()
cfg = Config(args)

DIR = "ckpt/timechat"
MODEL_DIR = f"{DIR}/timechat_7b.pth"

# build the model from the config and move it to the chosen GPU
model_config = cfg.model_cfg
model_config.device_8bit = args.gpu_id
model_config.ckpt = MODEL_DIR
model_cls = registry.get_model_class(model_config.arch)
model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))
model.eval()

# build the visual processor from the dataset config
vis_processor_cfg = cfg.datasets_cfg.webvid.vis_processor.train
vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)

is this warning expected?

video_frame_position_embedding size is not the same, interpolate from torch.Size([96, 768]) to torch.Size([192, 7

INFO:root:use gradient checkpointing for LLAMA
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
INFO:root:Loading LLAMA Done
INFO:root:Using LORA
INFO:root:Loading LLAMA proj
INFO:root:LLAMA proj is frozen
INFO:root:Loading llama_proj Done
trainable params: 0 || all params: 6,771,970,048 || trainable%: 0.0
INFO:root:video_Qformer is frozen
Load first Checkpoint: ckpt/timechat/timechat_7b.pth
video_frame_position_embedding size is not the same, interpolate from torch.Size([96, 768]) to torch.Size([192, 7

Yes, this is expected. TimeChat performs position-embedding interpolation whenever the number of frames exceeds 96.
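To illustrate the idea (this is a minimal sketch, not TimeChat's actual code): the frame position-embedding table is a [num_frames, dim] tensor, and going from 96 to 192 frames means linearly resampling along the frame axis while the embedding dimension stays fixed. The function name below is hypothetical.

```python
import torch
import torch.nn.functional as F

def interpolate_frame_pos_embed(pos_embed: torch.Tensor, new_len: int) -> torch.Tensor:
    """Linearly resample a [old_len, dim] position-embedding table to [new_len, dim]."""
    # F.interpolate expects [batch, channels, length], so treat dim as channels
    resized = F.interpolate(
        pos_embed.t().unsqueeze(0),   # [1, dim, old_len]
        size=new_len,
        mode="linear",
        align_corners=False,
    )
    return resized.squeeze(0).t()     # [new_len, dim]

old_table = torch.randn(96, 768)
new_table = interpolate_frame_pos_embed(old_table, 192)
print(new_table.shape)  # torch.Size([192, 768])
```

This is why the warning is benign: the checkpoint's 96-entry table is stretched to the new length at load time rather than causing a shape mismatch error.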