NVlabs/VILA

Long context video module only

Great work and research.

My question is simply whether it is possible to use only the visual/video part (already pretrained on a video dataset such as Kinetics) and fine-tune it on a long-video dataset, e.g. to classify 1-minute or 2-minute video clips.
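
To make the setup concrete, this is roughly how I would pick frames from such a clip before passing them to the visual encoder. It is only a sketch: the function name `sample_frame_indices`, the 30 fps rate, and the 8-frame budget are my own assumptions for illustration, not anything taken from the VILA code base.

```python
def sample_frame_indices(duration_s: float, fps: float, num_frames: int) -> list[int]:
    """Return frame indices spread evenly over a clip.

    Hypothetical helper: the clip is split into `num_frames` equal
    segments and the middle frame of each segment is selected.
    """
    total = int(duration_s * fps)  # total number of frames in the clip
    if total <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total)  # cannot sample more frames than exist
    seg = total / num_frames             # segment length in frames
    return [int(seg * i + seg / 2) for i in range(num_frames)]


# Assumed example: a 1-minute clip at 30 fps, 8 frames for the encoder
indices = sample_frame_indices(duration_s=60, fps=30, num_frames=8)
```

The per-frame (or per-clip) features from the visual part would then be pooled and fed to a small classification head, which is the part I would like to fine-tune.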