Our paper focuses on analyzing fast, frequent, and fine-grained action sequences in videos, an important and challenging task for both computer vision and multi-modal LLMs. Existing methods fall short on this problem. In sports broadcast videos, for instance, ambiguous temporal boundaries and transitions between actions make it both difficult and unnecessary to localize or segment timestamps precisely; instead, we focus on accurately recognizing fine-grained action types. Moreover, frame-wise temporal supervision can even hurt performance due to over-segmentation, and a large number of actions in a short time span is especially challenging for video captioning. We build a new large-scale dataset
The code is tested on Linux (Ubuntu 22.04) with the dependency versions in requirements.txt.
Refer to the READMEs in the data directory for pre-processing and setup instructions.
To train a model, use `python3 train_vid2seq.py <dataset_name> <frame_dir> -s <save_dir> -m <model_arch>`.

- `<dataset_name>`: supports finetennis, badmintonDB, finediving, finegym
- `<frame_dir>`: path to the extracted frames
- `<save_dir>`: path to save logs, checkpoints, and inference results
- `<model_arch>`: feature extractor architecture (e.g., slowfast)
Training will produce checkpoints, predictions for the `val` split, and predictions for the `test` split at the best validation epoch.
Models and configurations can be found in `f3ast-model`. Place the checkpoint file and the `config.json` file in the same directory.
To perform inference with an already trained model, use `python3 test_e2e.py <model_dir> <frame_dir> -s <split> --save`. This will output results for four evaluation metrics: accuracy, edit score, success rate, and mean Average Precision.
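For intuition about the edit score metric, it is commonly computed as one minus the normalized Levenshtein distance between the predicted and ground-truth action label sequences. The sketch below is an illustration of that idea, not the repository's implementation, and the action names in the example are hypothetical:

```python
def edit_score(pred, gt):
    """Edit score: 1 - normalized Levenshtein distance between
    two sequences of action labels (higher is better)."""
    m, n = len(pred), len(gt)
    # Classic dynamic-programming edit distance table.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    denom = max(m, n, 1)
    return 1.0 - d[m][n] / denom

# One substitution out of three labels costs 1/3.
print(edit_score(["serve", "forehand", "volley"],
                 ["serve", "backhand", "volley"]))  # → 0.666...
```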
Each dataset has plaintext files that list the classes and the events in each video:

- `classes.txt`: the class names, one per line
- `{split}.json`: entries for each video and its contained events
[
{
"video": VIDEO_ID,
"num_frames": 518, // Video length
"events": [
{
"frame": 100, // Frame of the event
"label": CLASS_NAME // Event class
},
...
],
"fps": 25,
"width": 1280, // Metadata about the source video
"height": 720
},
...
]
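As a sketch of how an annotation file in this schema can be consumed (the video id, frame numbers, and class names in the inline sample below are hypothetical):

```python
import json

# A minimal {split}.json-style annotation, mirroring the schema above.
sample = json.loads("""
[
  {
    "video": "video1",
    "num_frames": 518,
    "events": [
      {"frame": 100, "label": "serve"},
      {"frame": 240, "label": "forehand"}
    ],
    "fps": 25,
    "width": 1280,
    "height": 720
  }
]
""")

for entry in sample:
    # Map each annotated event frame to its class label.
    labels_by_frame = {ev["frame"]: ev["label"] for ev in entry["events"]}
    print(entry["video"], entry["num_frames"], labels_by_frame)
```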
Frame directory
We assume pre-extracted frames that have been resized to a height of 224 pixels or similar. The frames are expected to be organized as `<frame_dir>/<video_id>/<frame_number>.jpg`. For example,
video1/
├─ 000000.jpg
├─ 000001.jpg
├─ 000002.jpg
├─ ...
video2/
├─ 000000.jpg
├─ ...
A similar format applies to the frames containing objects of interest.
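A minimal sketch of enumerating frames under this layout, assuming six-digit zero-padded frame names as shown above (the tiny directory tree built here is a stand-in for a real `<frame_dir>`):

```python
import os
import tempfile

# Build a small stand-in for <frame_dir>/<video_id>/<frame_number>.jpg.
# The video ids and frame counts are hypothetical.
frame_dir = tempfile.mkdtemp()
for video_id, num_frames in [("video1", 3), ("video2", 2)]:
    os.makedirs(os.path.join(frame_dir, video_id))
    for i in range(num_frames):
        # Zero-padded, six-digit frame file names, matching the layout above.
        open(os.path.join(frame_dir, video_id, f"{i:06d}.jpg"), "w").close()

def list_frames(frame_dir, video_id):
    """Return the sorted frame paths for one video."""
    video_path = os.path.join(frame_dir, video_id)
    return sorted(
        os.path.join(video_path, f)
        for f in os.listdir(video_path)
        if f.endswith(".jpg")
    )

frames = list_frames(frame_dir, "video1")
print([os.path.basename(f) for f in frames])  # → ['000000.jpg', '000001.jpg', '000002.jpg']
```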