Code for Video Pose Distillation
See our project website for the paper and details. Published in ICCV 2021.
@inproceedings{vpd_iccv21,
author={Hong, James and Fisher, Matthew and Gharbi, Micha\"{e}l and Fatahalian, Kayvon},
title={{V}ideo {P}ose {D}istillation for {F}ew-{S}hot, {F}ine-{G}rained {S}ports {A}ction {R}ecognition},
booktitle={ICCV},
year={2021}
}
For code in this repository, see LICENSE.
Usage
This repository contains code for VPD and VIPE*, as described in our paper.
VIPE*
To apply the VIPE* model:
./apply_vipe_model.py <pose_dir> <model_dir> -o <out_dir>
- pose_dir : the directory containing the 2D poses for each video
- model_dir : path to trained model
- out_dir : path to save features to
To train a VIPE* model see train_vipe_model.py
.
Example: ./train_vipe_model.py --dataset 3d --save_dir <model_dir>
Preprocessed 3D pose data for training is available here: VIPE-data.zip.
This archive includes ground truth 3D pose and 2D pose from different camera views.
Extract to data/vipe
or update the paths in vipe_dataset_paths.py
.
For details on preprocessing, see preprocess_3d_pose.py.
A pre-trained VIPE model is available: VIPE-model.zip.
VPD
Data preparation
To prepare the sports datasets, there are several steps:
- Fetching the videos
- Pose detection / tracking
- Extracting crops (see extract_square_crops.py)
- Computing optical flow (see raft/README.md)
Our pose and tracking annotations can be found here: URL
For the source videos:
- Diving48 : see original authors' website
- Floor exercise : obtain from FineGym authors, recut using recut_finegym_video.py. If using our pose annotations, make sure the frame rates match for each video or adapt accordingly.
- Figure skating : see fs-videos.csv and recut_fs_video.py
- Tennis : see tennis-videos.csv
It is recommended to unzip the files to the paths defined in video_dataset_paths.py
or to update those paths to where the pose files are stored. For example:
diving48
|---pose
|---crops
\---videos
fs
|---pose
|---crops
\---videos
...
To train a student model:
./train_vpd_model.py <dataset> --save_dir <model_dir> --emb_dir <teacher_dir> --flow_img <flow_name> --motion
- dataset : the sports dataset to specialize to (e.g.,
fs
) - model_dir : path to save models to
- flow_name : the name of the flow images for the crops, which have names
<frame_no>.<flow_name>.png
- teacher_dir : path to the teacher's features
To apply a student model:
./apply_vpd_model.py <model_dir> -d <dataset> -o <out_dir> --flow_img <flow_name>
- model_dir : path to the trained model
- out_dir : path to save features to
- flow_name : should be the same used for training
The student maintains the same output file formats as the teacher.
Downstream tasks:
For action recognition:
./recognize.py -d <dataset> <feature_dir>
- dataset : the sports dataset
- feature_dir : the directory containing the pose features
See options such as --retrieve
for the retrieval task.
For detection, see detect.py
.
Pre-trained VPD and VIPE* features/embeddings are available at URL.
To use the Diving48 and FineGym (Floor Exercise) datasets, you need to download the labels per the READMEs in the diving48/data
and finegym/data
subdirectories.
Data formats
Video naming conventions
For Diving48 and FineGym, we maintain the original authors' video naming scheme.
For figure skating, videos (routines) are named by <video>_<number>_<start_frame>_<end_frame>.mp4
.
For tennis, videos (points) are named by: <video>_<start_frame>_<end_frame>.mp4
. Pose for each video is prefixed by front__
or back__
to denote the player.
2D pose format
Pose for each video is organized as follows:
men_olympic_short_program_2010_01_00011475_00015700
|---boxes.json
|---coco_keypoints.json.gz
|---mask.json.gz
\---meta.json
The format for boxes.json
is:
[
[frame_num, [x, y, w, h]], ...
]
The format coco_keypoints.json.gz
is:
[
[
frame_num, [[score, [x, y, w, h], [[x, y, score] * 17]]], ...]
],
...
]
The format of mask.json.gz
:
[
[
frame_num, [[score, [x, y, w, h], base64_encoded_png], ...]
],
...
]
Crop directories
Crops around the athlete, for training VPD, are extracted per video (see extract_square_crops.py):
men_olympic_short_program_2010_01_00011475_00015700
|---0.png // <frame_num>.png
|---0.prev.png
|---0.flow.png
|---0.mask.png
|---1.png
|---1.prev.png
...
For tennis, the format is slightly different:
usopen_2015_mens_final_federer_djokovic
|---back
| |---0.png // <frame_num>.png
| |---0.prev.png
| |---0.flow.png
| |---0.mask.png
| ...
|
\---front
|---0.png
|---0.prev.png
|---0.flow.png
|---0.mask.png
...
Features / embedding format
Embeddings are stored as pickle files, one per video. The format for each video is:
[
(frame_num, ndarray, {metadata dict}), ...
]
The ndarray may be 1D or 2D, depending on data augmentation (e.g., flip).