NSVA

The official implementation of our paper "Sports Video Analysis on Large-scale Data" (https://arxiv.org/abs/2208.04897)


Sports Video Analysis on Large-Scale Data (Accepted by ECCV 2022)

Dekun Wu¹*, He Zhao²*, Xingce Bao³, Richard P. Wildes²

¹University of Pittsburgh    ²York University    ³EPFL

* Equal Contribution

Abstract: This paper investigates the modeling of automated machine description on sports video, which has seen much progress recently. Nevertheless, state-of-the-art approaches fall quite short of capturing how human experts analyze sports scenes. In this paper, we propose a novel large-scale NBA dataset for Sports Video Analysis (NSVA) with a focus on captioning, to address the above challenges. We also design a unified approach to process raw videos into a stack of meaningful features with minimum labelling efforts, showing that cross modeling on such features using a transformer architecture leads to strong performance. In addition, we demonstrate the broad application of NSVA by addressing two additional tasks, namely fine-grained sports action recognition and salient player identification.

Algorithm outline

Approach: Our approach relies on feature representations extracted from multiple orthogonal perspectives. We adopt the framework of UniVL [1], a network designed for cross-feature interactive modeling, as our base model. It consists of four transformer backbones responsible for coarse feature encoding (using TimeSformer [2]), fine-grained feature encoding (e.g., basket, ball, players), cross attention, and decoding, respectively.
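
For orientation, here is a minimal PyTorch sketch of that four-backbone layout. All class and argument names are illustrative, not the actual classes in this repository; the layer counts and feature dimension mirror the flags passed to the training commands below (--visual_num_hidden_layers 6, --cross_num_hidden_layers 3, --decoder_num_hidden_layer 3, --video_dim 768).

import torch
import torch.nn as nn

class SportsFormerSketch(nn.Module):
    # Hypothetical sketch of the four transformer backbones described above;
    # not the implementation used in this repository.
    def __init__(self, dim=768, n_visual=6, n_cross=3, n_decoder=3, vocab_size=30522):
        super().__init__()
        make_encoder = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12), num_layers=n)
        self.coarse_encoder = make_encoder(n_visual)  # TimeSformer clip features
        self.fine_encoder = make_encoder(n_visual)    # ball/basket/player features
        self.cross_encoder = make_encoder(n_cross)    # joint cross-feature modeling
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=12), num_layers=n_decoder)
        self.lm_head = nn.Linear(dim, vocab_size)     # caption token logits

    def forward(self, coarse_feats, fine_feats, caption_embeds):
        # Encode each stream separately, fuse with cross attention, then decode.
        v = self.coarse_encoder(coarse_feats)         # (T, B, dim)
        f = self.fine_encoder(fine_feats)             # (N, B, dim)
        fused = self.cross_encoder(torch.cat([v, f], dim=0))
        return self.lm_head(self.decoder(caption_embeds, memory=fused))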

Code Overview

The following sections contain scripts or PyTorch code for:

  • A. Download the pre-processed NSVA dataset.
  • B. Training/evaluation scripts: (1) video captioning, (2) action recognition, and (3) player identification.
  • C. Pre-trained weights.

Install Dependencies

  • python==3.6.9
  • torch==1.7.1+cu92
  • tqdm
  • boto3
  • requests
  • pandas
  • nlg-eval (install Java 1.8.0 or higher first)
conda create -n sportsformer python=3.6.9 tqdm boto3 requests pandas
conda activate sportsformer
pip install torch==1.7.1+cu92 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/Maluuba/nlg-eval.git@master

This code assumes CUDA support.
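
Before launching training, it can help to confirm that the installed build actually sees your GPUs; a quick check (note the captioning commands below assume 4 GPUs via --nproc_per_node 4):

import torch

print(torch.__version__)          # expect 1.7.1+cu92
print(torch.cuda.is_available())  # must be True for the commands below
print(torch.cuda.device_count())  # the captioning commands assume 4 GPUs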

Prepare the Dataset

Information about dataset preparation can be found at this link.

Video captioning

Run the following command to train and evaluate video description captioning from scratch:

cd SportsFormer
python -m torch.distributed.launch --nproc_per_node 4 main_task_caption.py --do_train --num_thread_reader 0 --epochs 20 --batch_size 48 --n_display 300 --train_csv data/ourds_train.44k.csv --val_csv data/ourds_JSFUSION_test.csv --data_path data/ourds_description_only.json --features_path data/ourds_videos_features.pickle --bbx_features_path data/cls2_ball_basket_sum_concat_original_courtline_fea.pickle --output_dir ckpt_ourds_caption --bert_model bert-base-uncased --do_lower_case --lr 3e-5 --max_words 30 --max_frames 30 --batch_size_val 1 --visual_num_hidden_layers 6 --decoder_num_hidden_layer 3 --cross_num_hidden_layers 3 --datatype ourds --stage_two --video_dim 768 --init_model weight/univl.pretrained.bin --train_tasks 0,0,1,0 --test_tasks 0,0,1,0

Or evaluate with our pre-trained model from the weights folder:

python -m torch.distributed.launch --nproc_per_node 4 main_task_caption.py --do_eval --num_thread_reader 0 --epochs 20 --batch_size 48 --n_display 300 --train_csv data/ourds_train.44k.csv --val_csv data/ourds_JSFUSION_test.csv --data_path data/ourds_description_only.json --features_path data/ourds_videos_features.pickle --bbx_features_path data/cls2_ball_basket_sum_concat_original_courtline_fea.pickle --output_dir ckpt_ourds_caption --bert_model bert-base-uncased --do_lower_case --lr 3e-5 --max_words 30 --max_frames 30 --batch_size_val 1 --visual_num_hidden_layers 6 --decoder_num_hidden_layer 3 --cross_num_hidden_layers 3 --datatype ourds --stage_two --video_dim 768 --init_model weight/best_model_vcap.bin --train_tasks 0,0,1,0 --test_tasks 0,0,1,0

Results reproduced from the pre-trained model

Description Captioning   CIDEr    METEOR   BLEU@1   BLEU@2   BLEU@3   BLEU@4   ROUGE-L
Our full model           1.1329   0.2420   0.5219   0.4080   0.3120   0.2425   0.5101
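
These are the standard captioning metrics reported by nlg-eval. If you want to score generated captions offline, here is a minimal sketch using nlg-eval's file-based API; the file names are placeholders, not outputs produced by this repository:

from nlgeval import compute_metrics

# One caption per line; hypothesis and reference files must be line-aligned.
# File names below are hypothetical placeholders.
metrics = compute_metrics(hypothesis='hyp_captions.txt',
                          references=['ref_captions.txt'],
                          no_skipthoughts=True, no_glove=True)
print(metrics)  # Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr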

Action recognition

Run the following command to train and evaluate action recognition from scratch:

cd SportsFormer
env CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 ./main_task_action_multifeat_multilevel.py

Results reproduced from the pre-trained model

Action Recognition        SuccessRate   mAcc.   mIoU
Our full model (Coarse)   60.14         61.20   76.61
Our full model (Fine)     46.88         51.25   57.08
Our full model (Event)    37.67         42.34   46.45
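
One plausible reading of the table's metrics treats each clip's labels as a set: SuccessRate counts exact set matches and mIoU averages the intersection-over-union between predicted and ground-truth label sets. A sketch under that assumption (the authoritative definitions are in the evaluation script):

def success_rate(preds, gts):
    # Fraction of clips whose predicted label set exactly matches the ground truth.
    return sum(set(p) == set(g) for p, g in zip(preds, gts)) / len(gts)

def mean_iou(preds, gts):
    # Mean intersection-over-union of label sets; assumes non-empty label sets.
    return sum(len(set(p) & set(g)) / len(set(p) | set(g))
               for p, g in zip(preds, gts)) / len(gts)

print(success_rate([[1, 2]], [[1, 2]]))  # 1.0
print(mean_iou([[1, 2]], [[1, 3]]))      # 0.333...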

Player identification

Run the following command to train and evaluate player identification from scratch:

cd SportsFormer
env CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 ./main_task_player_multifeat.py

Results reproduced from the pre-trained model

Player Identification   SuccessRate   mAcc.   mIoU
Our full model          4.63          6.97    6.86

Video downloading tools

If you would like to download the raw mp4 videos used in our dataset, run:

cd tools
python collect_videos.py

If you want to download other videos from NBA.com, run:

cd tools
python download_video_by_gameid_eventid_date.py
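
Both scripts boil down to fetching mp4 files over HTTP. A stripped-down sketch with requests (the URL below is a placeholder, not the actual NBA.com endpoint; the real URL pattern is constructed inside download_video_by_gameid_eventid_date.py):

import requests

url = "https://example.com/clip.mp4"  # placeholder video URL
resp = requests.get(url, stream=True, timeout=30)
resp.raise_for_status()
with open("clip.mp4", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):  # stream in 1 MiB chunks
        f.write(chunk)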

Citation

If you find this code useful in your work, please cite:

@inproceedings{dew2022sports,
  title={Sports Video Analysis on Large-Scale Data},
  author={Wu, Dekun and Zhao, He and Bao, Xingce and Wildes, Richard P.},
  booktitle={ECCV},
  month = {Oct.},
  year={2022}
}

Acknowledgement

This code base is largely built on UniVL. Many thanks to the authors.

License

The majority of this work, including code and data, is licensed under the Creative Commons Attribution-NonCommercial (CC-BY-NC) license. However, part of the project is available under a separate license term: UniVL is licensed under the MIT license.

Contact

Please contact Dekun Wu (dew104@pitt.edu) or He Zhao (zhufl@eecs.yorku.ca) with any issues.

References

[1] H. Luo et al., "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation," arXiv 2020.

[2] G. Bertasius et al., "Is Space-Time Attention All You Need for Video Understanding?" ICML 2021.