This repository hosts the AI Hub's sports video dataset for baseball. The dataset comprises labeled video data designed for training and validating AI models in sports analytics, with a focus on baseball actions. You can find the original dataset here.
The dataset is organized into multiple categories, each corresponding to specific baseball actions and movements:
- `baseball_ra`: Includes only 'ra' annotations.
- `baseball_ra_ct`: Catcher throwing actions.
- `baseball_ra_ff`: Fielder catching a fly ball.
- `baseball_ra_fg`: Fielder catching a ground ball.
- `baseball_ra_ft`: Fielder throwing actions.
- `baseball_ra_hb`: Hitter bunting.
- `baseball_ra_hh`: Hitter hitting.
- `baseball_ra_hs`: Hitter swinging.
- `baseball_ra_pb`: Pitcher balk movements.
- `baseball_ra_po`: Pitcher overhand throws.
- `baseball_ra_pp`: Pitcher pick-off throws.
- `baseball_ra_ps`: Pitcher side-arm throws.
- `baseball_ra_pu`: Pitcher underhand throws.
- `baseball_ra_rr`: Runner running.
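The folder-name suffixes can be mapped back to readable action names programmatically. The sketch below illustrates one way to do this; the mapping is taken from the list above, while the helper function and its name are our own invention, not part of the repository:

```python
# Map AI Hub folder-name suffixes to the actions listed above.
SUFFIX_TO_ACTION = {
    "ct": "Catcher throwing",
    "ff": "Fielder catching a fly ball",
    "fg": "Fielder catching a ground ball",
    "ft": "Fielder throwing",
    "hb": "Hitter bunting",
    "hh": "Hitter hitting",
    "hs": "Hitter swinging",
    "pb": "Pitcher balk",
    "po": "Pitcher overhand throw",
    "pp": "Pitcher pick-off throw",
    "ps": "Pitcher side-arm throw",
    "pu": "Pitcher underhand throw",
    "rr": "Runner running",
}

def action_for_folder(folder_name: str) -> str:
    """Resolve a dataset folder name (e.g. 'baseball_ra_hb') to its action."""
    suffix = folder_name.rsplit("_", 1)[-1]
    return SUFFIX_TO_ACTION.get(suffix, "unknown")
```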
The dataset includes:
- Training Set: 3,888 video clips.
- Validation Set: 504 video clips.
Each action is represented by a series of image frames, which are provided in separate folders.
The dataset consists of 13 distinct columns, capturing various aspects of each video frame.
Given that the dataset is initially structured as image frames, we provide a script, `img2video.py`, for converting these frames into continuous video files. This conversion is essential for further processing, including skeleton value extraction using tools like MMAction2.
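The repository's `img2video.py` is not reproduced here, but a frames-to-video conversion of this kind can be sketched with OpenCV roughly as follows. The function names, frame-file pattern, codec, and frame rate are our assumptions, not the script's actual contents:

```python
import glob
import os
import re

def sorted_frames(paths):
    """Sort frame files by the numeric index embedded in the file name
    (so frame_2.jpg comes before frame_10.jpg)."""
    def key(p):
        m = re.search(r"(\d+)", os.path.basename(p))
        return int(m.group(1)) if m else 0
    return sorted(paths, key=key)

def frames_to_video(frame_dir, out_path, fps=30):
    """Write all JPEG frames in frame_dir into a single video file."""
    import cv2  # requires opencv-python
    frames = sorted_frames(glob.glob(os.path.join(frame_dir, "*.jpg")))
    h, w = cv2.imread(frames[0]).shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for f in frames:
        writer.write(cv2.imread(f))
    writer.release()
```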
Detailed usage instructions for dataset preprocessing, analysis, and model training are forthcoming.
We use the MMAction2 library to extract skeleton values from each action video. The process relies on the `ntu_pose_extraction.py` script (source code) that ships with MMAction2, which is designed to extract human poses in a format suitable for action recognition tasks and is run on each video file in our dataset:
python ntu_pose_extraction.py video_file.avi output.pkl
To run the extraction across all video files in our dataset, we provide a script, `extract_pkl.py`. This script automates the pose extraction for each video and compiles the results into a single PKL file, which can be directly used as input for modeling.
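The actual `extract_pkl.py` lives in the repository; conceptually, the automation can look like the sketch below, which shells out to `ntu_pose_extraction.py` once per video and bundles the per-video PKLs into one file. The directory layout, helper names, and the positional invocation of the extraction script are illustrative assumptions:

```python
import glob
import os
import pickle
import subprocess

def single_pkl_path(video_path, tmp_dir):
    """Derive the per-video PKL path from the video file name."""
    stem = os.path.splitext(os.path.basename(video_path))[0]
    return os.path.join(tmp_dir, stem + ".pkl")

def extract_all(video_dir, out_pkl, tmp_dir="tmp_pkl"):
    """Run pose extraction on every video, then bundle the results."""
    os.makedirs(tmp_dir, exist_ok=True)
    annotations = []
    for video in sorted(glob.glob(os.path.join(video_dir, "*.avi"))):
        single = single_pkl_path(video, tmp_dir)
        subprocess.run(
            ["python", "ntu_pose_extraction.py", video, single], check=True)
        with open(single, "rb") as f:
            annotations.append(pickle.load(f))
    with open(out_pkl, "wb") as f:
        pickle.dump(annotations, f)
```

Note that a real compilation step also has to attach labels and train/validation split information so the result matches the MMAction2 annotation format.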
The PKL files follow the format specified in the MMAction2 documentation, which can be found here.
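As a rough illustration of that format: field names below follow the MMAction2 skeleton-dataset documentation, while the concrete values are made up for the example:

```python
# One annotation dict per clip; keypoint data is stored as numpy ndarrays
# in real files, indicated here only as shape comments.
sample = {
    "frame_dir": "clip_0001",        # clip identifier (made up)
    "total_frames": 48,
    "img_shape": (1080, 1920),       # (height, width)
    "original_shape": (1080, 1920),
    "label": 8,                      # class index into the label map
    # "keypoint":       ndarray of shape (num_persons, total_frames, 17, 2)
    # "keypoint_score": ndarray of shape (num_persons, total_frames, 17)
}

data = {
    "split": {"xsub_train": ["clip_0001"], "xsub_val": []},
    "annotations": [sample],
}
```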
Further instructions on how to use the extracted skeleton data for training models using MMAction2 will be provided in subsequent updates.
PoseC3D is a skeleton-based action recognition model, which can be found in the MMAction2 model zoo. This model leverages the power of 3D pose estimation to recognize and classify human actions in videos.
The aim is to train the PoseC3D model using our custom dataset derived from AI_hub. The process involves:
- Utilizing pre-trained human detectors and pose estimators.
- Training the model on the custom dataset to generate labels for video objects.
To prepare the AI_hub dataset for PoseC3D model training, follow these steps:
1. Video Conversion: Since the AI_hub dataset is composed of individual image frames, convert these frames into video format, corresponding to each action.
2. Data Extraction and Compression: Extract skeleton and other relevant information from the converted videos, then compress this data into a single file.
3. Model Training: Use the compressed file to train the PoseC3D model.
Modify the existing model configuration file `slowonly_r50_8xb16-u48-240e_gym-keypoint.py`, located at `configs/skeleton/posec3d/`. Adjustments include setting `num_classes` to 13 and specifying the annotation file (`ann_file`) as `data/skeleton/output/custom.pkl`.
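The relevant edits might look like the fragment below. Only the changed fields are shown; the surrounding structure follows the usual MMAction2 config layout and may differ in the actual file:

```python
# configs/skeleton/posec3d/slowonly_r50_8xb16-u48-240e_gym-keypoint.py (excerpt)
model = dict(
    cls_head=dict(num_classes=13),  # 13 baseball action classes
)
ann_file = 'data/skeleton/output/custom.pkl'  # custom skeleton annotations
```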
By following these steps and configurations, the PoseC3D model can be effectively trained on the AI_hub baseball dataset, enabling accurate action recognition in sports video analysis.
To train the PoseC3D model on the custom dataset (AI_hub), follow these steps in Google Colab:
- Run the Training Command: Execute the training process using the following command in Colab:
!python tools/train.py configs/skeleton/posec3d/slowonly_r50_u48_240e_customdata_xsub_keypoint.py \
    --work-dir work_dirs/slowonly_r50_u48_240e_customdata_xsub_keypoint \
    --seed 0 \
    --resume
- Model: PoseC3D
- Input: `custom.pkl` (generated from the AI_hub dataset)
- Platform: Google Colab Pro+
- GPU: NVIDIA A100
- Training Duration: Approximately 6.3 hours
The model's performance was evaluated using the validation dataset as the test data. The results are as follows:
- Accuracy (Top-1): 0.9636
- Accuracy (Top-5): 0.9838
- Mean Accuracy: 0.9717
The best performing model checkpoint is saved as `best_acc_top1_epoch_22.pth`.
The following labels were used for the classification task, as specified in `label_map_custom.txt`:
- Catcher throw
- Fielder catch a fly ball
- Fielder catch a ground ball
- Fielder throw
- Hitter bunt
- Hitter hitting
- Hitter swing
- Pitcher balk
- Pitcher overhand throw
- Pitcher pick off throw
- Pitcher side arm throw
- Pitcher underhand throw
- Runner run
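For reference, a label map of this kind is a plain-text file with one label per line, where the line number is the class index; loading it is a one-liner. The list below mirrors the labels above, and the loader is a sketch of how such a file is typically consumed:

```python
# Class index -> label, reconstructed from the list above.
LABELS = [
    "Catcher throw",
    "Fielder catch a fly ball",
    "Fielder catch a ground ball",
    "Fielder throw",
    "Hitter bunt",
    "Hitter hitting",
    "Hitter swing",
    "Pitcher balk",
    "Pitcher overhand throw",
    "Pitcher pick off throw",
    "Pitcher side arm throw",
    "Pitcher underhand throw",
    "Runner run",
]

def load_label_map(path):
    """Read one label per line; the line number is the class index."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```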
This section outlines the detailed process of training the PoseC3D model using the custom dataset from AI_hub, including the training environment, duration, performance metrics, and the classification labels used.
This section details the procedure to use the trained PoseC3D model for skeleton-based action recognition in video files. The demo is designed to be run in a Google Colab environment.
- Human Detector: Faster-RCNN
- Pose Estimator: HRNet-w32
- Skeleton-based Action Recognizer: PoseC3D-CUSTOM-keypoint
- Model Checkpoint: `best_acc_top1_epoch_22.pth`
The following is an example of how to run the demo on a specific video file (`ZX0QNPTZO56M.mp4`). This script will process the video and generate an output video (`demo_output_ZX0QNPTZO56M.mp4`) with recognized actions.
!python demo/demo_skeleton.py \
video/mlb/ZX0QNPTZO56M.mp4 \
demo/demo_output_ZX0QNPTZO56M.mp4 \
--config configs/skeleton/posec3d/slowonly_r50_u48_240e_customdata_xsub_keypoint.py \
--checkpoint work_dirs/slowonly_r50_u48_240e_customdata_xsub_keypoint/best_acc_top1_epoch_22.pth \
--det-config demo/demo_configs/faster-rcnn_r50_fpn_2x_coco_infer.py \
--det-checkpoint http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth \
--det-score-thr 0.9 \
--det-cat-id 0 \
--pose-config demo/demo_configs/td-hm_hrnet-w32_8xb64-210e_coco-256x192_infer.py \
--pose-checkpoint https://download.openmmlab.com/mmpose/top_down/hrnet/hrnet_w32_coco_256x192-c78dce93_20200708.pth \
--label-map tools/data/skeleton/label_map_custom.txt
This script includes the paths to the necessary configuration files, checkpoints for the human detector and pose estimator, and the custom label map for the trained PoseC3D model.
- Input: Video of a baseball pitch
- Output: Label for the action in the video and a skeleton-annotated video
- Result: Skeleton successfully displayed, and the label for the input video correctly identified.

Test with Baseball Broadcast Video
- Problem: The model misidentified every person in the frame (pitcher, batter, catcher, umpire) as a batter.
- Cause: The model, which needs at least 15 frames to assess an action, was given only a single frame, leading to incorrect inference.
- Approach: Divide the video into 48-frame chunks, distinguish between individuals using cosine similarity, and then run PoseC3D pose estimation.
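A minimal sketch of the cosine-similarity matching step, assuming poses are flattened to 1-D vectors (e.g. 17 keypoints × 2 coordinates); the threshold value and function names are our choices, not the repository's:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length pose vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_person(prev_pose, candidate_poses, threshold=0.9):
    """Return the index of the candidate pose most similar to prev_pose,
    or None if nothing clears the threshold (i.e. treat as a new person)."""
    best_idx, best_sim = None, threshold
    for i, pose in enumerate(candidate_poses):
        sim = cosine_similarity(prev_pose, pose)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx
```

Tracking each person's identity across chunks this way lets a separate 48-frame clip be assembled per individual before it is passed to the action recognizer.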
To run a demonstration of the PoseC3D model on a custom video, use the following command. This script processes a specified video, applying skeleton-based action recognition, and outputs the results in the designated directory.
!python demo/custom_demo_skeleton.py \
./data/cut_video.mp4 \
--output-dir ./demo/middle \
--output-file output \
--config ./configs/skeleton/posec3d/slowonly_r50_u48_240e_customdata_xsub_keypoint.py \
--checkpoint ./work_dirs/slowonly_r50_u48_240e_customdata_xsub_keypoint/best_acc_top1_epoch_22.pth \
--output-fps 15
- `./data/cut_video.mp4`: Path to the input video file.
- `--output-dir ./demo/middle`: Directory where the output will be saved.
- `--output-file output`: Name of the output file.
- `--config`: Path to the model configuration file.
- `--checkpoint`: Path to the model checkpoint file.
- `--output-fps 15`: Frame rate of the output video.
This command generates a video in the `./demo/middle` directory with recognized actions and corresponding skeletons overlaid on the input video frames.
The PoseC3D model has been effectively trained to recognize and classify actions specific to a baseball pitcher. Below are some examples demonstrating the model's ability to identify a pitcher and their specific actions from video frames.
The model successfully detects the pitcher in various frames, highlighting the robustness of the pose estimation and action recognition capabilities.
Once the pitcher is recognized, the model further analyzes the specific actions performed by the pitcher. This feature is crucial for detailed analysis and understanding of the game. The following images showcase the model's capability in identifying distinct actions of a pitcher.
These examples demonstrate the model's efficacy in not only detecting the presence of a pitcher but also in accurately classifying their specific movements and actions.