The code and data are currently being organized and will be open-sourced after passing our de-identification review.
We collected real Tai Chi video data and had it professionally annotated with scores by sports experts. The data is intended to capture subtle, complex action features, in contrast to traditional classification-based evaluations that grade actions into discrete levels such as A, B, C, or D.
Why we use continuous variables as labels: In a classification setup, changing the granularity of performance ratings after the fact is cumbersome but usually feasible, e.g. by re-binning the dataset and retraining the model. However, finer-grained classification tasks tend to be considerably harder. Adopting smoothed continuous labels with a regression model yields both higher performance and finer-grained assessments, which better match real examination and teaching scenarios. This approach demands more annotation effort, but it is closer to real-world applications.
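To make the trade-off concrete, here is a minimal sketch of what discrete letter grades lose compared to continuous expert scores. The grade-to-score mapping and the example values are illustrative assumptions, not the dataset's actual annotation scheme.

```python
# Hypothetical mapping from letter grades to the centers of their score bins.
# These values are illustrative, not the dataset's real scheme.
GRADE_CENTERS = {"A": 0.90, "B": 0.70, "C": 0.50, "D": 0.30}

def grade_to_label(grade: str) -> float:
    """Coarse label: collapse a grade to the center of its bin."""
    return GRADE_CENTERS[grade]

def mae(preds, targets):
    """Mean absolute error between two score lists."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

# Expert-annotated continuous scores (illustrative):
expert_scores = [0.93, 0.88, 0.67, 0.52]
# Coarse labels erase within-grade differences: 0.93 and 0.88 both become 0.90.
coarse_labels = [grade_to_label(g) for g in ["A", "A", "B", "C"]]

print(mae(coarse_labels, expert_scores))  # irreducible error of coarse labels
```

A regression model trained on the continuous scores can, in principle, drive this residual error toward zero, whereas a classifier is bounded by the bin width.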
Why we don't directly compare feature values as in face recognition: In action scoring, directly comparing feature vectors can discard the spatial and temporal structure of the action. Moreover, sports experts have pointed out that a score should not rely solely on similarity to a reference action; it also involves a degree of subjectivity, even artistry. We want our data to carry this information and our model to be able to represent it.
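A toy example of the pitfall: once per-frame features are pooled over time (as is typical before a face-recognition-style similarity comparison), temporal order is gone. The "features" below are illustrative stand-ins, not real ST-GCN outputs.

```python
# Comparing time-pooled feature vectors cannot distinguish actions that
# differ only in temporal order -- mean pooling is order-invariant.
import math

def mean_pool(seq):
    """Average per-frame feature vectors over time."""
    dim = len(seq[0])
    return [sum(f[i] for f in seq) / len(seq) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

forward = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # correct movement order
reversed_ = forward[::-1]                        # same poses, wrong order

# Pooling erases order: similarity is exactly 1.0 even though the two
# executions differ, so a similarity-based score would be identical.
print(cosine(mean_pool(forward), mean_pool(reversed_)))  # 1.0
```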
https://drive.google.com/drive/folders/1ZTsiah25xqdNVz9kxE4-tHAG2uSbF-AC?usp=drive_link
| 8k_aug | 16k_aug |
|---|---|
| principle=0.4 | principle=0.6 |
| principle=1.0 | clip |
We initially aimed to handle classification and regression simultaneously with a one-stage approach. However, despite our efforts, even the best one-stage configuration (model iv) did not reach our target metrics for classification and regression. Additionally, under the guidance of experts, we designed a suitable data augmentation method.
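The expert-guided augmentation itself is not described here, so as a placeholder, here is a generic skeleton-sequence augmentation sketch (random planar rotation plus temporal cropping); it is a stand-in assumption, not the repository's actual method.

```python
# Illustrative skeleton-sequence augmentation: rotate joints in the x-y
# plane and take a random temporal crop. NOT the expert-guided method
# used in this project, whose details are not specified here.
import math
import random

def rotate_xy(skeleton, angle):
    """Rotate (x, y) joint coordinates about the origin by `angle` radians.
    skeleton: list of frames, each a list of (x, y) joints."""
    c, s = math.cos(angle), math.sin(angle)
    return [[(c * x - s * y, s * x + c * y) for x, y in frame]
            for frame in skeleton]

def temporal_crop(skeleton, length, rng):
    """Take a random contiguous window of `length` frames."""
    start = rng.randrange(len(skeleton) - length + 1)
    return skeleton[start:start + length]

rng = random.Random(0)
seq = [[(1.0, 0.0), (0.0, 1.0)] for _ in range(10)]  # 10 frames, 2 joints
aug = temporal_crop(rotate_xy(seq, math.pi / 2), 8, rng)
print(len(aug))  # 8 frames after cropping
```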
(Architecture diagrams: models i) and ii), iii), iv).)
Model | Taichi score MAE | Taichi classification Acc |
---|---|---|
i | 0.2021 | 59.17% |
ii | 0.0965 | 84.42% |
iii | 0.0862 | 86.26% |
iv | 0.0782 | 95.58% |
i) Extract features with the ST-GCN backbone and feed the resulting feature map into both the classification and regression heads, using CoLU.
ii) Building on i), add the data augmentation.
iii) Building on ii), split the feature map along the spatial dimension into two parts and feed them separately into the classification and regression heads.
iv) Building on iii), concatenate the feature embedding from the classification head with the input to the regression head.
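Variant iv) can be sketched roughly as follows, assuming (N, C, T, V) feature maps from the backbone. The split point, pooling, layer sizes, and activations are illustrative guesses, not the project's actual implementation.

```python
# Minimal sketch of variant iv): spatial split + classification-embedding
# concatenation into the regression head. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def heads_iv(feat, n_classes=4):
    """feat: (N, C, T, V) backbone feature map.
    Split along the joint (spatial) axis; run the classification head on
    one half; concatenate its embedding with the pooled other half and
    feed that into the regression head."""
    n, c, t, v = feat.shape
    cls_part, reg_part = feat[..., : v // 2], feat[..., v // 2:]

    # Global average pooling over time and joints -> (N, C).
    cls_feat = cls_part.mean(axis=(2, 3))
    reg_feat = reg_part.mean(axis=(2, 3))

    # Classification head: linear layer to an embedding, then logits.
    cls_emb = np.tanh(cls_feat @ rng.standard_normal((c, 16)))  # (N, 16)
    logits = cls_emb @ rng.standard_normal((16, n_classes))     # (N, n_classes)

    # Regression head: concatenate the classification embedding (the iv) step).
    reg_in = np.concatenate([reg_feat, cls_emb], axis=1)        # (N, C + 16)
    score = reg_in @ rng.standard_normal((c + 16, 1))           # (N, 1)
    return logits, score

logits, score = heads_iv(rng.standard_normal((2, 64, 30, 18)))
print(logits.shape, score.shape)  # (2, 4) (2, 1)
```

The intuition behind the concatenation is that the coarse grade embedding gives the regression head a prior about which score range the sample falls in.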
ST-GCN vs STD-GCN vs SST-GCN vs SSTD-GCN vs ST-GCN++ vs SSTD-GCN++ Demo
Google Colab Demo. Note: the metrics in the Colab demo may vary slightly across library versions, but overall performance should be approximately the same.
Model | NTU-RGB-D | Taichi | Param. (M) | FLOPs (G) |
---|---|---|---|---|
ST-GCN | 76.00% | 65.47% | 0.17 | 0.20 |
STGL-GCN | 77.50% | 83.75% | 2.78 | 1.89 |
SSTD-GCN(ours) | 87.00% | 99.17% | 0.18 | 0.11 |
ST-GCN++ | 90.50% | 93.33% | 3.09 | 0.60 |
SSTD-GCN++ (ours, embedded in ST-GCN++) | 92.00% | 99.58% | 0.32 | 0.61 |
Model | Spatial Separation | Temporal Dilation | Taichi score MAE |
---|---|---|---|
ix | ❌ | ❌ | 0.0355 |
x | ❌ | ❌ | 0.0295 |
xi | ❌ | ✔️ | 0.0243 |
xii | ✔️ | ❌ | 0.0261 |
xiii | ✔️ | ✔️ | 0.0196 |
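The "Temporal Dilation" column refers to dilating the temporal convolution so that it covers a wider frame span with the same number of taps. A minimal 1-D sketch of that idea (kernel values are illustrative; the actual SSTD-GCN layer is not reproduced here):

```python
# Sketch of temporal dilation: a dilated 1-D convolution along the frame
# axis enlarges the temporal receptive field without adding parameters.
def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1-D convolution along time with the given dilation."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # temporal receptive field in frames
    return [
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ]

signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
# Same 3-tap kernel; dilation=2 sees a 5-frame span per output step.
print(dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=1))  # 3-frame span
print(dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2))  # 5-frame span
```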