Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition
This repository is the official implementation of Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition (Under Review).
- 1. Change Log
- 2. Prerequisites
- 3. Prepare the Datasets
- 4. Run the Code
- 5. Results
- 6. License
- 7. Acknowledgement
- 8. Citation
- [2023/03/07] Code Upload.
pip install -r requirements.txt
Please refer to CTR-GCN and follow the instructions in section Data Preparation to prepare NTU RGB+D 120.
For your convenience, here is the excerpt of the instructions in section Data Preparation:
DownLoad
- Request dataset here: https://rose1.ntu.edu.sg/dataset/actionRecognition
- Download the skeleton-only datasets:
- nturgbd_skeletons_s001_to_s017.zip (NTU RGB+D 60)
- nturgbd_skeletons_s018_to_s032.zip (NTU RGB+D 120)
- Extract above files to ./data/nturgbd_raw
Directory Structure
Put downloaded data into the following directory structure:
- data/
- ntu/
- ntu120/
- nturgbd_raw/
- nturgb+d_skeletons/ # from `nturgbd_skeletons_s001_to_s017.zip`
...
- nturgb+d_skeletons120/ # from `nturgbd_skeletons_s018_to_s032.zip`
...
Generating Data
Generate NTU RGB+D 120 dataset:
cd ./data/ntu120
# Get skeleton of each performer
python get_raw_skes_data.py
# Remove the bad skeleton
python get_raw_denoised_data.py
# Transform the skeleton to the center of the first frame
python seq_transformation.py
DownLoad
Directory Structure
Generating Data
DownLoad
- Request dataset here: https://h2odataset.ethz.ch/ . You can get the username and password from the download page.
- Download the dataset directly from the download page or using
download_script.py
in h2odataset repo (we have included it in./data/h2o/download_scirpt.py
in this repo)Selectcd ./data/h2o python download_script.py --username "username" --password "password" --mode pose --dest "dest folder path"
pose
mode to download only pose (hand, object, egocentric view) without RGB-D images. - Extract the downloaded files.
Directory Structure
path/to/your/extracted/files
├── label_split
├── subject1
│ ├── h1
│ │ ├── 0
│ │ │ └── cam4
│ │ │ ├── cam_pose
│ │ │ ├── hand_pose
│ │ │ ├── hand_pose_MANO
│ │ │ ├── obj_pose
│ │ │ ├── obj_pose_RT
│ │ │ ├── action_label
│ │ │ └── verb_label
│ │ ├── 1
│ │ ├── 2
│ │ ├── 3
│ │ └── ...
│ ├── h2
│ ├── k1
│ ├── k2
│ ├── o1
│ └── o2
├── subject2
├── subject3
├── subject4
└── object
Generating Data
Generate H2O pth files using ./data/h2o/generate_h2o.py
.
python generate_h2o.py --root path/to/your/extracted/files --dest ./h2o_pth --frames 120
DownLoad
- Submit an access request with your google account in Google Drive. Download
poses_60fps
directly or using scripts in assembly101-download-scripts. - Download fine-grained-annotations in Google Drive
Directory Structure
path/to/your/downdload/root
├── fine-grained-annotations
│ ├── actions.csv
│ ├── head_actions.csv
│ ├── test.csv (@30fps)
│ ├── test_challenge.csv (@30fps)
│ ├── train.csv (@30fps)
│ └── validation.csv (@30fps)
└── poses_60fps
├── nusar-2021_action_both_9011-a01_9011_user_id_2021-02-01_153724.json
├── nusar-2021_action_both_9011-b06b_9011_user_id_2021-02-01_154253.json
├── ...
Generating Data
TBD
The test set has a less number of valid samples than the provided test_challenge.csv
. The 1018 invlid test samples (about 5%) has no pose data and will fail to predict. This may cause lower accuracy reports in CodaLab Challenge Page. More information about this could be found in discussions assembly101 Issue#4.
The Cross-subject (X-Sub) and Cross-set (X-Set) criteria are employed, using only the joint modal data to ensure fair comparisons without fusing multiple modalities.
X-Sub
python main.py --config config/ntu/ntu26_xsub_joint.yaml
X-Set
python main.py --config config/ntu/ntu26_xset_joint.yaml
5-fold cross validation approach suggested in SBU is adopted. To get accuracy for each fold, arg fold
should be set to 0, 1, 2, 3 or 4 in sbu_noisy_joint.yaml
and sbu_clean_joint.yaml
. Run each command for 5 times with different fold
and average the test results.
Noisy
python main.py --config config/sbu/sbu_noisy_joint.yaml
Clean
python main.py --config config/sbu/sbu_clean_joint.yaml
Train & Validate
python main.py --config config/h2o/h2o.yaml
Generate JSON File for Test Submission
python main.py --config config/h2o/h2o_get_test_results.yaml --weights path/to/your/checkpoint
Submit zipped json file action_labels.json
in CodaLab Challenge H2O - Action to get the test results.
Train & Validate
# Action (mandatory): 1380 classes
python main.py --config config/asb/asb_action.yaml
# Verb (optional): 24 classes
python main.py --config config/asb/asb_verb.yaml
# Object (optional): 90 classes
python main.py --config config/asb/asb_noun.yaml
Generate JSON File for Test Submission
# Action (mandatory): 1380 classes
python main.py --config config/asb/asb_action_get_test_results.yaml --weights path/to/your/action/checkpoint
# Verb (optional): 24 classes
python main.py --config config/asb/asb_verb_get_test_results.yaml --weights path/to/your/verb/checkpoint
# Object (optional): 90 classes
python main.py --config config/asb/asb_noun_get_test_results.yaml --weights path/to/your/noun/checkpoint
Submit zipped json file preds.json
in CodaLab Challenge Assembly101 3D Action Recognition to get the test results.
TBD
Grateful to the collaborators/maintainers of STTFormer, CTR-GCN, MS-G3D, h2odataset, Assembly101 repository. Thanks to the authors for their great work.
If you find this work or code helpful in your research, please consider citing:
TBD