This repository contains our champion solutions to the Ego4D challenges at the ECCV 2022 workshop.
- Ego4D Slides (in Chinese)
- Ego4D Solutions (in Chinese)
(2023/10/10) We use the weights pre-trained on the verb subset for TAL on the Perception Test task, bringing a 2-point performance improvement. The extracted features can be downloaded here.
(2023/07/13) We release the ViT-L weights finetuned on the Ego4D-MQ dataset.
(2023/04/11) 🚀We release the leading model for the SCOD task.
(2022/12/11) 🚀🚀We release code and checkpoints for pretraining, the FHP task, and the SCOD task.
(2022/12/01) 🚀The VideoMAE features for MQ and NLQ are released.
(2022/11/17) 🔄The repository is created.
- Code for the feature extractor
- Verb and noun features (VideoMAE-L) for MQ and NLQ
- Code for pretraining
- Code for STA
- Code for Hands (FHP)
- Code and checkpoints for SCOD
We provide the video features extracted by VideoMAE-L pretrained on the verb and noun subsets.
| Feature | Baidu Netdisk | Zenodo |
| --- | --- | --- |
| MQ (verb) | Download (code: sxda) | Download |
| NLQ (verb) | Download (code: teod) | Download |
| NLQ (noun) | Download (code: wrop) | Download |
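The exact layout inside the archives may differ; as a minimal loading sketch, assuming one `torch`-loadable feature tensor per clip (the file name below is hypothetical):

```python
import torch

# Hypothetical per-clip feature file; adjust the path to match
# the archive you downloaded.
feat = torch.load("mq_verb_features/<clip_uid>.pt", map_location="cpu")
print(feat.shape)  # typically (num_temporal_segments, feature_dim)
```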
You can find more details in our technical report.
Our training strategy is based on the vanilla method and is easy to follow. We use the VideoMAE codebase for training and validation; before training, follow its instructions to install the Python environment. For rapid development, we further split the training annotations filtered by EgoVLP; the resulting annotation files are available here. We release the checkpoints in the table below.
| Method | Pretrain | Resolution | Subset | Top-1 | Top-5 | Weights |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-L | K700 | 224x224 | verb | 52.51 | 86.05 | Download |
| ViT-L | K700 | 224x224 | noun | 33.41 | 85.51 | Download |
| ViT-L | K700+verb | 224x224 | MQ | - | - | Download |
| UniFormer-B | K600 | 320x320 | verb | 49.30 | 83.61 | Download |
Note: for the ViT-L weights finetuned on the MQ task, some keys in the state_dict may need to be renamed to match the model code.
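As a minimal sketch of such a rename (assuming the mismatch is a wrapper prefix such as `module.` left over from DDP; the file name below is hypothetical):

```python
import torch

# Hypothetical file name: point this at the downloaded MQ checkpoint.
ckpt = torch.load("vit_l_mq_finetuned.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

# Inspect the keys first, then strip whatever wrapper prefix does not
# match your model definition (e.g. "module." or "backbone.").
print(list(state_dict)[:5])
state_dict = {k.removeprefix("module."): v for k, v in state_dict.items()}

# Load with strict=False and check what is still mismatched:
# missing, unexpected = model.load_state_dict(state_dict, strict=False)
```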
We provide training scripts for SLURM. If you want to use PyTorch-DDP mode, use the scripts in `scripts/pytorch_ddp`.

```bash
bash scripts/slurm/ego4d_verb_slurm_pretrain_vitl_k400.sh
```

In the script, you need to set the appropriate `OUTPUT_DIR` and `MODEL_PATH`.
We use the ViT-Large model to train the STA task.
```bash
sh scripts/slurm/sta_train.sh
```

For validation:

```bash
cd forecasting_eval
sh sta_val.sh
```
We train the FHP task using UniFormer-B and the weights pretrained on the Ego4D verb subset.
We provide training scripts for SLURM. If you want to use PyTorch-DDP mode, use the scripts in `scripts/pytorch_ddp`.

```bash
bash scripts/slurm/ego4d_hands_uniformer.sh
```

In the script, you need to set the appropriate `OUTPUT_DIR` and `MODEL_PATH`.
We also provide scripts for validation and testing. You can launch the script below to validate a specific checkpoint's performance.

```bash
bash scripts/slurm/ego4d_hands_uniformer_val.sh
```

In the script, you need to set the appropriate `OUTPUT_DIR`, `MODEL_PATH`, `--test_subset`, and `--test_num_segment`.
Our detection code for SCOD is developed on top of MMDetection.
Download the converted annotations for SCOD: download.
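Since MMDetection consumes COCO-style JSON by default, a quick sanity check of the converted annotations might look like this (the file name is hypothetical, and we assume the standard COCO layout):

```python
import json

# Hypothetical file name; we assume the converted annotations follow the
# standard COCO layout (images / annotations / categories), which is what
# MMDetection reads by default.
with open("scod_train_coco.json") as f:
    ann = json.load(f)

print(len(ann["images"]), "images,", len(ann["annotations"]), "boxes")
print("first categories:", [c["name"] for c in ann["categories"]][:10])
```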
We report the performance on the validation set and release the checkpoints in the table below.
| Method | Pretrain | Resolution | AP | AP50 | AP75 | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| UniFormer-L | IN-1K | 800-1600/2000 | 24.8 | 44.2 | 24.0 | config | ckpt \| log |
| Swin-L | IN-22K+O365 | 800-1600/2000 | 36.4 | 56.5 | 37.6 | config | ckpt \| log |
To train UniFormer-L + DINO on the SCOD training set with 8 GPUs for 12 epochs:

```bash
sh tools/dist_train.sh configs/scod/dino_5scale_uniformer-l_8x2_12e_scod_imagenet1k.py 8
```

To test UniFormer-L + DINO on the SCOD validation set with 8 GPUs:

```bash
sh tools/dist_test.sh configs/scod/dino_5scale_uniformer-l_8x2_12e_scod_imagenet1k.py <ckpt-path> 8 --eval bbox
```
It should give:
```
Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100  ] = 0.248
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.442
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.240
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.002
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.075
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.282
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100  ] = 0.638
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300  ] = 0.638
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.638
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.054
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.321
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.697
```
If this work is helpful for your research, please consider citing our technical report.
```bibtex
@article{chen2022ego4d,
  title={InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges},
  author={Chen, Guo and Xing, Sen and Chen, Zhe and Wang, Yi and Li, Kunchang and Li, Yizhuo and Liu, Yi and Wang, Jiahao and Zheng, Yin-Dong and Huang, Bingkun and others},
  journal={arXiv preprint arXiv:2211.09529},
  year={2022}
}
```