This is a PyTorch implementation of our ICCV 2023 paper "E2E-LOAD: End-to-End Long-form Online Action Detection".
This repo is built on PySlowFast. Please follow their guidelines to prepare the environment (a minimal, non-authoritative sketch is shown below).
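For convenience, here is one way such an environment might look; the environment name and package list are assumptions, and PySlowFast's INSTALL.md remains the authoritative reference.

```bash
# Hypothetical setup sketch (environment name and versions are assumptions;
# follow PySlowFast's INSTALL.md for the exact dependency list).
conda create -n e2e-load python=3.8 -y
conda activate e2e-load
pip install torch torchvision fvcore simplejson psutil opencv-python tensorboard
```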
- Extract video frames at 24 FPS (see the ffmpeg sketch after this list);
- For constructing the target files, we follow the method used in LSTR.
- The data should be organized according to the following structure. Please modify the dataset root path in the configuration file of the corresponding dataset.

```
$DATASET_ROOT
├── frames/
│   ├── video_test_0000004/ (6L images)
│   │   ├── img_00000.jpg
│   │   ├── ...
│   ├── ...
├── targets/
│   ├── video_test_0000004.npy (of size L x 22)
│   ├── ...
```
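As a reference for the extraction step, here is a hedged sketch for a single video; the input filename and JPEG quality flags are assumptions, and the shape check simply verifies the L x 22 target layout shown above (L target rows pairing with roughly 6L frames).

```bash
# Hypothetical example for one video (input path and quality are assumptions);
# frames are named img_%05d.jpg starting from 0 to match the layout above.
mkdir -p "$DATASET_ROOT/frames/video_test_0000004"
ffmpeg -i video_test_0000004.mp4 -vf fps=24 -q:v 2 -start_number 0 \
    "$DATASET_ROOT/frames/video_test_0000004/img_%05d.jpg"

# Sanity check: a target file of L rows should pair with roughly 6L frames.
python -c "import numpy as np; print(np.load('$DATASET_ROOT/targets/video_test_0000004.npy').shape)"
```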
- Before training, please download the pre-trained model from MViTv2.
The training commands are as follows:

```bash
REPO_PATH='YOUR_CODE_PATH'
export PYTHONPATH=$PYTHONPATH:$REPO_PATH
python tools/run_net.py --cfg configs/THUMOS/MVITv2_S_16x4.yaml
```
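If you need to deviate from the YAML, PySlowFast-style configs can usually be overridden from the command line. The keys below (TRAIN.CHECKPOINT_FILE_PATH, NUM_GPUS) are standard PySlowFast option names and the checkpoint path is an assumption, so verify the exact keys against this repo's config files.

```bash
# Hedged example: point training at the downloaded MViTv2 checkpoint and set
# the GPU count via PySlowFast-style command-line overrides (keys and paths
# are assumptions; check the repo's configs for the exact names).
python tools/run_net.py --cfg configs/THUMOS/MVITv2_S_16x4.yaml \
    TRAIN.CHECKPOINT_FILE_PATH checkpoints/mvitv2_s_16x4_k400.pyth \
    NUM_GPUS 4
```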
There are two evaluation modes in our code.
- To quickly validate the model's performance during training, all test videos are divided into non-overlapping segments, and all frames within each segment are predicted directly. Note that these results are not the final evaluation, since most frames do not see a sufficiently long history during inference. For more details, please refer to LSTR.
- After training is completed, you can perform online inference by selecting the best model based on the test-while-training results above. This mode processes the entire video frame by frame, which matches real-world applications; the results reported in our paper are obtained under this mode. Online testing can be configured through the provided configuration file:
```yaml
DEMO:
  ENABLE: True
  INPUT_VIDEO: ['video_validation_0000690'] # Only valid when ALL_TEST=False.
  CACHE_INFERENCE: True # Efficient inference.
  ALL_TEST: True # Test all videos or only one video.
```
The inference commands are as follows:

```bash
REPO_PATH='YOUR_CODE_PATH'
export PYTHONPATH=$PYTHONPATH:$REPO_PATH
python tools/run_net.py --cfg configs/THUMOS/MVITv2_S_16x4_stream.yaml
```
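To stream a single video instead of the whole test set, you can flip the DEMO options shown above. Whether these keys accept PySlowFast-style command-line overrides is an assumption; if the override form differs, set them in the YAML directly.

```bash
# Hedged example: run streaming inference on one video only, using the DEMO
# options from the config above (override syntax is assumed to follow
# PySlowFast; otherwise edit the YAML directly).
python tools/run_net.py --cfg configs/THUMOS/MVITv2_S_16x4_stream.yaml \
    DEMO.ALL_TEST False \
    DEMO.INPUT_VIDEO "['video_validation_0000690']"
```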
If you use the data/code/model provided here in a publication, please cite our paper:
```BibTeX
@article{cao2023e2e,
  title={E2E-LOAD: End-to-End Long-form Online Action Detection},
  author={Cao, Shuqiang and Luo, Weixin and Wang, Bairui and Zhang, Wei and Ma, Lin},
  journal={arXiv preprint arXiv:2306.07703},
  year={2023}
}
```