Sense-X/UniFormer

Spatiotemporal action detection

yan-ctrl opened this issue · 10 comments

Hello, thank you for your work. I would like to ask how to apply this work to the AVA dataset for spatiotemporal action detection.

Sorry, I have not run AVA. However, I think you can follow VideoMAE to run it: they forked AlphAction to run AVA. Just copy the model and reuse their repo!

Good, thank you for your recommendation, but I'm afraid I don't have enough GPUs to run VideoMAE.

Yes. My suggestion is that you copy the UniFormer model into that codebase and run it, just like plugging a backbone into MMDetection/MMSegmentation.
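
For illustration, this is the kind of pattern that comparison refers to: in MMDetection/MMSegmentation you register the copied backbone so a config can select it by name. The sketch below uses the MMDetection 2.x registry; `UniFormerBackbone` and its trivial stub body are placeholders for the ported UniFormer blocks, and AlphAction itself wires models up differently.

```python
# Hypothetical sketch of the MMDetection 2.x backbone-registration pattern.
# `UniFormerBackbone` is a placeholder; the real UniFormer blocks would be
# copied into the class body. AlphAction does not use this registry.
import torch.nn as nn
from mmdet.models.builder import BACKBONES


@BACKBONES.register_module()
class UniFormerBackbone(nn.Module):
    def __init__(self, in_channels=3, embed_dim=64):
        super().__init__()
        # Stand-in stem so the example runs; replace with the UniFormer stages.
        self.stem = nn.Conv2d(in_channels, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):
        # mm* detectors expect a tuple of (multi-scale) feature maps.
        return (self.stem(x),)
```

A config would then pick it up with `backbone=dict(type='UniFormerBackbone', embed_dim=64)`; the same copy-the-model idea applies when moving the backbone into the AlphAction-based repo.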

Oh, you mean let me pre-train the model in your work or VideoMAE, and then fine-tune my own model in the AlphAction library.

Yes. The above repo is based on AlphAction, and you can reuse their hyperparameters for transformer-based models. If you want to use UniFormer or other efficient backbones, you can port your model code into that repo as done here (you may need to add ROI pooling).
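
To make the "add ROI pooling" part concrete, here is a minimal, hedged sketch in PyTorch of a detector-style head on top of a video backbone: temporally pool the backbone features, extract per-person features with `roi_align` over the keyframe boxes, and classify each actor. The class names, feature stride, and tensor shapes are assumptions, not the actual AlphAction / VideoMAE-Action-Detection API.

```python
# Hypothetical sketch: ROIAlign pooling over person boxes on top of a video
# backbone, as an AVA-style action-detection head. Shapes, stride, and the
# backbone interface are illustrative assumptions, not the AlphAction API.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class VideoROIHead(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes, roi_size=7, spatial_stride=16):
        super().__init__()
        self.backbone = backbone                   # assumed to output [B, C, T', H', W']
        self.spatial_scale = 1.0 / spatial_stride  # maps input pixels -> feature-map coords
        self.roi_size = roi_size
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips, boxes):
        # clips: [B, 3, T, H, W]; boxes: list of per-clip person boxes, each [N_i, 4]
        # in (x1, y1, x2, y2) pixel coordinates of the keyframe.
        feats = self.backbone(clips)               # [B, C, T', H', W']
        feats = feats.mean(dim=2)                  # average over time -> [B, C, H', W']
        rois = roi_align(feats, boxes,
                         output_size=self.roi_size,
                         spatial_scale=self.spatial_scale,
                         aligned=True)             # [sum(N_i), C, roi, roi]
        rois = rois.mean(dim=(2, 3))               # global average pool -> [sum(N_i), C]
        return self.classifier(rois)               # per-person action logits
```

With the backbone swapped in this way, the rest of the AlphAction-based training recipe (sampling, box handling, schedules) can stay as it is.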

Thank you for your patience, but I still have questions. In https://github.com/MCG-NJU/VideoMAE-Action-Detection, although AlphAction is used, the provided pre-trained models are based on ViT and SlowFast is not used as the backbone network, so:

  1. Is the role of AlphAction just to provide the detection head? Do all models need to be based on ViT?

  2. Regarding VideoMAE-Action-Detection/modeling_finetune.py: if I use other backbones, how can I conduct MAE training?

  1. The repo is used for training an action detection model on top of Kinetics-pretrained backbones.
  2. Your original question was how to apply UniFormer to the AVA dataset. In my opinion, you can reuse their repo and add the UniFormer model. Why do you want to conduct MAE training?

Well, because I want to apply it to my own tasks, I need to build a custom dataset in the AVA format, and labeling is troublesome; I can't label a dataset as large as AVA, so I want to see whether self-supervised learning can help. As you said, I can pre-train the parameters of the backbone network on the Kinetics dataset and transfer them to the downstream task. But MAE uses ViT as its backbone, and your work is also a good backbone, so I asked whether AlphAction only plays the role of evaluating the MAE model on action detection (like the classification head of an image segmentation network), or whether, as you said, I can use the UniFormer model and then reuse the AlphAction repo.

Q: "So I asked you if AlphAction only plays the role of using motion detection to evaluate the MAE model."
A: AlphAction is a general codebase for training action detection models. It's not only used for the MAE model. You can use other models as backbones.
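
As a small illustration of that workflow, the sketch below initializes a backbone from a Kinetics-pretrained classification checkpoint before fine-tuning it for detection. The checkpoint layout, the `"model"` key, and the `head` prefix are assumptions; each repo ships its own loading utilities.

```python
# Hypothetical sketch: warm-start a detection backbone from a Kinetics-pretrained
# classification checkpoint. Key names and nesting are assumptions; use the
# loading helpers of whichever repo you adopt.
import torch


def load_kinetics_pretrain(backbone, ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)   # some checkpoints nest weights under "model"
    # Drop the classification head; the detector adds its own ROI head instead.
    state = {k: v for k, v in state.items() if not k.startswith("head")}
    missing, unexpected = backbone.load_state_dict(state, strict=False)
    print(f"missing: {missing}\nunexpected: {unexpected}")
    return backbone
```

After loading, the backbone is fine-tuned end to end on the AVA-format annotations together with the detection head.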

OK, thank you for your patience. I see.