Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

This repository is the official implementation of Side4Video, which significantly reduces the training memory cost for action recognition and text-video retrieval tasks.

📰 News

Feb 28, 2024. We released our code for Action Recognition and Text-Video Retrieval.
Nov 28, 2023. We released our paper on arXiv.

🗺️ Overview

🚀 Training and Testing

For training and testing our model, please refer to the Recognition and Retrieval folders.

📊 Results

Our best model achieves an accuracy of 67.3% and 74.6% on Something-Something V1 and V2, respectively, and 88.6% on Kinetics-400. For text-video retrieval, it achieves a Recall@1 of 52.3% on MSR-VTT, 56.1% on MSVD, and 68.8% on VATEX.
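For readers unfamiliar with the retrieval metric, the sketch below shows one common way to compute Recall@K from a text-to-video similarity matrix. It is an illustration only, not the evaluation code from the Retrieval folder: the function name `recall_at_k`, the assumption that text `i` matches video `i`, and the toy scores are all hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose matching video ranks in the top k.

    Assumes similarity[i][j] scores text i against video j, and that the
    ground-truth match for text i is video i (a simplifying assumption).
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Video indices sorted by descending similarity to text i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 example with made-up scores (not results from the paper).
toy_scores = [
    [0.9, 0.1, 0.3],  # text 0: correct video 0 ranks first
    [0.2, 0.4, 0.8],  # text 1: video 2 outranks the correct video 1
    [0.1, 0.2, 0.7],  # text 2: correct video 2 ranks first
]
r1 = recall_at_k(toy_scores, k=1)
r2 = recall_at_k(toy_scores, k=2)
```

In this toy case two of the three queries retrieve their own video first, so Recall@1 is 2/3, while Recall@2 is 1.0 because the remaining query finds its video in second place.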

🖇️ Citation

If you find this repository useful, please star🌟 this repo and cite🖇️ our paper.

@article{yao2023side4video,
  title={Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning},
  author={Yao, Huanjin and Wu, Wenhao and Li, Zhiheng},
  journal={arXiv preprint arXiv:2311.15769},
  year={2023}
}

👍 Acknowledgment

Our implementation is mainly based on the following codebases. We are sincerely grateful for their work.

Text4Vis: Revisiting Classifier: Transferring Vision-Language Models for Video Recognition.
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.

📧 Contact

If you have any questions about this repository, please file an issue or contact Huanjin Yao or Wenhao Wu.
