CountFormer: Multi-View Crowd Counting Transformer

Introduction

This repository is the official implementation of CountFormer, a concise 3D multi-view counting (MVC) framework designed for deployment in real-world environments.

We propose CountFormer, a multi-view counting (MVC) framework that is the first attempt to solve the 3D MVC problem in a way that fits real-world environments.
A feature lifting module and an MV volume aggregation module transform the multi-view image-level features, captured under arbitrary dynamic camera layouts, into a unified scene-level volume representation.
We present an effective strategy to embed the camera parameters into the image-level features and the volume query, which enables accurate and adaptable representations across diverse camera setups.

Framework of CountFormer. The Image Encoder extracts multi-view, multi-level (MVML) features from the multi-view images of the scene. The Image-Level Camera Embedding Module fuses the camera intrinsics and extrinsics with the MVML features. The Cross-View Attention Module transforms the image-level features into a scene-level volume representation. Besides these main components, a 2D Density Predictor estimates the image-space density, 3D Density Predictors regress the 3D scene-level density, and a simple feature pyramid network fuses the multi-scale voxel features.
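Relating a scene-level volume to image-level features relies on the camera intrinsics and extrinsics mentioned above. As a rough illustration only, the sketch below projects one 3D point (for example, a voxel center) into a camera image with a standard pinhole model. It is not taken from the CountFormer code: the function names `project_point` and `in_image`, the camera values, and the image size are all hypothetical, and the actual lifting in this repository is attention-based and may differ substantially.

```python
def project_point(point_world, rotation, translation, intrinsics):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    rotation is a 3x3 matrix, translation a 3-vector (world -> camera),
    and intrinsics a 3x3 matrix K. Returns (u, v, depth), or None if the
    point lies behind the camera.
    """
    # World -> camera coordinates: X_cam = R @ X_world + t
    cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    depth = cam[2]
    if depth <= 0:
        return None
    # Camera -> homogeneous pixel coordinates: x = K @ X_cam
    pix = [sum(intrinsics[i][j] * cam[j] for j in range(3)) for i in range(3)]
    # Perspective division gives the pixel location.
    return pix[0] / pix[2], pix[1] / pix[2], depth


def in_image(u, v, width, height):
    """Return True if the pixel (u, v) falls inside a width x height image."""
    return 0 <= u < width and 0 <= v < height


# Hypothetical camera: identity rotation, located at the world origin,
# focal length 500 px, principal point (320, 240), 640x480 image.
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 0]
K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]

voxel_center = (1.0, 0.5, 5.0)  # hypothetical voxel center in meters
u, v, depth = project_point(voxel_center, R, t, K)
print(u, v, depth, in_image(u, v, 640, 480))
```

A point that projects outside every view, or behind a camera, carries no image evidence from that view, which is one reason camera-aware handling matters when the camera layout changes between scenes.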

News

2024.07.08 The code of CountFormer is released on GitHub for research purposes.

2024.07.01 CountFormer has been accepted to ECCV 2024.

Results in paper

Comparisons with SOTA methods

Training Script

After preparation, the directory structure should look like this:

CountFormer
├── data
│   ├── cross_view
│   ├── citystreet
│   ├── ....
├── projects
│   ├── configs
│   ├── dataset
│   ├── modules
│   ├── registry
│   ├── ....
├── tools
├── README.md
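Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper `missing_dirs` and the `EXPECTED_DIRS` list are hypothetical and cover only the directories named explicitly in the tree (the entries elided with `....` are not checked).

```python
from pathlib import Path

# Directories named explicitly in the tree above (elided entries are skipped).
EXPECTED_DIRS = [
    "data/cross_view",
    "data/citystreet",
    "projects/configs",
    "projects/dataset",
    "projects/modules",
    "projects/registry",
    "tools",
]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the CountFormer root directory.
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```

Run it from the CountFormer root; any path it reports should be created or populated before starting the training script.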

sh tools/do_train.sh

Note that training CountFormer takes 3 days on 8x A100 GPUs (80GB).

Citation

If you find CountFormer useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@inproceedings{mo2024countformer,
  title={CountFormer: Multi-View Crowd Counting Transformer},
  author={Mo, Hong and Zhang, Xiong and Tan, Jianchao and Yang, Cheng and Gu, Qiong and Hang, Bo and Ren, Wenqi},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2024},
  organization={Springer},
}