Object Detection with Self-Supervised Scene Adaptation

This repository is the official implementation of the paper:

Zekun Zhang, Minh Hoai, Object Detection with Self-Supervised Scene Adaptation, CVPR 2023.

[CVPR OpenAccess][PDF with Supplementary] [Poster] [Video Presentation] [Slides]

Scripts for downloading videos, frames, annotations, and models are also provided. If you find our paper or dataset useful, please cite:

@inproceedings{Zhang2023CVPR,
    author    = {Zhang, Zekun and Hoai, Minh},
    title     = {Object Detection With Self-Supervised Scene Adaptation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {21589-21599}
}

Overview of the proposed self-supervised scene adaptive object detection framework.

Network architecture of the proposed fusion faster-RCNN, which takes object masks as an additional input modality.

mosaic.webm

Some images from the proposed Scenes100 dataset with object bounding box annotations for evaluation.

1. Environment Setup

Your system needs an NVIDIA GPU with at least 20GB of VRAM to run the fusion model adaptation training with default settings. The GPU hardware and driver should support CUDA version 11.3 or later. This repository has been tested on Ubuntu 20.04 and 22.04. Older systems might not work. Your system also needs curl and unzip installed for the dataset downloading scripts to work. We recommend starting from a fresh Python environment and installing the required packages there to avoid incompatibility issues. For instance, you can create a new environment in Anaconda and switch to it:

conda create -n scenes100
conda deactivate && conda activate scenes100
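Since the download scripts rely on curl and unzip, a quick check that both are on your PATH can save a failed download later. The following is only a minimal sketch using the Python standard library; the helper name missing_tools is hypothetical and not part of this repository.

```python
import shutil


def missing_tools(tools=("curl", "unzip")):
    """Return the names in `tools` that cannot be found on PATH.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Please install before downloading:", ", ".join(missing))
else:
    print("curl and unzip are both available.")
```

This check only confirms that the executables can be found. It does not verify the GPU, driver, or CUDA requirements described above.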

Detectron2

Follow the instructions at Detectron2 Installation to install detectron2 v0.6. Other versions will not work! During this process some other required packages such as numpy, pytorch, torchvision, pillow, matplotlib, and pycocotools should also be installed as dependencies. Please verify that your detectron2 installation can work properly on your GPU before moving forward.

Other Packages

Install the following packages using your preferred package manager.

lmdb is used to read the pre-extracted images. Follow the official instructions.

imageio is used for reading and writing images. Follow the official instructions.

networkx is used for the graph-based pseudo bounding box refinement. Follow the official instructions.

imantics is used for converting between polygons and pixel masks. Follow the instructions in the official repository.

tqdm is used for displaying progress bars. Follow the instructions in the official repository.

2. Download Scenes100 Dataset

Now please clone the repository and switch to the root directory:

git clone git@github.com:cvlab-stonybrook/scenes100.git /your/path/scenes100
cd /your/path/scenes100

Conventions

Each video in Scenes100 has a unique 3-digit ID. Wherever ID is used in this repository, it refers to this video ID unless specified otherwise. The 100 IDs are: 001 003 005 006 007 008 009 011 012 013 014 015 016 017 019 020 023 025 027 034 036 039 040 043 044 046 048 049 050 051 053 054 055 056 058 059 060 066 067 068 069 070 071 073 074 075 076 077 080 085 086 087 088 090 091 092 093 094 095 098 099 105 108 110 112 114 115 116 117 118 125 127 128 129 130 131 132 135 136 141 146 148 149 150 152 154 156 158 159 160 161 164 167 169 170 171 172 175 178 179. Please note that the IDs are not consecutive, and they should be treated as strings rather than integers. The metadata of each video, including width, height, frame rate, and timestamp, is stored in the file scenes100/videos.json. All the image filenames in the downloaded dataset start with 8 digits representing the frame index. For instance, if the filename is 00187109.jpg and it is extracted from a 30 FPS video, the frame is at about 01:43:57 in the video (187109 / 30 ≈ 6237 seconds).
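The filename convention above can be turned into a timestamp with a few lines of Python. This is a minimal sketch, assuming the time offset is simply the frame index divided by the frame rate, rounded to the nearest second; the function name frame_to_timestamp is hypothetical and not part of this repository.

```python
def frame_to_timestamp(filename, fps):
    """Convert a frame filename such as '00187109.jpg' to HH:MM:SS.

    Assumption: the first 8 characters are the frame index, and the
    time offset is frame_index / fps, rounded to the nearest second.
    """
    frame_index = int(filename[:8])
    total_seconds = round(frame_index / fps)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# The example from the text: frame 187109 of a 30 FPS video.
print(frame_to_timestamp("00187109.jpg", 30))  # 01:43:57

# Video IDs keep their leading zeros only when handled as strings.
video_id = "001"
print(video_id, int(video_id))  # 001 1
```

In practice the frame rate would come from scenes100/videos.json rather than being hard-coded.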

The whole downloaded and extracted dataset is about 2.2TB in size. Make sure you have enough free space and a good Internet connection. We suggest first downloading the data for only a few videos to test that everything else works. When using the given commands to download the dataset, please pay attention to the screen output, as certain operations can fail due to network issues. If the downloaded file for a certain ID is corrupted, please re-run the command to download it again. Later, you can replace 001 or 003 in the following commands with all to download the whole dataset.

Annotated Frames and MSCOCO

You can download and extract the files of manually annotated validation images, pseudo labels from detection and tracking, and background model images:

python datasets.py --opt download --target annotation
python datasets.py --opt extract  --target annotation

No video ID is specified because these commands always download the data for all the videos.

You can download and extract the source domain data, which is the MSCOCO-2017 dataset, and the base models trained on it:

python datasets.py --opt download --target mscoco
python datasets.py --opt extract  --target mscoco

The images with the target objects inpainted are also downloaded. You can remove the ZIP files under scenes100 and mscoco to save some space after extraction.

Training Frames

We provide two options for preparing the training frames.

Option 1: You can download the original video files and run a decoder to extract the frames from them. You need to have ffmpeg and scikit-video installed. If you encounter package conflicts while installing or using ffmpeg, please use option 2 instead. For example, for video 001, run:

python datasets.py --opt download --target video --ids 001
python datasets.py --opt decode   --target image --ids 001

This option requires less data to be downloaded, but the decoding process can take quite some time. Also, due to implementation differences between systems, the decoded frames can be slightly different.

Option 2: You can download the frames already decoded by us. For example, for video 003, run:

python datasets.py --opt download --target image --ids 003
python datasets.py --opt extract  --target image --ids 003

We use an LMDB database to store the bundle of training images, but the images are still extracted as individual JPEG files for training. You can write your own LMDB-based dataloader for training. A minimal implementation is provided in scenes100/training_frames.py for your reference. This option requires about 3x as much data to be downloaded as option 1, but the extraction process is much faster than decoding. It also ensures that you are using exactly the same images as we do in our experiments.
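To test the setup on a few videos before fetching the whole dataset, the option 2 commands can be generated per video ID. The sketch below only builds and prints the commands (a dry run) and reuses exactly the flags shown above, one ID per invocation. The function name option2_commands is hypothetical, and running the commands is left to you, for example with subprocess.run(cmd, check=True) from the repository root.

```python
def option2_commands(video_ids):
    """Build the download and extract commands of option 2 for each video ID.

    Hypothetical helper for illustration. IDs are kept as strings so that
    leading zeros are preserved.
    """
    commands = []
    for video_id in video_ids:
        for opt in ("download", "extract"):
            commands.append([
                "python", "datasets.py",
                "--opt", opt,
                "--target", "image",
                "--ids", video_id,
            ])
    return commands


# Dry run: print the commands for two videos without executing them.
for cmd in option2_commands(["001", "003"]):
    print(" ".join(cmd))
```

Keeping each command as a list of arguments avoids shell quoting issues if you later pass it to subprocess.run.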

3. Run Experiments

Adaptation Training

To run the adaptation training, use the script train_adaptation.py. Please check the arguments help information on how to use it.

# train a vanilla faster-RCNN with R-101 backbone, with pseudo-labeling of R-101 and R-50 base models
python train_adaptation.py --id 001 --model r101-fpn-3x --ckpt mscoco/models/mscoco2017_remap_r101-fpn-3x.pth --anno_models r101-fpn-3x r50-fpn-3x --fusion vanilla --iters 4000 --eval_interval 501 --image_batch_size 4 --num_workers 4

# train a vanilla faster-RCNN with R-101 backbone, with pseudo-labeling of R-101 and R-50 base models, using location-aware mixup
python train_adaptation.py --id 001 --model r101-fpn-3x --ckpt mscoco/models/mscoco2017_remap_r101-fpn-3x.pth --anno_models r101-fpn-3x r50-fpn-3x --fusion vanilla --mixup 1 --iters 4000 --eval_interval 501 --image_batch_size 4 --num_workers 4

# train an early-fusion faster-RCNN with R-101 backbone, with pseudo-labeling of R-101 and R-50 base models
python train_adaptation.py --id 003 --model r101-fpn-3x --ckpt mscoco/models/mscoco2017_remap_wdiff_earlyfusion_r101-fpn-3x.pth --anno_models r101-fpn-3x r50-fpn-3x --fusion earlyfusion --iters 4000 --eval_interval 501 --image_batch_size 4 --num_workers 4

# train a mid-fusion faster-RCNN with R-101 backbone, with pseudo-labeling of R-101 and R-50 base models, using location-aware mixup (our best combination)
python train_adaptation.py --id 003 --model r101-fpn-3x --ckpt mscoco/models/mscoco2017_remap_wdiff_midfusion_r101-fpn-3x.pth --anno_models r101-fpn-3x r50-fpn-3x --fusion midfusion --mixup 1 --iters 4000 --eval_interval 501 --image_batch_size 4 --num_workers 4

By default, the resulting checkpoints will be saved to the current directory.

Evaluate Detection Performance

The average precision numbers computed during the training process are not accurate, because they do not take the non-annotation masks into account. To get accurate AP numbers for adapted models, run:

python evaluate_adaptation.py --opt single --id 001 --model r101-fpn-3x --fusion vanilla --ckpt adapt001_r101-fpn-3x_anno_train_001_refine_r101-fpn-3x_r50-fpn-3x.pth
python evaluate_adaptation.py --opt single --id 001 --model r101-fpn-3x --fusion vanilla --ckpt adapt001_r101-fpn-3x_anno_train_001_refine_r101-fpn-3x_r50-fpn-3x_mixup.pth
python evaluate_adaptation.py --opt single --id 003 --model r101-fpn-3x --fusion earlyfusion --ckpt adapt003_r101-fpn-3x_anno_train_003_refine_r101-fpn-3x_r50-fpn-3x_earlyfusion.pth
python evaluate_adaptation.py --opt single --id 003 --model r101-fpn-3x --fusion midfusion --ckpt adapt003_r101-fpn-3x_anno_train_003_refine_r101-fpn-3x_r50-fpn-3x_midfusion_mixup.pth
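Judging only from the four checkpoint filenames above, the names appear to encode the video ID, the model, the pseudo-labeling models, the fusion type, and whether mixup was used. The sketch below reproduces that pattern. It is an inference from these examples, not a documented naming rule of train_adaptation.py, and the function name guess_checkpoint_name is hypothetical.

```python
def guess_checkpoint_name(video_id, model, anno_models, fusion="vanilla", mixup=False):
    """Reconstruct a checkpoint filename in the pattern of the examples above.

    Assumption: inferred from the four example filenames only; the training
    script may name checkpoints differently for other argument combinations.
    """
    parts = [f"adapt{video_id}", model, "anno_train", video_id, "refine", *anno_models]
    if fusion != "vanilla":
        parts.append(fusion)  # no suffix appears in the vanilla examples
    if mixup:
        parts.append("mixup")
    return "_".join(parts) + ".pth"


print(guess_checkpoint_name("003", "r101-fpn-3x", ["r101-fpn-3x", "r50-fpn-3x"],
                            fusion="midfusion", mixup=True))
```

If the printed name does not match a file in your output directory, trust the file that the training script actually wrote.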

Compare with Base Models

To see how the adapted models perform compared to the base models, first evaluate the base model on all videos:

python evaluate_adaptation.py --opt base --model r101-fpn-3x --ckpt mscoco/models/mscoco2017_remap_r101-fpn-3x.pth --base_result_json results_base_r101-fpn-3x.json

The results will be saved in files results_base_r101-fpn-3x.json and results_base_r101-fpn-3x.pdf. Then put the adapted model checkpoints in a separate directory. For instance, you can download our best adapted models by:

mkdir -p trained_models/best_midfusion_mixup
curl --insecure https://vision.cs.stonybrook.edu/~zekun/scenes100/checkpoints/best_midfusion_mixup.zip --output trained_models/best_midfusion_mixup/best_midfusion_mixup.zip
unzip trained_models/best_midfusion_mixup/best_midfusion_mixup.zip -d trained_models/best_midfusion_mixup

We assume that all the checkpoint files in the directory are named in the format adaptXXX*.pth, where XXX is one of the video IDs. For each video ID there should be only 1 checkpoint present. To evaluate all the adapted models and compare them with the already evaluated base model, run:

python evaluate_adaptation.py --opt batch --model r101-fpn-3x --compare_ckpts_dir trained_models/best_midfusion_mixup --fusion midfusion --base_result_json results_base_r101-fpn-3x.json

Comparison results will be saved to the directory trained_models/best_midfusion_mixup. Due to variations of system and Python packages, the resulting AP gains can differ slightly from the numbers reported in the paper, but the difference should not be more than 0.05.
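Before running the batch evaluation, it can help to confirm that a checkpoint directory satisfies the naming assumption described above (adaptXXX*.pth, with at most one checkpoint per video ID). The following is a minimal sketch using only the standard library; the function name index_checkpoints is hypothetical and this check is not part of evaluate_adaptation.py.

```python
import re
from collections import defaultdict
from pathlib import Path


def index_checkpoints(ckpt_dir):
    """Map each 3-digit video ID to its single adaptXXX*.pth checkpoint.

    Hypothetical helper for illustration. Files that do not match the
    pattern are ignored; more than one match for an ID raises ValueError.
    """
    pattern = re.compile(r"^adapt(\d{3}).*\.pth$")
    found = defaultdict(list)
    for path in sorted(Path(ckpt_dir).iterdir()):
        match = pattern.match(path.name)
        if match:
            found[match.group(1)].append(path.name)
    duplicates = {vid: names for vid, names in found.items() if len(names) > 1}
    if duplicates:
        raise ValueError(f"more than one checkpoint for IDs: {sorted(duplicates)}")
    return {vid: names[0] for vid, names in found.items()}
```

For example, index_checkpoints("trained_models/best_midfusion_mixup") would return a dictionary keyed by video ID strings, which you can compare against the list of 100 IDs to spot missing checkpoints.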

4. (Optional) Background Extraction

To generate dynamic background images yourself instead of using the provided ones, you need to have opencv-python installed to use its image inpainting functions. Then run:

mkdir background_001
python extract_background.py --id 001 --outputdir background_001

The inpainted dynamic background images will be saved to background_001/inpaint as JPEG files.

5. (Optional) Generate Pseudo Labels

Please follow these steps if you want to generate pseudo bounding boxes yourself instead of using the provided ones.

PyTracking

Install PyTracking following the official repository. Please note that our code is tested with the commit 47d9c16. Other versions might cause issues. For our code to work, you only need to install ninja-build, jpeg4py, and visdom among the dependencies, and download the DiMP-50 model weights. The CUDA Toolkit needs to be installed for PyTracking to compile the ROIPooling modules. The code has been tested with version 11.3, but newer versions should all be compatible. If you encounter PyTorch C++ namespace issues, you might need to refer to this post and replace THCudaCheck with AT_CUDA_CHECK in the ROIPooling code.

Labeling

To run the pseudo labeling, use the script pseudo_label.py. Please check the arguments help information on how to use it. For instance, to detect the target objects in the training frames of video 003 using an R-101 base model, run:

python pseudo_label.py --opt detect --id 003 --model r101-fpn-3x --ckpt mscoco/models/mscoco2017_remap_r101-fpn-3x.pth

The file containing the pseudo detection bounding boxes is saved to the current directory by default, as a GZIP file.

Then you can run the single object tracker using the detected pseudo bounding boxes as initializations:

python pseudo_label.py --opt track --id 003 --detect_file 003_detect_r101-fpn-3x.json.gz --pytracking_dir /your/path/pytracking --cuda_dir /cuda/toolkit

The resulting GZIP file will also be saved to the current directory by default.
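The output filenames end in .json.gz, so the results can presumably be inspected as gzip-compressed JSON with the Python standard library. The sketch below makes no assumption about the structure of the stored data; the function name load_gzipped_json is hypothetical and not part of this repository.

```python
import gzip
import json


def load_gzipped_json(path):
    """Load a gzip-compressed JSON file and return the decoded object.

    Hypothetical helper for illustration; the layout of the pseudo label
    data inside the file is not assumed here.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        return json.load(fp)
```

For example, calling load_gzipped_json("003_detect_r101-fpn-3x.json.gz") and printing the type of the result (and its keys, if it is a dictionary) is a quick way to explore the file before writing code against it.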