TBPS-CLIP

An Empirical Study of CLIP for Text-based Person Search

This repository contains the code for the AAAI 2024 paper An Empirical Study of CLIP for Text-based Person Search.

Environment

All experiments were conducted on 4 NVIDIA A40 (48 GB) GPUs with CUDA 11.7.

The required packages are listed in requirements.txt. You can install them using:

pip install -r requirements.txt
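
Optionally, the dependencies can be installed into an isolated environment first. A minimal sketch (the environment name below is arbitrary):

python -m venv tbps-clip-env
source tbps-clip-env/bin/activate
pip install -r requirements.txt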

Download

  1. Download the CUHK-PEDES dataset from here, the ICFG-PEDES dataset from here, and the RSTPReid dataset from here.
  2. Download the annotation JSON files from here.
  3. Download the pretrained CLIP checkpoint from here.
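
The repository does not prescribe a fixed directory layout. The tree below is only one illustrative way to organize the downloads; any locations work as long as the paths set in the configuration files (next section) point to them:

data/
├── CUHK-PEDES/      # CUHK-PEDES images
├── ICFG-PEDES/      # ICFG-PEDES images
├── RSTPReid/        # RSTPReid images
└── annotations/     # annotation JSON files
checkpoints/
└── clip/            # pretrained CLIP checkpoint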

Configuration

In config/config.yaml and config/s.config.yaml, set the paths to the annotation files, the image directory, and the pretrained CLIP checkpoint.

Training

You can launch distributed training with PyTorch's torchrun:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --rdzv_id=3 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 --nproc_per_node=4 \
main.py

Simplified TBPS-CLIP

You can run the simplified TBPS-CLIP by adding the --simplified flag:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --rdzv_id=3 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 --nproc_per_node=4 \
main.py --simplified
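
Both commands above assume 4 GPUs. With a different number of GPUs, the same invocation should work once CUDA_VISIBLE_DEVICES and --nproc_per_node are adjusted to match; for example, a single-GPU run would look like this (a sketch, assuming the training script does not hard-code the number of GPUs):

CUDA_VISIBLE_DEVICES=0 \
torchrun --rdzv_id=3 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 --nproc_per_node=1 \
main.py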

License

This code is distributed under the MIT license.