Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods rely heavily on the assumption that the image-text pairs crawled from the Internet are in perfect one-to-one correspondence. In real scenarios, however, this assumption is hard to satisfy: the text description, obtained by crawling the metadata affiliated with the image, often suffers from semantic mismatch and mutual compatibility. To address these issues, we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels for each modality and aligns visual elements and linguistic elements hierarchically via peer-level semantics alignment and cross-level relation alignment. Furthermore, we soften the loss term for negative (unpaired) samples to relax the strict constraint during pre-training, thus mitigating the risk of forcing the model to distinguish compatible negative pairs. Experiments on five downstream tasks demonstrate the effectiveness of the proposed PyramidCLIP. In particular, with the same 15 million pre-training image-text pairs, PyramidCLIP exceeds CLIP in ImageNet zero-shot classification top-1 accuracy by 10.6%/13.2%/10.0% with ResNet50/ViT-B32/ViT-B16 image encoders, respectively. When scaled to larger datasets, PyramidCLIP achieves state-of-the-art results on several downstream tasks. Notably, PyramidCLIP-ResNet50 trained on 143M image-text pairs surpasses CLIP trained on 400M data on ImageNet zero-shot classification, significantly improving the data efficiency of CLIP.
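The softened treatment of negative pairs can be illustrated with a label-smoothed contrastive objective. The following is a minimal PyTorch sketch under that assumption; the smoothing factor `epsilon` and the exact formulation are illustrative and may differ from the loss actually used in the paper.

```python
import torch
import torch.nn.functional as F

def softened_contrastive_loss(image_emb, text_emb, temperature=0.07, epsilon=0.1):
    """CLIP-style contrastive loss with label smoothing on the targets.

    Instead of forcing each image to match only its paired text (one-hot
    targets), a small probability mass `epsilon` is spread over the other
    (negative) samples, weakening the penalty on compatible negatives.
    Illustrative sketch only; not the paper's exact formulation.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (N, N) similarity matrix

    n = logits.size(0)
    # Smoothed targets: 1 - epsilon on the diagonal (positives),
    # epsilon / (n - 1) spread over the off-diagonal (negatives).
    targets = torch.full_like(logits, epsilon / (n - 1))
    targets.fill_diagonal_(1.0 - epsilon)

    loss_i2t = torch.sum(-targets * F.log_softmax(logits, dim=1), dim=1).mean()
    loss_t2i = torch.sum(-targets * F.log_softmax(logits.t(), dim=1), dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```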
- 2022/11/15: PyramidCLIP models pre-trained on 143M image-text pairs are released.
- 2022/10/25: PyramidCLIP is selected for an Oral presentation at NeurIPS 2022.
- 2022/09/20: PyramidCLIP models pre-trained on 15M image-text pairs are released.
- 2022/09/15: PyramidCLIP is accepted by NeurIPS 2022.
Method | Dataset | Model | Epochs | ImageNet ZS Top-1 (%) | Weights |
---|---|---|---|---|---|
PyramidCLIP | TOTAL143M | ResNet50 | 32 | 61.4 | Google Drive |
PyramidCLIP | TOTAL143M | ViT-B32 | 32 | 62.5 | Google Drive |
PyramidCLIP | TOTAL143M | ViT-B16 | 32 | 66.9 | Google Drive |
PyramidCLIP | YFCC15M-V1 | ResNet50 | 32 | 43.8 | Google Drive |
PyramidCLIP | YFCC15M-V1 | ViT-B32 | 32 | 41.7 | Google Drive |
PyramidCLIP | YFCC15M-V1 | ViT-B16 | 32 | 45.9 | Google Drive |
PyramidCLIP | YFCC15M-V2 | ResNet50 | 32 | 47.8 | Google Drive |
PyramidCLIP | YFCC15M-V2 | ViT-B32 | 32 | 46.0 | Google Drive |
PyramidCLIP | YFCC15M-V2 | ViT-B16 | 32 | 50.7 | Google Drive |
For a zero-shot inference example, see demo.py.
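The demo follows the standard CLIP-style zero-shot pipeline. Below is a minimal sketch of that pipeline, assuming a model that exposes `encode_image`/`encode_text` together with matching `preprocess` and `tokenize` helpers; the actual loading code and interfaces used by this repository are in demo.py and may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(model, preprocess, tokenize, image, class_names,
                      template="a photo of a {}.", device="cuda"):
    """Classify a single image against a list of class names (illustrative sketch)."""
    image_input = preprocess(image).unsqueeze(0).to(device)            # (1, 3, H, W)
    text_input = tokenize([template.format(c) for c in class_names]).to(device)

    image_feat = F.normalize(model.encode_image(image_input), dim=-1)  # (1, D)
    text_feat = F.normalize(model.encode_text(text_input), dim=-1)     # (C, D)

    probs = (100.0 * image_feat @ text_feat.t()).softmax(dim=-1)       # (1, C)
    return class_names[probs.argmax(dim=-1).item()]
```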
- Download a checkpoint from the PyramidCLIP model zoo and put it into the pretrained_model/ directory.
- Copy the ImageNet validation data into /path/val/.
See test.sh for the full evaluation command. Install the extra dependencies and launch the evaluation:
pip install ftfy regex
# --visual_model options: RN50 | ViT-B-32 | ViT-B-16
python3 -um torch.distributed.launch --nnodes=$HOST_NUM --nproc_per_node=$HOST_GPU_NUM \
--node_rank=$INDEX --master_port=3111 --master_addr=$CHIEF_IP \
main.py \
--visual_model RN50 \
--batch_size_test 256 \
--test_dataset imagenet \
--test_data_path /path/val/ \
--precision fp32 \
--evaluate pretrained_model/RN50.pth.tar
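Under the hood, zero-shot ImageNet evaluation of a CLIP-style model amounts to building a text classifier from class-name prompts and measuring top-1 accuracy. A minimal sketch under that assumption is given below; the actual evaluation logic lives in main.py / test.sh, and all function names here are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(model, tokenize, loader, class_names,
                       templates=("a photo of a {}.",), device="cuda"):
    """Top-1 zero-shot accuracy of a CLIP-style model on a labelled loader (sketch)."""
    # Build one classifier weight per class by averaging prompt embeddings.
    weights = []
    for name in class_names:
        tokens = tokenize([t.format(name) for t in templates]).to(device)
        emb = F.normalize(model.encode_text(tokens), dim=-1).mean(dim=0)
        weights.append(F.normalize(emb, dim=-1))
    classifier = torch.stack(weights, dim=1)                  # (D, C)

    correct = total = 0
    for images, labels in loader:
        feats = F.normalize(model.encode_image(images.to(device)), dim=-1)
        preds = (feats @ classifier).argmax(dim=-1)
        correct += (preds.cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total
```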