CLIP (Radford et al., 2021) implemented from scratch in PyTorch
Trained on Flickr8k + Flickr30k for 200 epochs
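Training uses CLIP's symmetric contrastive loss over matching image-text pairs. Below is a minimal sketch of that loss; the tensor names and the fixed `temperature` value are illustrative assumptions, not code from this repository.

```python
# Minimal sketch of the symmetric contrastive (InfoNCE) loss CLIP is trained with.
# Names and the fixed temperature are illustrative, not taken from this repo.
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, temperature=0.07):
    # L2-normalize both embedding sets so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix scaled by the temperature; shape (N, N).
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text over rows, text-to-image over columns.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```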
Linear Classification on ImageNet1k (mini) Dataset
# e.g.,
python3 linear_classification.py \
    --ckpt_path="../clip_flickr.pth" \
    --data_dir="../imagenet-mini/" \
    --n_epochs=64 \
    --batch_size=128 \
    --n_cpus=4  # Optional
Top-5 accuracy on validation set: 5.8%
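For context, this evaluation trains a linear classifier on frozen CLIP image features. A minimal sketch under that assumption follows; `encode_image`, `embed_dim`, and the hyperparameters are placeholders rather than the actual interface of `linear_classification.py`.

```python
# Minimal sketch of a linear probe on frozen CLIP image features.
# `clip_model.encode_image` and `embed_dim` are assumed names, not this repo's API.
import torch
import torch.nn as nn

def train_linear_probe(clip_model, train_loader, embed_dim, n_classes=1000,
                       n_epochs=64, lr=1e-3, device="cuda"):
    clip_model.eval()  # The image encoder stays frozen.
    probe = nn.Linear(embed_dim, n_classes).to(device)
    optimizer = torch.optim.AdamW(probe.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for _ in range(n_epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = clip_model.encode_image(images)  # Frozen features.
            loss = criterion(probe(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return probe
```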
Zero-shot Classification on ImageNet1k (mini) Dataset
# e.g., (--n_cpus, --max_len and --k are optional)
python3 zero_shot_classification.py \
    --ckpt_path="../clip_flickr.pth" \
    --data_dir="../imagenet-mini/" \
    --batch_size=16 \
    --n_cpus=4 \
    --max_len=128 \
    --k=10
Top-10 accuracy on train + validation set: 3.0%
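Zero-shot classification builds one text embedding per ImageNet class from a prompt such as "a photo of a {class}" and ranks classes by cosine similarity with the image embedding. A minimal sketch under assumed `encode_image`, `encode_text`, and `tokenize` interfaces (not necessarily those of `zero_shot_classification.py`):

```python
# Minimal sketch of zero-shot top-k classification.
# `encode_image`, `encode_text`, and `tokenize` are assumed interfaces.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_topk(clip_model, tokenize, images, class_names, k=10, device="cuda"):
    # One prompt per class, encoded once and L2-normalized.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embeds = F.normalize(clip_model.encode_text(tokenize(prompts).to(device)), dim=-1)

    # Encode and normalize the image batch.
    image_embeds = F.normalize(clip_model.encode_image(images.to(device)), dim=-1)

    # Cosine similarity between every image and every class prompt; shape (B, n_classes).
    sims = image_embeds @ text_embeds.t()
    return sims.topk(k, dim=-1).indices  # Top-k predicted class indices per image.
```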
The temperature-related part of the paper has not been implemented:
"The learnable temperature parameter was clipped to prevent scaling the logits by more than 100 which we found necessary to prevent training instability."