KimRass/CLIP

PyTorch implementation of 'CLIP' (Radford et al., 2021) from scratch and training it on Flickr8k + Flickr30k


'CLIP' (Radford et al., 2021) implementation from scratch in PyTorch

Learning Transferable Visual Models From Natural Language Supervision

Pretrained Model

CLIP trained on Flickr8k + Flickr30k for 200 epochs
- clip_flickr.pth

Linear Classification on ImageNet1k (mini) Dataset

# e.g., (--n_cpus is optional)
python3 linear_classification.py \
    --ckpt_path="../clip_flickr.pth" \
    --data_dir="../imagenet-mini/" \
    --n_epochs=64 \
    --batch_size=128 \
    --n_cpus=4

Top-5 accuracy on validation set: 5.8%
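
For reference, top-k accuracy counts a sample as correct when its true class is among the k highest-scoring classes. The following is a minimal sketch of that metric; `topk_accuracy` is a hypothetical helper written for illustration and is not taken from this repository's scripts.

```python
import torch


def topk_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # Indices of the k largest logits per row, shape (batch, k).
    topk_idx = logits.topk(k, dim=1).indices
    # A sample is correct if any of its k predictions matches its label.
    correct = (topk_idx == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()


# Toy example: 3 samples, 4 classes (made-up scores, not real model output).
logits = torch.tensor([
    [0.1, 0.7, 0.2, 0.0],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
labels = torch.tensor([2, 3, 3])
acc = topk_accuracy(logits, labels, k=2)
```

In this toy batch the first and third samples have their label within the top 2 predictions, so `acc` is 2/3.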

Zero-shot Classification on ImageNet1k (mini) Dataset

# e.g., (--n_cpus, --max_len and --k are optional)
python3 zero_shot_classification.py \
    --ckpt_path="../clip_flickr.pth" \
    --data_dir="../imagenet-mini/" \
    --batch_size=16 \
    --n_cpus=4 \
    --max_len=128 \
    --k=10

Top-10 accuracy on train + validation set: 3.0%

Implementation Details

The temperature-related part of the paper is not implemented.
- "The learnable temperature parameter was clipped to prevent scaling the logits by more than 100 which we found necessary to prevent training instability."
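
What the quoted passage describes could look roughly like the sketch below. It is not part of this repository: `ClippedTemperature`, `log_scale`, and `max_scale` are hypothetical names, and the initial temperature of 0.07 is the value reported in the paper. The scale is stored in log space and clamped so that the cosine-similarity logits are never multiplied by more than 100.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ClippedTemperature(nn.Module):
    """Learnable logit scale (inverse temperature), stored in log space."""

    def __init__(self, init_temperature=0.07, max_scale=100.0):
        super().__init__()
        # Assumed initial value: temperature 0.07, i.e. scale = 1 / 0.07.
        self.log_scale = nn.Parameter(torch.tensor(math.log(1.0 / init_temperature)))
        self.max_log_scale = math.log(max_scale)

    def forward(self, img_embed, text_embed):
        # L2-normalize so the dot product is a cosine similarity in [-1, 1].
        img_embed = F.normalize(img_embed, dim=-1)
        text_embed = F.normalize(text_embed, dim=-1)
        # Clamp in log space so the multiplier never exceeds `max_scale`.
        scale = self.log_scale.clamp(max=self.max_log_scale).exp()
        # Image-to-text logits of shape (batch, batch).
        return scale * (img_embed @ text_embed.T)


torch.manual_seed(0)
temp = ClippedTemperature()
# Random stand-ins for a batch of 4 image and 4 text embeddings.
logits = temp(torch.randn(4, 8), torch.randn(4, 8))
```

Clamping inside `forward` is one design choice. An alternative is to clamp the parameter in place after each optimizer step, which also keeps the stored value itself within bounds.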