This repository contains the code for MetaCLIP, described in the paper Demystifying CLIP Data, which formalizes CLIP data curation as a simple algorithm. The main contributions are:
- Curating data from scratch without filtering via prior models (unlike existing open-source efforts that use the original CLIP model as a teacher for filtering student data);
- Making training data more transparent: we release our training data distribution over metadata;
- A scalable algorithm running in the data pipeline, allowing us to scale the data pool to the whole CommonCrawl (CC) w/ 300+B image-text pairs. We observe that data quality is much more important than quantity (unlike existing open-source efforts or ALIGN, which mostly scale quantity);
- A standard CLIP training setup for controlled experiments and fair comparisons under fixed training and model configurations.
We conclude that:
- Effective pretraining data should maximally preserve signal and mitigate noise, rather than hard-removing noise with black-box filters that lead to an unknown data distribution;
- Our algorithm is simple and scalable enough to curate the whole Internet;
- Open-sourcing does not just entail a trained model checkpoint but, more importantly, the pre-training data distribution.
```bibtex
@article{xu2023metaclip,
   title={Demystifying CLIP Data},
   author={Hu Xu and Saining Xie and Xiaoqing Ellen Tan and Po-Yao Huang and Russell Howes and Vasu Sharma and Shang-Wen Li and Gargi Ghosh and Luke Zettlemoyer and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2309.16671},
   year={2023}
}
```
- 09/28/2023: initial release.
This code is developed with minimal changes on top of OpenCLIP. The following command should install the requirements for OpenCLIP and the `submitit=1.2.1` used by this repo:
```bash
conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \
    -c pytorch-nightly \
    -c nvidia \
    -c conda-forge \
    -c anaconda
```
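As a quick sanity check (our suggestion, not a step from the repo), the key dependencies should import cleanly inside the new environment:

```python
# Sanity check: these should import without error in the new environment.
import torch
import torchvision
import submitit

print(torch.__version__)
```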
MetaCLIP uses 500,000 queries as metadata to align the training data with the distribution of quality writing from Wikipedia/WordNet terms. This metadata also allows us to release the training data distribution of a released model as a data card.
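For a quick look at what this metadata contains, the sketch below (our illustration, assuming `metadata.json` has been downloaded from this repo, as in the curation example later in this README) loads it and prints a few entries:

```python
import json

# metadata.json holds the 500,000 curation queries
# (WordNet synsets and Wikipedia terms, per the paper).
with open("metadata.json") as f:
    metadata = json.load(f)

print(len(metadata))   # expected: 500000
print(metadata[:5])    # a handful of example query strings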
We change OpenCLIP to match training in the default CLIP model setup (w/ ViT-B-16-quickgelu, ViT-L-14-quickgelu and ViT-H-14-quickgelu). Most OpenCLIP models use `nn.GELU`, not the QuickGELU used by vanilla CLIP. We hope this helps research w/ controlled experiments in the "CLIP era of ImageNet".
```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu', pretrained='metaclip/b32_400m.pt')

image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # normalize embeddings before computing cosine similarities
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
| Model | Data Card | ImageNet Zero-Shot Acc. |
|---|---|---|
| MetaCLIP B32 400M | data card | 65.5 |
| MetaCLIP B16 400M | data card | 70.8 |
| MetaCLIP L14 400M | data card | 76.2 |
| MetaCLIP B32 FullCC2.5B | data card | 67.6 |
| MetaCLIP B16 FullCC2.5B | data card | 72.1 |
| MetaCLIP L14 FullCC2.5B | data card | 79.2 |
| MetaCLIP H14 FullCC2.5B | data card | 80.5 |
| MetaCLIP G14 FullCC2.5B | data card | ongoing |
We have a demo notebook to show how the proposed algorithm works.
CLIP curation can still help as online balancing (Table 6 in the paper). We wrap CLIP curation in two key functions: substring matching (recommended to run offline) and balancing (either offline or online; please check `metaclip.balancing:main`).
```python
import json
import numpy as np
from metaclip.substr_matching import substr_matching
from metaclip.balancing import balance_sampling

with open("metadata.json") as f:
    metadata = json.load(f)

# entry counts for our 1.6B (pool) -> 400M (curated); please check
# balance_sampling:main and substring-match and count on your own data.
with open("metaclip/entry_counts_400m.json") as f:
    entry_count_json = json.load(f)
entry_count = np.array([entry_count_json[entry] for entry in metadata], dtype=np.uint64)  # uint64 to be safe for scaling.

t = 20000
entry_count[entry_count < t] = t
entry_prob = t / entry_count

for text in ["jacksons chameleon", "battery plate"]:
    matched_entry_ids = substr_matching(text, metadata)
    if balance_sampling(matched_entry_ids, entry_prob):
        print(f"'{text}' curated")
```
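For intuition, balancing can be read as: each matched metadata entry independently gets a chance to keep the text, with head entries (count above t) down-weighted to probability t/count and tail entries kept with probability 1. A minimal sketch of that idea (our paraphrase, not necessarily the exact implementation in `metaclip.balancing`):

```python
import random

def balance_sampling_sketch(matched_entry_ids, entry_prob):
    # Keep the text if any matched entry samples it; entry_prob[i] is
    # t/count for head entries and 1.0 for tail entries (count <= t).
    for entry_id in matched_entry_ids:
        if random.random() < entry_prob[entry_id]:
            return True
    return False
```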
We release skeleton code for substring matching from CommonCrawl WAT or WARC and balancing. Check here for details.
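For orientation, here is a minimal sketch of the kind of extraction such skeleton code performs on a WARC file (our illustration, assuming the third-party `warcio` and `beautifulsoup4` packages; the actual skeleton code and record handling live in this repo):

```python
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def iter_image_alt_pairs(warc_path):
    """Yield (image url, alt text) candidates from one WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            for img in soup.find_all("img"):
                src, alt = img.get("src"), img.get("alt")
                if src and alt and alt.strip():
                    yield src, alt.strip()
```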
```bash
python submitit_openclip.py b32_400m
```
Please configure the corresponding `training_data` in `run_configs_400m.py`.
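As a purely hypothetical illustration (the shard path below is a placeholder, not the repo's actual config), OpenCLIP-style training typically points at brace-expanded webdataset shards:

```python
# Hypothetical placeholder: point training_data at your curated shards.
training_data = "/path/to/curated/shards/{0..9999}.tar"
```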
If you have any questions related to the code or the paper, feel free to email Hu Xu (huxu@meta.com).
Please cite our paper if MetaCLIP helps your work:
```bibtex
@article{xu2023metaclip,
   title={Demystifying CLIP Data},
   author={Hu Xu and Saining Xie and Xiaoqing Ellen Tan and Po-Yao Huang and Russell Howes and Vasu Sharma and Shang-Wen Li and Gargi Ghosh and Luke Zettlemoyer and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2309.16671},
   year={2023}
}
```
The training code is developed on top of OpenCLIP, modified to match the vanilla CLIP training setup.
- cross-json URL dedup in skeleton code;
- numpy implementation for matching and balancing;
- support online downloading;
- support vanilla CLIP API;
- Huggingface integration;
- (we welcome your use cases and suggestions, so we can update this codebase regularly)
The majority of MetaCLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under its own license (see https://github.com/mlfoundations/open_clip).