Official PyTorch implementation of the CVPR 2022 oral paper *Disentangling visual and written concepts in CLIP* (project website).
- added the projection training script
- uploaded pre-trained models
- added the text-to-image generation notebook
- released the Typographic Attack Dataset
The models used for text-to-image generation can be found here.
To use the pre-trained models:
```python
import numpy as np
import torch

# Load the learned projection matrix W from a downloaded checkpoint
# (clip_subspace_path is the path to the pre-trained model file).
W = torch.load(clip_subspace_path)['W'].data.cpu().numpy()

def project(x, W):
    # Project the CLIP features onto the subspace and L2-normalize each row.
    x_projected = x @ W
    x_projected /= np.linalg.norm(x_projected, axis=1)[:, None]
    return x_projected
```
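For example, features from OpenAI's CLIP model can be projected directly. A minimal usage sketch, reusing `project` and `W` from above and assuming the `clip` package is installed (`example.jpg` is a placeholder image, not part of the repository):

```python
import clip
import torch
from PIL import Image

# Encode an image with CLIP, then project its features onto the subspace.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image).float().cpu().numpy()

projected = project(features, W)  # rows are unit-norm projected features
```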
For generating images we refer to Katherine Crowson's notebook: https://colab.research.google.com/drive/15UwYDsnNeldJFHJ9NdgYBYeo6xPmSelP.
To generate images using the projection matrices, we compute the objective function on the projected features instead of the raw CLIP feature vectors.
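A minimal sketch of this idea (not the notebook's exact code; `W` is assumed here to be the projection matrix as a torch tensor):

```python
import torch

def projected_clip_objective(image_features, text_features, W):
    # Sketch: project both CLIP embeddings with W, renormalize, and score
    # with cosine similarity, replacing the raw-feature objective.
    img = torch.nn.functional.normalize(image_features @ W, dim=-1)
    txt = torch.nn.functional.normalize(text_features @ W, dim=-1)
    # Negative cosine similarity so that lower values mean a better match.
    return -(img * txt).sum(dim=-1).mean()
```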
We plan to provide a modified notebook to showcase this application. OCR detection was done using the easyocr toolbox.
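For reference, a minimal easyocr sketch (the image path is a placeholder):

```python
import easyocr

# Detect text in an image; readtext returns (bounding box, text, confidence) tuples.
reader = easyocr.Reader(['en'])
results = reader.readtext('example.jpg')
for bbox, text, confidence in results:
    print(text, confidence)
```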
We collected 180 images of 20 objects with 8 typographic attack labels.
The dataset is available at https://drive.google.com/drive/folders/17v74pmQs2zHrs9sJ_8D0bfrz5ykB5ZW2?usp=sharing.
The dataset includes an annotation file, `annotation.json`, which maps each image file name to its labels, e.g.:

```json
{"IMG_2934.JPG": {"true object": "cup", "typographic attack label": "good"}}
```

The dataset is released under the MIT license.
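A minimal sketch for reading the annotations, assuming the file is a single JSON object keyed by file name as in the example above:

```python
import json

# Load the annotation file and iterate over image/label pairs.
with open('annotation.json') as f:
    annotations = json.load(f)

for file_name, labels in annotations.items():
    print(file_name, labels['true object'], labels['typographic attack label'])
```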
If you use this data, please cite the following paper:
```bibtex
@inproceedings{materzynskadisentangling,
  author    = {Joanna Materzynska and Antonio Torralba and David Bau},
  title     = {Disentangling visual and written concepts in CLIP},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2022}
}
```