huggingface/transformers

Regression in CLIPProcessor from 4.24.0 -> 4.25.0.dev0

patrickvonplaten opened this issue · 0 comments

System Info

  • transformers version: 4.24.0 / 4.25.0.dev0
  • Platform: Linux-5.18.10-76051810-generic-x86_64-with-glibc2.34
  • Python version: 3.9.7
  • Huggingface_hub version: 0.11.0.dev0
  • PyTorch version (GPU?): 1.11.0+cpu (False)
  • Tensorflow version (GPU?): 2.9.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.0 (cpu)
  • Jax version: 0.3.16
  • JaxLib version: 0.3.15
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@amyeroberts @sg

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

There seems to be a regression in CLIPProcessor between current main and 4.24.0.

You can reproduce it by running the following script on both current main (4.25.0.dev0) and 4.24.0 and comparing the outputs:

#!/usr/bin/env python3
from transformers import CLIPProcessor
import transformers
from PIL import Image
import numpy as np
import torchvision.transforms as tvtrans
import requests
from io import BytesIO

print(transformers.__version__)

# Download a sample image and resize it to 512x512.
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
image = image.resize([512, 512], resample=Image.Resampling.BICUBIC)

# Convert to a channels-first float tensor in [0, 1], then view it as a numpy array.
image = tvtrans.ToTensor()(image)
np_image = np.asarray(image)

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Process a batch of two identical images.
pixel_values = processor(images=2 * [np_image], return_tensors="pt").pixel_values

print(pixel_values.abs().sum())
print(pixel_values.abs().mean())

The outputs for the different versions are as follows:

4.24.0
tensor(287002.5000)
tensor(0.9533)
4.25.0.dev0
tensor(503418.8125)
tensor(1.6722)
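
For what it's worth, a quick way to narrow down where the divergence comes from is to process the original PIL image and the channels-first [0, 1] float array side by side. This is only a debugging sketch reusing the same image URL and checkpoint as above; if the two paths disagree on a given version, the handling of already-rescaled float arrays is the likely culprit:

#!/usr/bin/env python3
from io import BytesIO

import numpy as np
import requests
import torchvision.transforms as tvtrans
from PIL import Image
from transformers import CLIPProcessor

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
pil_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
pil_image = pil_image.resize([512, 512], resample=Image.Resampling.BICUBIC)

# Same channels-first float array in [0, 1] as in the repro above.
np_image = np.asarray(tvtrans.ToTensor()(pil_image))

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Compare the two input types with the same processor.
pil_out = processor(images=[pil_image], return_tensors="pt").pixel_values
np_out = processor(images=[np_image], return_tensors="pt").pixel_values

print("PIL input:   ", pil_out.abs().mean())
print("float input: ", np_out.abs().mean())
print("max abs diff:", (pil_out - np_out).abs().max())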

The code snippet above comes from debugging a problem that appears when updating transformers to main for https://github.com/SHI-Labs/Versatile-Diffusion. That pipeline currently only works with transformers==4.24.0; it gives random results with transformers==4.25.0.dev0.

Expected behavior

It seems like a bug was introduced after the 4.24.0 release; I would expect both versions to produce identical pixel_values for the same input. The code snippet above might look like a bit of an edge case, but I believe people have already started building all kinds of image processing pipelines on top of CLIP.
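
To make the expectation a bit more concrete, here is a rough, manual reference of what I'd expect the processed values to look like, assuming standard CLIP preprocessing (bicubic resize to 224, center crop, normalization with the mean/std I believe openai/clip-vit-large-patch14 uses). The resampling details won't match the processor exactly, so this is only a sanity check, not the processor's actual implementation:

#!/usr/bin/env python3
from io import BytesIO

import requests
import torchvision.transforms as tvtrans
from PIL import Image

# Assumed CLIP normalization constants for openai/clip-vit-large-patch14.
CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
CLIP_STD = [0.26862954, 0.26130258, 0.27577711]

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
pil_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
pil_image = pil_image.resize([512, 512], resample=Image.Resampling.BICUBIC)

# Reference preprocessing applied to a channels-first float tensor in [0, 1].
reference_transform = tvtrans.Compose([
    tvtrans.ToTensor(),
    tvtrans.Resize(224, interpolation=tvtrans.InterpolationMode.BICUBIC),
    tvtrans.CenterCrop(224),
    tvtrans.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])

reference = reference_transform(pil_image)
print(reference.abs().mean())  # I'd expect something close to the 4.24.0 mean above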