huggingface/transformers

Regression in CLIPProcessor from 4.24.0 -> 4.25.0.dev0

patrickvonplaten opened this issue · 0 comments

System Info

  • transformers version: 4.24.0 / 4.25.0.dev0
  • Platform: Linux-5.18.10-76051810-generic-x86_64-with-glibc2.34
  • Python version: 3.9.7
  • Huggingface_hub version: 0.11.0.dev0
  • PyTorch version (GPU?): 1.11.0+cpu (False)
  • Tensorflow version (GPU?): 2.9.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.0 (cpu)
  • Jax version: 0.3.16
  • JaxLib version: 0.3.15
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@amyeroberts @sg

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

There seems to be a regression in CLIPProcessor between current main and 4.24.0.

You can reproduce it by running the following script on both current main (4.25.0.dev0) and 4.24.0 and comparing the outputs:

#!/usr/bin/env python3
from transformers import CLIPProcessor
import transformers
from PIL import Image
import numpy as np
import torchvision.transforms as tvtrans
import requests
from io import BytesIO

print(transformers.__version__)

# Download a sample image and resize it to 512x512.
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
image = image.resize([512, 512], resample=Image.Resampling.BICUBIC)

# Convert to a channels-first float tensor in [0, 1], then view it as a numpy array.
image = tvtrans.ToTensor()(image)
np_image = np.asarray(image)

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Process a batch of two identical images.
pixel_values = processor(images=2 * [np_image], return_tensors="pt").pixel_values

print(pixel_values.abs().sum())
print(pixel_values.abs().mean())

The outputs for the different versions are as follows:

4.24.0
tensor(287002.5000)
tensor(0.9533)
4.25.0.dev0
tensor(503418.8125)
tensor(1.6722)
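
For what it's worth, a quick way to narrow down where the divergence comes from is to process the original PIL image and the channels-first [0, 1] float array side by side. This is only a debugging sketch reusing the same image URL and checkpoint as above; if the two paths disagree on a given version, the handling of already-rescaled float arrays is the likely culprit:

#!/usr/bin/env python3
from io import BytesIO

import numpy as np
import requests
import torchvision.transforms as tvtrans
from PIL import Image
from transformers import CLIPProcessor

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
pil_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
pil_image = pil_image.resize([512, 512], resample=Image.Resampling.BICUBIC)

# Same channels-first float array in [0, 1] as in the repro above.
np_image = np.asarray(tvtrans.ToTensor()(pil_image))

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Compare the two input types with the same processor.
pil_out = processor(images=[pil_image], return_tensors="pt").pixel_values
np_out = processor(images=[np_image], return_tensors="pt").pixel_values

print("PIL input:   ", pil_out.abs().mean())
print("float input: ", np_out.abs().mean())
print("max abs diff:", (pil_out - np_out).abs().max())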

The code snippet above comes from debugging a problem that appears when updating transformers to main for https://github.com/SHI-Labs/Versatile-Diffusion. That pipeline currently only works with transformers==4.24.0; it gives random results with transformers==4.25.0.dev0.

Expected behavior

It seems like a bug was introduced after the 4.24.0 release; I would expect both versions to produce identical pixel_values for the same input. The code snippet above might look like a bit of an edge case, but I believe people have already started building all kinds of image processing pipelines on top of CLIP.
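
To make the expectation a bit more concrete, here is a rough, manual reference of what I'd expect the processed values to look like, assuming standard CLIP preprocessing (bicubic resize to 224, center crop, normalization with the mean/std I believe openai/clip-vit-large-patch14 uses). The resampling details won't match the processor exactly, so this is only a sanity check, not the processor's actual implementation:

#!/usr/bin/env python3
from io import BytesIO

import requests
import torchvision.transforms as tvtrans
from PIL import Image

# Assumed CLIP normalization constants for openai/clip-vit-large-patch14.
CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
CLIP_STD = [0.26862954, 0.26130258, 0.27577711]

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
pil_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
pil_image = pil_image.resize([512, 512], resample=Image.Resampling.BICUBIC)

# Reference preprocessing applied to a channels-first float tensor in [0, 1].
reference_transform = tvtrans.Compose([
    tvtrans.ToTensor(),
    tvtrans.Resize(224, interpolation=tvtrans.InterpolationMode.BICUBIC),
    tvtrans.CenterCrop(224),
    tvtrans.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])

reference = reference_transform(pil_image)
print(reference.abs().mean())  # I'd expect something close to the 4.24.0 mean above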