tensorflow/datasets

Corrupt files in the `cats_vs_dogs` dataset

Closed this issue · 24 comments

Short description
I encountered this bug during my TensorFlow certification exam. When working with images from the dataset, the message `Corrupt JPEG data: 228 extraneous bytes before marker 0xd9` is printed again and again, and a single pass over the data takes forever. I couldn't complete my exam because of that.

Environment information

  • Operating System: Windows 10
  • Python version: 3.7.4
  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets version 3.1.0
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow-gpu version 2.2.0

Reproduction instructions
A very simple way to reproduce the bug:

import tensorflow_datasets as tfds

dataset_name = 'cats_vs_dogs'
dataset, info = tfds.load(name=dataset_name,
                          split=tfds.Split.TRAIN,
                          with_info=True)

for i in dataset:
    print(i)

Expected behavior
I expect to be able to iterate over all the images without getting errors and without it taking forever to complete a single iteration.

@tomergt45 I am unable to reproduce the bug.

As far as I can see, all corrupt images are removed already.

if num_skipped != _NUM_CORRUPT_IMAGES:

Also, printing is an I/O operation, so it will take a lot of time to print the arrays of 20000+ images.

The problem happens when you just iterate over the data (without printing):

>>> import tensorflow_datasets as tfds
>>> tfds.__version__
'3.1.0'
>>> dataset = tfds.load('cats_vs_dogs')
>>> dataset = [i for i in dataset['train']]
Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
Corrupt JPEG data: 228 extraneous bytes before marker 0xd9
Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Warning: unknown JFIF revision number 0.00
Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Corrupt JPEG data: 239 extraneous bytes before marker 0xd9

It also happens when you try to fit a model with this data.

@tomergt45 I am not able to reproduce the error either. @vijayphoenix is correct: all 1738 corrupt images were skipped, see this colab.

@Eshan-Agarwal It still happens to me. I tried updating tfds to version 3.2.0 but I still get the same messages. Any idea why, or how I can fix it? :/

full example:

>>> import tensorflow_datasets as tfds
2020-07-23 18:13:21.705749: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
>>> tfds.__version__
'3.2.0'
>>> dataset = tfds.load('cats_vs_dogs')
2020-07-23 18:13:46.057890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-23 18:13:46.157089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-07-23 18:13:46.168293: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 18:13:46.224450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 18:13:46.267127: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 18:13:46.289066: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 18:13:46.338939: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 18:13:46.368177: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 18:13:46.444874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 18:13:46.451455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 18:13:46.458140: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-23 18:13:46.490376: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x23027dcb440 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-23 18:13:46.497384: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-23 18:13:46.503795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-07-23 18:13:46.514591: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 18:13:46.519846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 18:13:46.525979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 18:13:46.532067: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 18:13:46.539066: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 18:13:46.545074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 18:13:46.551015: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 18:13:46.556295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 18:13:48.573072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-23 18:13:48.578039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-07-23 18:13:48.582122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-07-23 18:13:48.586650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8589 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-07-23 18:13:48.602605: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x230518534a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-23 18:13:48.609512: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
>>> dataset = [i for i in dataset['train']]
Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
Corrupt JPEG data: 228 extraneous bytes before marker 0xd9
Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Warning: unknown JFIF revision number 0.00
Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Corrupt JPEG data: 239 extraneous bytes before marker 0xd9

However, I can reproduce this issue on windows.

I encountered this bug during my TensorFlow certification exam yesterday.

It is possible that the code to auto-detect corrupted images does not work on Windows:

if tf.compat.as_bytes("JFIF") not in fobj.peek(10):

Or maybe there are additional images that are corrupted on Windows but work on Linux?

Unfortunately, I do not have access to any Windows computer, so I can't really debug this. If someone wants to help us investigate this, it would be great.
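For anyone who wants to try the header check outside TFDS, here is a minimal standalone sketch of the same idea. It is not the TFDS implementation: the helper names `has_jfif_marker` and `find_non_jfif` are made up for illustration, and it uses `read(10)` instead of `peek(10)`, because `BufferedReader.peek` may return more bytes than requested. In a JFIF file the identifier `JFIF` normally sits at byte offsets 6 to 9, right after the SOI marker, the APP0 marker and the segment length.

```python
from pathlib import Path


def has_jfif_marker(path, num_bytes=10):
    """Return True if b"JFIF" appears in the first `num_bytes` bytes of the file.

    Hypothetical helper for illustration. It mirrors the idea of the check
    quoted above, but reads at most `num_bytes` bytes instead of peeking.
    """
    with open(path, "rb") as fobj:  # binary mode matters, especially on Windows
        header = fobj.read(num_bytes)
    return b"JFIF" in header


def find_non_jfif(folder):
    """Return the sorted .jpg paths under `folder` that lack the JFIF identifier."""
    return sorted(p for p in Path(folder).rglob("*.jpg") if not has_jfif_marker(p))
```

Running `find_non_jfif` on the extracted image folder on both Windows and Linux and comparing the two lists might show whether the check itself behaves differently between platforms.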

@Conchylicultor I tried checking it out. I added some print calls in each function of the CatsVsDogs class and ran this code:

import tensorflow_datasets as tfds
dataset = tfds.load('cats_vs_dogs')
dataset = [i for i in dataset['train']]

Only the print in the _info function was called. I may be missing something, but perhaps the line you referenced earlier isn't being executed?

Edit: I'd like to point out I am not very familiar with how the TensorFlow Datasets API is structured.

@tomergt45 Thanks for looking into this.
The generation code is only executed once, the first time the dataset is generated. Afterward, the generated files are reused. To force the generation to run again, you can delete the existing generated files (in ~/tensorflow_datasets/cats_vs_dogs/).
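A small sketch of that cleanup step, assuming the default data directory `~/tensorflow_datasets` mentioned above. The helper name `remove_generated_dataset` is hypothetical, and the snippet only deletes the generated files of one dataset.

```python
import shutil
from pathlib import Path


def remove_generated_dataset(name, data_dir="~/tensorflow_datasets"):
    """Delete the generated files of one dataset so it is rebuilt on next load.

    Hypothetical helper; `data_dir` defaults to the location mentioned above.
    Returns True if a directory was removed, False if nothing was there.
    """
    target = Path(data_dir).expanduser() / name
    if not target.is_dir():
        return False
    shutil.rmtree(target)  # removes every generated version of this dataset
    return True
```

After `remove_generated_dataset("cats_vs_dogs")`, the next `tfds.load('cats_vs_dogs')` call should run the generation code again, so print calls added to the builder would be expected to show up.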

After investigating a bit, I managed to get the names of the corrupted images that were not skipped, using this code:

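import tensorflow_datasets as tfds
import py  # third-party package (pip install py)

dataset = tfds.load('cats_vs_dogs')

# Capture stdout/stderr at the file-descriptor level so that
# messages printed by native code are caught as well
capture = py.io.StdCaptureFD()

corrupt_images = []
for x in dataset['train']:
    # Read whatever was written to stderr since the previous element
    _, err = capture.readouterr()
    if err:
        corrupt_images.append(x['image/filename'].numpy().decode())

# Restore the original stdout/stderr before printing the result
capture.reset()
print(corrupt_images)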

which gave me the following output:

'PetImages\\Dog\\10880.jpg',
'PetImages\\Dog\\164.jpg',
'PetImages\\Cat\\11279.jpg',
'PetImages\\Dog\\11124.jpg',
'PetImages\\Dog\\621.jpg',
'PetImages\\Cat\\497.jpg',
'PetImages\\Cat\\8051.jpg',
'PetImages\\Dog\\6754.jpg',
'PetImages\\Dog\\3176.jpg',
'PetImages\\Cat\\9813.jpg',
'PetImages\\Cat\\10838.jpg',
'PetImages\\Dog\\4956.jpg'

Hope this helps.

EDIT: That's very weird, but every time you execute this code you get different file names. I'm not sure why.
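If installing `py` is not an option, the same file-descriptor-level capture can be sketched with the standard library only. The messages appear to be written by native code directly to file descriptor 2, so replacing `sys.stderr` in Python would likely not catch them. The context manager name `capture_fd_stderr` is made up for this example. Note that it only tells you that something was printed while the block ran; if the input pipeline decodes ahead of the element currently being yielded, a message may not belong to that element.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def capture_fd_stderr():
    """Temporarily redirect file descriptor 2 (stderr) into a temp file.

    Yields a dict whose "text" entry is filled in when the block exits.
    """
    captured = {"text": ""}
    saved_fd = os.dup(2)              # keep a handle on the real stderr
    with tempfile.TemporaryFile() as tmp:
        os.dup2(tmp.fileno(), 2)      # fd 2 now points at the temp file
        try:
            yield captured
        finally:
            os.dup2(saved_fd, 2)      # restore the real stderr
            os.close(saved_fd)
            tmp.seek(0)
            captured["text"] = tmp.read().decode(errors="replace")
```

A quick way to try it is to call `os.write(2, b"some message\n")` inside `with capture_fd_stderr() as cap:` and inspect `cap["text"]` after the block.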

It is very likely that this is caused by the following:
tf.io.gfile with python zipfile results in corruption of the data. (For some reason Windows only)
Similar issue #2539

For more info, see tensorflow/tensorflow#32975
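One way to probe this hypothesis is to let `zipfile` verify the CRC of every archive member while reading through the file object in question. Below is a minimal sketch with a hypothetical helper name. The idea would be to pass it once a file opened with the built-in `open(path, 'rb')` and once a file opened with `tf.io.gfile.GFile(path, 'rb')`, then compare the results. Whether this reproduces the Windows behaviour has not been verified here.

```python
import zipfile


def first_corrupt_member(fileobj):
    """Return the name of the first member whose CRC check fails, else None.

    `fileobj` is any seekable binary file object that holds a zip archive.
    """
    with zipfile.ZipFile(fileobj) as archive:
        # testzip() reads every member and checks its CRC and header
        return archive.testzip()
```

If both file objects give `None`, the archive reads cleanly through both paths and the corruption would have to come from somewhere else.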

neoh1 commented

I encountered this today while training a VGG model using
Cubbli/Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-126-generic x86_64)
tensorflow 2.3.0 (tensorflow-gpu)
tensorflow_datasets (4.2.0) (installed with pip)
python 3.7 in anaconda3 environment
dataset = tfds.load('cats_vs_dogs', split=tfds.Split.TRAIN, data_dir='data/').

224/727 [========>.....................] - ETA: 41s - loss: 0.6927 - accuracy: 0.5188Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
261/727 [=========>....................] - ETA: 38s - loss: 0.6925 - accuracy: 0.5247Warning: unknown JFIF revision number 0.00
273/727 [==========>...................] - ETA: 37s - loss: 0.6923 - accuracy: 0.5250Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
317/727 [============>.................] - ETA: 33s - loss: 0.6916 - accuracy: 0.5286Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
365/727 [==============>...............] - ETA: 29s - loss: 0.6907 - accuracy: 0.5312Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
366/727 [==============>...............] - ETA: 29s - loss: 0.6907 - accuracy: 0.5312Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
382/727 [==============>...............] - ETA: 28s - loss: 0.6905 - accuracy: 0.5332Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
541/727 [=====================>........] - ETA: 15s - loss: 0.6876 - accuracy: 0.5492Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
644/727 [=========================>....] - ETA: 6s - loss: 0.6851 - accuracy: 0.5580Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
661/727 [==========================>...] - ETA: 5s - loss: 0.6846 - accuracy: 0.5594Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
675/727 [==========================>...] - ETA: 4s - loss: 0.6841 - accuracy: 0.5607Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
711/727 [============================>.] - ETA: 1s - loss: 0.6834 - accuracy: 0.5625Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
719/727 [============================>.] - ETA: 0s - loss: 0.6832 - accuracy: 0.5624Corrupt JPEG data: 228 extraneous bytes before marker 0xd9

Similar problem:
Ubuntu 20.04.3 LTS 5.13.0-27-generic x86_64
python 3.9 anaconda environment
tensorflow 2.7.0 (installed with pip)
tensorflow_datasets 4.5.2 (installed with pip)

import tensorflow_datasets as tfds
dataset = tfds.load('cats_vs_dogs', split=tfds.Split.TRAIN)
list(dataset.as_numpy_iterator())

gives

2022-02-01 00:39:04.607453: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-01 00:39:05.039021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8267 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:42:00.0, compute capability: 7.5
Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Warning: unknown JFIF revision number 0.00
Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Corrupt JPEG data: 228 extraneous bytes before marker 0xd9

So, is there some solution to this?
I have the same problem on the latest Debian 11 and Windows 11 with the latest software.

Hi

Has no solution been found yet?

Same here, on Ubuntu 18.

The only thing that worked for me was using this software to filter the images: https://github.com/coderslagoon/BadPeggy

Same on macOS Ventura :(

2+ years later, this is still an issue.

I'm running the TensorFlow: Advanced Techniques Specialization Coursera Course 1 Week 4 quiz.

My env is

  • Ubuntu 20.04.4 LTS
  • Python 3.8.15
  • tensorflow 2.9.0
  • tensorflow-datasets 4.7.0

I encountered this today. Any solution?

I came across the same problem too. I had downloaded the dataset from Kaggle and tried running it on my local machine. But when I called model.fit() the training stopped with an error.

My solution was to write code that tries to open each file and removes the file if there is any error. Also, if the decoded image does not have 3 dimensions (height, width, and the red, green, blue channels), I remove the file as well. After running this code on the dataset I was able to train the model without any issues.

My code:

from pathlib import Path
from tensorflow.io import read_file
from tensorflow.image import decode_image

# data_dir is of type Path and points to the parent dir,
# which contains the directories 'Dog' and 'Cat'
data_dir = Path('PetImages')  # example location; adjust to your local copy

# run the same code for the dir 'Cat' to remove corrupt files
for image in sorted((data_dir/'Dog').glob('*')):
    try:
        img = read_file(str(image))
        img = decode_image(img)

        # JPEG, PNG and BMP files decode to a 3-D tensor; GIFs decode to 4-D
        if img.ndim != 3:
            print(f"[FILE_CORRUPT] {image.name} DELETED")
            image.unlink()

    except Exception as e:
        print(f"[ERR] {image.name}: {e} DELETED")
        image.unlink()

I have seen a similar error in JPEG reading functions of several libraries, not just tensorflow, so I think this is an error in the underlying image decoding library employed. You can get around this issue by re-encoding and writing the JPEG images. It's an expensive operation, but you should only need to do it once.

I manipulated the image removal function provided for the dataset. On my machine, this fixed the Corrupt JPEG error. Note also that my directory name is "data/cats_dogs", which is different than the default directory name.

import os
import tensorflow as tf
from tensorflow.io import read_file, write_file
from tensorflow.image import decode_image

should_rewrite_image = True # set to true if you are getting Corrupt Data error
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join('data/cats_dogs', folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        is_jfif = True
        should_remove = False
        
        with open(fpath, "rb") as fobj:
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
            
        try:
            img = read_file(fpath)
            if not tf.io.is_jpeg(img):
                should_remove = True
                
            img = decode_image(img)

            if img.ndim != 3:
                should_remove = True

        except Exception as e:
            should_remove = True
        
        if (not is_jfif) or should_remove:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)
        elif should_rewrite_image:
            tmp = tf.io.encode_jpeg(img)
            write_file(fpath, tmp)

print("Deleted %d images" % num_skipped)

Hope this helps others as a workaround.

Python 3.10:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import py
import tensorflow as tf
from tensorflow.io import read_file, write_file
from tensorflow.image import decode_jpeg

num_deleted = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("PetImages", folder_name)
    for index, fname in enumerate(os.listdir(folder_path)):
        capture = py.io.StdCaptureFD()
        fpath = os.path.join(folder_path, fname)
        is_jfif = True
        should_remove = False
        # the with-block closes the file and avoids a NameError if open() fails
        with open(fpath, "rb") as fobj:
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)

        img_d = None
        try:
            img = read_file(fpath)
            if not tf.io.is_jpeg(img):
                should_remove = True

            img_d = decode_jpeg(img)

            if img_d.ndim != 3:
                should_remove = True

        except Exception as e:
            print('ERROR', fpath, str(e))
            should_remove = True

        _, err = capture.reset()
        if err and 'Corrupt JPEG data' in err:
            should_remove = True
            print('ERROR', fpath, err)

        if not is_jfif or should_remove:
            num_deleted += 1
            # Delete corrupted image
            os.remove(fpath)

A simple working code:

from tensorflow.io import read_file
from tensorflow.image import decode_image
import glob
import os
data_dir = '/data/Cat/*.jpg'
for image in sorted(glob.glob(data_dir)):
    img = read_file(str(image))
    img = decode_image(img)
    # remove images whose third dimension (channels) is not 3
    if img.shape[2] != 3:
        print(image)
        os.remove(image)