victordomingos/optimize-images

Catch more exceptions when EXIF is broken


Hi,
Thanks for providing this useful tool; I'm using it to generate compressed images on NAS devices.

had_exif = True if piexif.load(t.src_path)['Exif'] else False

The 'piexif.load()' call you use to get EXIF data may raise many different exceptions when the EXIF header is broken (e.g. 'struct.error: unpack requires a buffer of n bytes'). The program needs to catch those, otherwise it will crash.
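For illustration, here is a minimal sketch of the kind of defensive handling being requested (assuming t.src_path is the image path, as in the snippet above):

import piexif

# piexif.load() may raise several exception types (struct.error,
# ValueError, ...) when the EXIF header is malformed, so a broad
# except falls back to treating the file as having no EXIF data.
try:
    had_exif = bool(piexif.load(t.src_path).get('Exif'))
except Exception:
    had_exif = False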

Hi! Thanks for your observations!

Nice to know that it is also useful for you. Please always keep in mind that the image optimization performed by this application is lossy/destructive. It was designed mainly to automate image compression for the web, and it is not intended to process original files in place.

Regarding the exceptions, do you have some image(s) that may suffer from such corruption that could be used for testing and further development? I tried to keep it simple and let it run silently, as much as possible, but there could be cases where it stops. I would like to better investigate those cases, which I may not have faced yet.

Hi,
I used it on various NAS/IoT devices for security and privacy control; the compression reduces the Internet I/O pressure. I can't give you example images due to privacy considerations, as some of the images are collected from medical instruments.

There are actually more exceptions that need to be caught. For example:

img = Image.open(t.src_path)

This may raise 'IOError: cannot identify image file' when the image is broken, which typically happens in data-streaming scenarios.
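A hedged sketch of how that case could be skipped instead of aborting the whole run (again assuming t.src_path as in the quoted code):

from PIL import Image

# Image.open() raises an IOError ("cannot identify image file") on
# corrupt or truncated input, so catch it and skip the broken file.
try:
    img = Image.open(t.src_path)
except IOError:
    print(f"Skipping unreadable image: {t.src_path}")
    img = None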

Another problem

# elif img.format.upper() == 'JPEG':

You do not check the format here, but rely on the extension check in search_images.

for img_path in search_images(src_path, recursive=recursive))

Sometimes TIFF or other images may be misnamed as *.jpg or *.png. Those files will be included in the tasks, and you never verify the actual format via Image.format in PIL.
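An illustrative sketch of the suggested check, trusting the decoded Image.format rather than the file extension (the optimize_jpeg/optimize_png helpers here are hypothetical):

from PIL import Image

# A TIFF renamed to *.jpg passes the extension check, but reports its
# real format here, so it can be dispatched or skipped correctly.
img = Image.open(img_path)
fmt = (img.format or '').upper()
if fmt == 'JPEG':
    optimize_jpeg(img)
elif fmt == 'PNG':
    optimize_png(img)
else:
    print(f"Skipping {img_path}: unexpected format {img.format!r}")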

As this package is under the MIT licence, I developed a special version to handle those problems and add more features:

  • I added a minimum size boundary; images under this size will not be compressed.
  • The number of workers can be specified, since cpu_count() may not return correct values on some devices.
  • Data is collected from I/O streams rather than from disk.
  • Temporary files are not saved to disk; instead, BytesIO is used to keep them in memory, since disk I/O on IoT devices is expensive (see the sketch after this list).
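A minimal sketch of that in-memory approach (the file name, format and quality values are illustrative, not the project's actual settings):

import io
from PIL import Image

# Write the optimized image into a BytesIO buffer instead of a
# temporary file, so no intermediate disk I/O takes place.
img = Image.open('photo.jpg')
buffer = io.BytesIO()
img.save(buffer, format='JPEG', quality=80, optimize=True)
optimized_bytes = buffer.getvalue()  # ready to compare or write out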

Thanks for the framework you provided. I need to manage hundreds of NAS and IoT devices on different platforms, and some well-known implementations cannot be compiled for all kinds of architectures. I had tried to develop a package of my own, but then I found this pure-Python implementation based on PIL. It really helps me!

Thanks very much for the feedback! I will try to look into those issues as soon as I can, but it could take some time, as I am currently in the middle of other activities.

By the way, in case you feel inclined to contribute back with some of your code, please feel free to submit a Pull Request.

Hello! I'm having the issue where it stops after a certain image with broken EXIF.


I need it to optimize a rather large (24 GB) images folder for the web.

Cheers,

José

I will try to issue an update soon that should catch some of those exceptions.

Meanwhile, it would really be helpful if someone could provide one of those problematic images (as long as it does not contain any sensitive data that should not be exposed), for testing purposes.

Sure! Can I have your email address, so I can send you the image privately?

Please send it to:

editor [dot] arcosonline [at] gmail [dot] com

Thanks!

Sent, thank you very much for your help!

@kikohernandez92 @lizequn I will be releasing v.1.3.1 with a quick fix for the piexif exception issue. For now, I decided to simply extend the previous behaviour of ignoring EXIF in case an exception is raised. I may take another look at this if the current fix is not enough and a better solution is found.

I also added a few "todo" annotations regarding the other potential issues mentioned above. Right now I am busier than I had expected, with less time to investigate these, but I intend to get back to them as soon as I can.

Meanwhile, in case someone feels ok with sharing some improvements, please feel free to submit a pull request. I promise to take a look ;-)

@lizequn Just a quick note to let you know that, in the latest commit (not yet available on PyPI), I have implemented in-memory buffers using BytesIO, instead of on-disk temporary files.

We also recently added a new CLI parameter to specify the number of simultaneous jobs.

Currently, it only saves the processed file if the space saving was at least 1%, but we may provide some parameterisation for this in a future release.
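For clarity, a hedged sketch of that threshold logic (the 1% figure comes from the comment above; src_path and optimized_bytes are illustrative names):

import os

MIN_SAVING = 0.01  # only replace the file if it is at least 1% smaller

original_size = os.path.getsize(src_path)
if original_size - len(optimized_bytes) >= original_size * MIN_SAVING:
    with open(src_path, 'wb') as f:
        f.write(optimized_bytes)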