mk-fg/image-deduplication-tool

Fails on ` in filename

aadis opened this issue · 7 comments

DEBUG:root:Processing path: ./inc/A BC`s0.gif
sh: -c: line 0: unexpected EOF while looking for matching' sh: -c: line 1: syntax error: unexpected end of file sh: -c: line 0: unexpected EOF while looking for matching'
sh: -c: line 1: syntax error: unexpected end of file

[CImg] *** CImgIOException *** [instance(0,0,0,0,0x0,non-shared)] CImg::load(): Failed to recognize format of file './inc/A BC`s0.gif'.
Traceback (most recent call last):
File "/Users/aaditya/bin/image_matcher.py", line 195, in
if name == 'main': main()
File "/Users/aaditya/bin/image_matcher.py", line 192, in main
if optz.reported_db is not None: optz.reported_db.sync()
AttributeError: 'bool' object has no attribute 'sync'

Two issues, actually - backtrace at the end belongs to the second one, which should now be fixed (as of c2299f7).

But the initial problem is mostly unrelated to that.
And images with backticks in names seem to work for me (though there are some limitations on filename characters if you use --feh option with default --feh-args). In fact, I just used the filename you had and it worked fine:

DEBUG:root:Processing path: images/052fc493a77b0228f05de9cc6906ad44-d2zdl2h.jpg
DEBUG:root:Processing path: images/image`02`.jpg
DEBUG:root:Processing path: images/A BC`s0.gif
DEBUG:root:Processing path: images/image`01.jpg
DEBUG:root:Calculating/sorting Hamming distances
images/image`02`.jpg images/image`01.jpg 28
images/A BC`s0.gif images/image`01.jpg 30
images/052fc493a77b0228f05de9cc6906ad44-d2zdl2h.jpg images/image`01.jpg 32
images/A BC`s0.gif images/052fc493a77b0228f05de9cc6906ad44-d2zdl2h.jpg 32
images/image`02`.jpg images/052fc493a77b0228f05de9cc6906ad44-d2zdl2h.jpg 32
images/image`02`.jpg images/A BC`s0.gif 32

What's more creepy is the "sh -c" running from somewhere within CImg and pHash invocaton.
During that "Processing path:" phase, only "libpHash.ph_dct_imagehash()" functions are being run, so it should be spawning from these, and looking at what it does there...

% strace -eexecve -f ./image_matcher.py . 2>&1 | grep /bin/sh
[pid 14075] execve("/bin/sh", ["sh", "-c", "convert \"./.gitignore\" pnm:-"], [/* 87 vars */]) = 0
[pid 14076] execve("/bin/sh", ["sh", "-c", "convert \"./image_matcher.db\" pnm"...], [/* 87 vars */]) = 0
[pid 14077] execve("/bin/sh", ["sh", "-c", "convert \"./image_matcher.db.bak\""...], [/* 87 vars */]Process 14078 attached
[pid 14078] execve("/bin/sh", ["sh", "-c", "convert \"./README.md\" pnm:-"], [/* 87 vars */]) = 0
[pid 14079] execve("/bin/sh", ["sh", "-c", "gm convert \"./README.md\" pnm:-"], [/* 87 vars */]) = 0
[pid 14080] execve("/bin/sh", ["sh", "-c", "gm convert \"./image_matcher.db.b"...], [/* 87 vars */]) = 0
[pid 14081] execve("/bin/sh", ["sh", "-c", "gm convert \"./image_matcher.db\" "...], [/* 87 vars */]) = 0

Yuck!

Apparently, it does that for the formats that aren't recognized by CImg/libpHash natively, so I guess one easy workaround would be to pre-filter images to make sure there're only valid formats.
That looks like a security risk, too - give it an non-image with rm -rf /* in the name and it might not end well.

But that said, I don't feel like poking and fixing either CImg (huge chunk of C++ in a single header file) or libpHash, so my advice would be for you to do something like "convert" (imagemagick tool) on all the files and only run the tool on those that were processed successfully, so that the aforementioned libs won't have to chew them further.

Guess I can merge a patch for an option that does such pre-screening, but I think it's out of scope of the tool and a wrong way to go - proper soluton would be to look into the libs and fix them, otherwise there's no telling when they'll choke next.

I'll add a warning to the README, but unless you have some better idea on how to deal with that, the issue should probably be closed as "wontfix".

I know, and I was horrified about invoking sh -c without proper quoting! Luckily I didn't any have filenames with rm * in them :)

I'll log some bugs with CImg and/or libpHash.

Perhaps a command line switch to ignore the files that are failing, by catching the exception?

Hm, indeed, I'm not sure how errors there are handled (and if checked at all), and will look into it, though I guess errors should be non-fatal by default.

I've noticed that libpHash produced "0" as hash for some images in the past, but don't remember looking much into why (as there were only few cases), while in fact it might be indication that it can't process the image.
Gotta look at the docs/code as well, after all.

Thanks for the pointer.

Interesting thing to note is that there's relatively new (~1 month old) release of libpHash that may as well not have any problems. I'm still using 0.9.4, which is like 2 years old, so there's a good chance issue might've been noticed.

EDIT:
Nope, libpHash has nothing to do with it.

Even the latest (1.5.3) CImg, on the other hand, has the code like this in several places:

      if (!cimg::strcasecmp(f_type,"pnm")) load_pnm(filename);
      else if (!cimg::strcasecmp(f_type,"pfm")) load_pfm(filename);
      else if (!cimg::strcasecmp(f_type,"bmp")) load_bmp(filename);
      else if (!cimg::strcasecmp(f_type,"jpg")) load_jpeg(filename);
      else if (!cimg::strcasecmp(f_type,"pan")) load_pandore(filename);
      else if (!cimg::strcasecmp(f_type,"png")) load_png(filename);
      else if (!cimg::strcasecmp(f_type,"tif")) load_tiff(filename);
      else if (!cimg::strcasecmp(f_type,"inr")) load_inr(filename);
      else if (!cimg::strcasecmp(f_type,"dcm")) load_medcon_external(filename);
      else throw CImgIOException("CImg<%s>::load()",
                                 pixel_type());
    } catch (CImgIOException&) {
      try {
        load_other(filename);
      } catch (CImgIOException&) {
        cimg::exception_mode() = omode;
        throw CImgIOException(_cimg_instance
                              "load(): Failed to recognize format of file '%s'.",
                              cimg_instance,
                              filename);
      }

And "load_other" does basically:

  try { load_magick(filename); }
  catch (CImgException&) {
    try { load_imagemagick_external(filename); }
    catch (CImgException&) {
      try { load_graphicsmagick_external(filename); }
      catch (CImgException&) {
        try { load_cimg(filename); }
        catch (CImgException&) {

SF tracker doesn't seem to have a Bug for it, too, which, as I've looked at the code already, guess I'll create, if you won't beat me to it. I think I'll propose adding something like and env-var there to control this fallback, but not yet sure if it's sane, gotta think on it some more.

Filed a ticket about the issue with CImg - https://sourceforge.net/p/cimg/bugs/49/ (just fyi).

Thanks!

Given current resolution of CImg issue in question, I guess there won't be any cleaner way to fix the problem w/o rebuilding pHash with new CImg.h, though I'll probably file a report to pHash to add ./configure option for pHash to disable shell fallback when it will eventually be closed.

As of aa7a272, script shouldn't stumble on files which cause errors and just print something like "Failed to get image hash ('images/A BC`s0.gif'): [ENOENT] No such file or directory" to stderr (along with CImg output there) and treat the file as unhashable and skip it's processing in the future if same --hash-db is used.

Thanks for reporting the thing (might save some trouble for CImg users, in particular).
I'll close the issue, as I don't think there's much more to be done here, but feel free to reopen it if you have more ideas on the matter.