JohannesBuchner/imagehash

[Question] Recommended KNN/ANN index for large datasets

misotrnka opened this issue · 7 comments

I would like to use CropResistantHash to quickly find near-duplicates from a large set of reference images.

With other hash functions I would normaly use some kind of approximate nearest neighbor index, such as NMSLib or Annoy. The challenge is that CropResistantHash is variable length and cannot be compared using one of the standard distance functions (Angular, Hamming, Manhattan, ...).

Can anyone point me to an alternative solution? How do you use this with large datasets?

I suppose you can store it in a database with a 1:N image-hash mapping, and then do a equality query? I think some databases support hamming distances etc.

It is probably more performant to limit the bits so that you can make a equal test rather than going for a nearest neighbour search.

I'm not sure if checking for equality would be sufficient for our use case, as we need certain level of precision.

I'm thinking of maybe using ANN to index individual region hashes and then use that to narrow down options for the full difference scan.

Still, can store it in a database table the 1:N image-hash (segment_hashes in cropresistanthash) mapping and search using databases supported functions like hamming distances which should be fast.

I can try that, but I believe that any distance function in a DB would rely on full sequential scan of the table. We are talking about hundreds of millions of rows here, so I think some kind of index is neccessary to narrow the options down a bit. But thank you for the idea, I'll explore it.

@misotrnka were you able to perform deduplication at 100M+ scale? I'm also trying to do something similar. If you could please share any pointers those would be valuable. I am thinking about inserting the perceptual hash in a db and doing a distinct select. We would be okay with certain loss of images in the process so long as it is not outrageously wrong.