mrjgreen/db-sync

[Feature] Flag to change hash algorithm

Closed this issue · 5 comments

Currently the default hash algorithm for checking similarity seems to be MD5. Since MD5 and SHA1 have known collision flaws, I'm a bit worried about using MD5 as a diff hash strategy.

It would be nice if I could specify another hash algorithm in the CLI. Perhaps SHA-224 could be offered? CRC-32 also appears to be a poor choice for comparison.

Nice idea - not sure why I never included that as an option.

As a first step I'll put in a '--hash <crc32|md5|sha1>' flag, leaving md5 as the default to avoid any BC breaks.

I'm interested to discuss other algorithms further.

The algorithm only hashes small chunks of rows at a time (--block-size, default: 1024), so the chance of a data difference being masked by a hash collision should be small, even with MD5.
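To make the block-wise idea concrete, here is a minimal Python sketch of comparing two row sets one block at a time. It is not db-sync's actual implementation: the function names `block_digests` and `differing_blocks`, the row serialization, and doing the hashing in application code at all are assumptions made purely for illustration.

```python
import hashlib
import zlib
from itertools import zip_longest


def block_digests(rows, block_size=1024, algorithm="md5"):
    """Hash rows in consecutive blocks of block_size (hypothetical helper)."""
    digests = []
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        # Assumed serialization: unit separator between columns, newline between rows.
        payload = "\n".join(
            "\x1f".join(str(col) for col in row) for row in block
        ).encode("utf-8")
        if algorithm == "crc32":
            digests.append(format(zlib.crc32(payload), "08x"))
        else:
            # hashlib.new accepts names such as "md5" and "sha1".
            digests.append(hashlib.new(algorithm, payload).hexdigest())
    return digests


def differing_blocks(source, target, block_size=1024, algorithm="md5"):
    """Return the indexes of blocks whose digests differ between source and target."""
    src = block_digests(source, block_size, algorithm)
    tgt = block_digests(target, block_size, algorithm)
    return [
        index
        for index, (a, b) in enumerate(zip_longest(src, tgt))
        if a != b
    ]


if __name__ == "__main__":
    source = [(i, f"name{i}") for i in range(10)]
    target = list(source)
    target[7] = (7, "changed")
    # Only the block containing row 7 needs a row-by-row comparison.
    print(differing_blocks(source, target, block_size=4, algorithm="sha1"))
```

A collision only matters here when two blocks that really differ happen to hash to the same value, so a smaller block size or a wider hash both reduce the risk of a missed difference.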

I am certain that I have seen hash collisions with CRC32, so perhaps this algorithm should be discouraged, but on the other hand it's nice and fast... so for big tables comparing just a few fields it could be useful.

Happy to include extra algorithms - fancy submitting a PR with your recommendations?

'--hash (-H) <crc32|md5|sha1>' opt added in a058245

Actually, my recommendation is to select a hashing algorithm that fits the selected --block-size and the similarities of the data one wants to sync.

There is no single best algorithm, but it is possible to calculate the collision probability. As for CRC32, "plumless" and "buckeroo" supposedly give the same hash. http://preshing.com/20110504/hash-collision-probabilities/
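As a rough illustration of that calculation, the sketch below uses the standard birthday approximation, p ≈ 1 - exp(-k(k-1) / 2N), for k hashes drawn from N = 2^bits possible values. It assumes the hash output is uniformly distributed, which is an idealization, and the function name `birthday_collision_probability` is made up for this example. It also checks the claimed CRC32 pair with Python's `zlib.crc32` rather than taking it on faith.

```python
import math
import zlib


def birthday_collision_probability(num_hashes: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_hashes uniform hashes collide."""
    space = 2 ** hash_bits
    exponent = -num_hashes * (num_hashes - 1) / (2 * space)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is very close to 0.
    return -math.expm1(exponent)


if __name__ == "__main__":
    blocks = 1_000_000  # hypothetical number of hashed blocks
    for name, bits in (("crc32", 32), ("md5", 128), ("sha1", 160)):
        p = birthday_collision_probability(blocks, bits)
        print(f"{name}: {p:.3e}")

    # Check whether the two words mentioned above really share a CRC32 value.
    same = zlib.crc32(b"plumless") == zlib.crc32(b"buckeroo")
    print("plumless/buckeroo collide under CRC32:", same)
```

Note that the birthday figure covers a collision between any two hashes in the set. When only the source and target hash of the same block are compared, the chance that one differing block goes unnoticed is closer to 1/N under the same uniformity assumption.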

I have a bunch of columns with only default values. I should probably exclude them from my hashing via --ignore-comparison. I think you have all the tools to do a correct synchronization.

If I find a magic algorithm, I will update the issue but for now - everything looks good 👍

@mrjgreen Can you please publish a new release with the -H flag? I would like to use composer on the production server and not git clone your project.

Thanks!

Sure! I've published release v3.3.0 which includes the new --hash option

https://github.com/mrjgreen/db-sync/releases/tag/v3.3.0