ssdeep_fuzzy_compare produces inconsistent results
mmaunder opened this issue · 2 comments
The following should produce 100 but produces 13:
select ssdeep_fuzzy_compare(ssdeep_fuzzy_hash('the quick brown fox jumped over the lazy dog'), ssdeep_fuzzy_hash('the quick brown fox jumped over the lazy dog'));
Shorter identical strings produce zero.
(Thanks for the great work by the way!)
Thanks for testing the code, Mark. I'm afraid this is a restriction of the upstream ssdeep library.
A similar bug report was filed against the PHP PECL extension I also wrote ( http://pecl.php.net/bugs/bug.php?id=20348 ), so I will include my reply to it here as an explanation:
This is a function of the ssdeep algorithm and not a bug. The author
of the ssdeep upstream package has previously made reference to the
algorithm becoming more accurate with content above 4KB in length.
Please try increasing the length of your sample text. Whilst investigating
your report I found that the following would return the result you
expect:
php > $hash =
ssdeep_fuzzy_hash('blahblahblahblahblahblahblahblahblahblahblahblah
blahblahfegfhgdhdghgdhgdshgsdhghgsdhgdshghsdhgdhgsjgdsjgdjgjgsgsghg
haasateytyuytkutdkusuht nmfbnzbnzbaerereyetywturyiuteutejbf
najetjhr gtjidahoadfh aiohjda hipdj hhadphjfgpahjapeghut9euhiotejhi
tjhe tjphjtejhgijdhkjhklghijst eih
eapsjhpjtephjtpjhptjpihjtihjidasfjh dhj dpasiojh poeatojh ohj
tpeojhpoaetjhoptejhoteajhotad
jhpoeatjhpotejhpoitejbjgji9rtsbiprpbjtaephetnhjetapihjpet eh peoaj
hpejpteajhegbmzcklhkghjgdj hhj thj teabnpteanmpaeotnmp[');
php > var_dump(ssdeep_fuzzy_compare($hash, $hash));
int(100)
Again, though, this extension is not intended to look for identical
strings, and you should be using SHA1 or MD5 hashes if you need to
ensure they are the same. If you want to get a similarity match,
then ssdeep is the right way to go.
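
To make that split concrete, here is a minimal sketch of how the two checks could be combined. It assumes the PECL ssdeep extension may or may not be loaded. The helper name compare_content() is hypothetical and not part of the extension. Treating an exact SHA1 match as a score of 100 is my own convention for the example, not something ssdeep does.

```php
<?php
// Hypothetical helper (not part of the ssdeep extension): answer the
// "are these identical?" question with SHA1, and only ask ssdeep for a
// similarity score when the inputs differ.
function compare_content(string $a, string $b): ?int
{
    // Exact identity: a cryptographic hash works at any input length.
    // In practice the two digests would usually be stored values.
    if (hash_equals(sha1($a), sha1($b))) {
        return 100; // assumed convention for "identical"
    }

    // Without the ssdeep extension no similarity score can be computed.
    if (!function_exists('ssdeep_fuzzy_compare')) {
        return null;
    }

    // Similarity: scores for short inputs can be low or zero, so treat
    // them with caution for anything much smaller than a few KB.
    return ssdeep_fuzzy_compare(ssdeep_fuzzy_hash($a), ssdeep_fuzzy_hash($b));
}

var_dump(compare_content(
    'the quick brown fox jumped over the lazy dog',
    'the quick brown fox jumped over the lazy dog'
));
```

With this arrangement the short identical strings from the original report are caught by the SHA1 check and never reach ssdeep, so the low score for short input no longer matters for the identity case.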
Thanks, this is very helpful. I wasn't aware ssdeep didn't like short strings.