treffynnon/lib_mysqludf_ssdeep

ssdeep_fuzzy_compare produces inconsistent results

mmaunder opened this issue · 2 comments

The following should produce 100 but produces 13:

select ssdeep_fuzzy_compare(ssdeep_fuzzy_hash('the quick brown fox jumped over the lazy dog'), ssdeep_fuzzy_hash('the quick brown fox jumped over the lazy dog'));

Shorter identical strings produce zero.

(Thanks for the great work by the way!)

Thanks for testing the code, Mark. This is a restriction of the upstream ssdeep library, I am afraid.

A similar bug report was filed against the PHP PECL extension I also wrote (http://pecl.php.net/bugs/bug.php?id=20348), so I will include the explanation from that report here:

This is a function of the ssdeep algorithm and not a bug. The author
of the ssdeep upstream package has previously made reference to the
algorithm becoming more accurate with content above 4KB in length.

Please try increasing the length of your sample text. Whilst investigating
your report I found that the following would return the result you
expect:

php > $hash = 
    ssdeep_fuzzy_hash('blahblahblahblahblahblahblahblahblahblahblahblah
        blahblahfegfhgdhdghgdhgdshgsdhghgsdhgdshghsdhgdhgsjgdsjgdjgjgsgsghg
        haasateytyuytkutdkusuht nmfbnzbnzbaerereyetywturyiuteutejbf 
        najetjhr gtjidahoadfh aiohjda hipdj hhadphjfgpahjapeghut9euhiotejhi 
        tjhe tjphjtejhgijdhkjhklghijst eih 
        eapsjhpjtephjtpjhptjpihjtihjidasfjh dhj dpasiojh poeatojh ohj 
        tpeojhpoaetjhoptejhoteajhotad 
        jhpoeatjhpotejhpoitejbjgji9rtsbiprpbjtaephetnhjetapihjpet eh peoaj 
        hpejpteajhegbmzcklhkghjgdj hhj thj teabnpteanmpaeotnmp[');

php > var_dump(ssdeep_fuzzy_compare($hash, $hash));                     

int(100)

Again, though, this extension is not intended to look for identical
strings, and you should be using SHA1 or MD5 hashes if you need to
ensure they are the same. If you want to get a similarity match,
then ssdeep is the right way to go.
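
To illustrate the exact-match advice above, here is a minimal sketch in Python using only the standard library's hashlib module. It does not use ssdeep or this extension at all, and the helper name same_content is hypothetical, chosen only for this example. It shows that a cryptographic digest comparison gives a reliable yes/no answer for identical input regardless of length, which is the case ssdeep is not designed for.

```python
import hashlib


def same_content(a: str, b: str) -> bool:
    """Return True when both strings have the same SHA-1 digest.

    Hypothetical helper for illustration; not part of ssdeep or
    lib_mysqludf_ssdeep.
    """
    digest_a = hashlib.sha1(a.encode("utf-8")).hexdigest()
    digest_b = hashlib.sha1(b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


text = "the quick brown fox jumped over the lazy dog"

# Identical short strings match exactly, however short they are.
print(same_content(text, text))

# Any change at all, even one character, makes the digests differ.
print(same_content(text, text + "s"))
```

A digest comparison like this only answers "same or not"; it says nothing about how similar two different inputs are, which is where a fuzzy hash such as ssdeep comes in.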

Thanks, this is very helpful. I wasn't aware ssdeep didn't like short strings.