Hash mismatch with reference implementation
zhenjl opened this issue · 11 comments
Hi, thanks for the pure Go library for ssdeep. I was curious whether you have compared results with github.com/dutchcoders/gossdeep. I just ran a quick test and the results seem to be slightly different:
"12288:+AxbaNI5pVxkQw3iNjQzTgLNh7EMusK8NiftLV1rq:+HITCchFHtNiftvq" (expected)
"12288:+AxbaNI5pVxkQw3iNjQzTgLNh7EMusK8NiftLV1rqs:+HITCchFHtNiftvqs" (actual)
First one is from @dutchcoders, second is from yours.
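To make the difference easier to see, here is a minimal Go sketch that splits both strings into their `blocksize:hash1:hash2` fields and reports what the actual value carries beyond the expected one. The helper names `splitHash` and `extraSuffix` are made up for illustration and are not part of either library.

```go
package main

import (
	"fmt"
	"strings"
)

// splitHash breaks an ssdeep-style string "blocksize:hash1:hash2" into three fields.
// Hypothetical helper, not part of any ssdeep library.
func splitHash(s string) ([]string, error) {
	parts := strings.SplitN(s, ":", 3)
	if len(parts) != 3 {
		return nil, fmt.Errorf("expected 3 colon-separated fields, got %d", len(parts))
	}
	return parts, nil
}

// extraSuffix returns the characters actual has beyond expected.
// ok is false when expected is not a prefix of actual.
func extraSuffix(expected, actual string) (suffix string, ok bool) {
	if !strings.HasPrefix(actual, expected) {
		return "", false
	}
	return actual[len(expected):], true
}

func main() {
	expected := "12288:+AxbaNI5pVxkQw3iNjQzTgLNh7EMusK8NiftLV1rq:+HITCchFHtNiftvq"
	actual := "12288:+AxbaNI5pVxkQw3iNjQzTgLNh7EMusK8NiftLV1rqs:+HITCchFHtNiftvqs"

	e, err := splitHash(expected)
	if err != nil {
		fmt.Println("expected:", err)
		return
	}
	a, err := splitHash(actual)
	if err != nil {
		fmt.Println("actual:", err)
		return
	}

	fmt.Printf("block size equal: %v\n", e[0] == a[0])
	// Fields 1 and 2 are the two digests (block size b and 2*b).
	for i := 1; i < 3; i++ {
		suffix, ok := extraSuffix(e[i], a[i])
		fmt.Printf("hash %d: expected is prefix=%v, extra=%q\n", i, ok, suffix)
	}
}
```

For the two strings above, the block sizes match and each digest in the actual value is the expected digest plus one trailing "s".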
Hi, thanks for your interest. You have probably noticed that gossdeep is just a wrapper around the official implementation. I still have some issues in mine (as you have rightfully noticed) with handling some of the corner cases.
Any progress on this issue? Do you have an idea where the problem is?
Hi @davidt99, unfortunately I have not had time to look into this yet. How familiar are you with Go? I could give you some pointers to get you started investigating the problem.
To be clear, the dutchcoders implementation is a wrapper of the original library.
I just started writing in Go, but I feel comfortable enough with the language to try and fix the problem, so if it's not too much trouble, post the pointers and I will try to fix it.
The @dutchcoders implementation uses the "C" package (I guess there is no other option), which carries a performance penalty on every call. That's why I'm interested in using your implementation.
Yes, there are a bunch of downsides to using cgo: https://dave.cheney.net/2016/01/18/cgo-is-not-go
In https://github.com/glaslos/ssdeep/blob/master/ssdeep.go#L136 we call processByte
which appends one more character to the hash. You basically have to find out why the original implementation stops one character earlier.
You might have noticed when comparing the results that our implementation is appending some extra characters to the hash. Might be a good starting point to find the root cause.
This seems to be more significant with smaller files.
I think I fixed it, but I need to do more testing. Do you have some sort of control group that I can use?
I can help with that since I filed the original bug :)
I think I was able to reproduce the mismatch you described, but while I was trying to solve it, I found another mismatch (in a different location).
I want to be more sure that I fixed all the bugs, that's why I want more than just a few files to test with.
If you have a test group, please post it. The sample you first used is also a good start.