ssdeep-project/ssdeep

Getting bucket values that build the hash

Sryborg opened this issue · 1 comments

Hello @jessek ,

I have been using LSH (TLSH, ssdeep) methods for malware detection for a while now.
I have hit a point where I need to do faster inference/search.

Is there a paper or document I can refer to that would help me convert a hash to its vector form? I need this in order to build an ANN index on ssdeep cluster centroids.
I've looked at a lot of resources on the web, but none of them help with extracting the actual bucket values. Most of them talk about using the actual ssdeep hash (plain text) to implement solutions similar to edit distance.

If you could point me to any helpful resources, I'd be really grateful. :)

a4lg commented

Sorry to break your first assumption, but ssdeep is not a bucket-based LSH.

Possibly the simplest explanation of an ssdeep hash is this: the entire file is split into pieces at positions chosen with a suitable probability, and each "piece" is hashed. The lower bound of the piece-splitting probability is determined from the file size, and the probability is increased until the final hash would be the "right" size or the upper bound of the probability is reached. Each piece contributes one "character" to the block hash component of the fuzzy hash (with a few exceptions).
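
To make the description above concrete, here is a minimal Python sketch of the piece-splitting idea. It is an illustration only and will not produce real ssdeep hashes: the rolling hash is a plain sum over a 7-byte window (an assumption for brevity, not ssdeep's actual rolling hash), the length cap on the block hash is omitted, and the names `toy_block_hash` and `toy_fuzzy_hash` are hypothetical.

```python
# Toy sketch of context-triggered piecewise hashing. NOT compatible with ssdeep.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7            # rolling window length, in bytes
MIN_BLOCK_SIZE = 3    # smallest block size (highest splitting probability)
TARGET_LEN = 64       # rough target for the number of pieces
PIECE_INIT = 0x28021967
PIECE_PRIME = 0x01000193


def toy_block_hash(data: bytes, block_size: int) -> str:
    """Split data into pieces and emit one character per piece."""
    out = []
    window = bytearray(WINDOW)
    rolling = 0               # toy rolling hash: sum of the last WINDOW bytes
    piece = PIECE_INIT        # hash of the current piece
    pending = False           # True while the current piece has unflushed bytes
    for i, byte in enumerate(data):
        piece = ((piece * PIECE_PRIME) & 0xFFFFFFFF) ^ byte
        rolling += byte - window[i % WINDOW]
        window[i % WINDOW] = byte
        pending = True
        # A piece boundary fires with probability of roughly 1 / block_size.
        if rolling % block_size == block_size - 1:
            out.append(B64[piece % 64])
            piece = PIECE_INIT
            pending = False
    if pending:               # the trailing piece also gets a character
        out.append(B64[piece % 64])
    return "".join(out)


def toy_fuzzy_hash(data: bytes) -> str:
    """Pick a block size from the input length, then refine it."""
    block_size = MIN_BLOCK_SIZE
    # Lower bound of the splitting probability, derived from the file size.
    while block_size * TARGET_LEN < len(data):
        block_size *= 2
    # Raise the probability (halve the block size) while the hash is too short.
    while True:
        h1 = toy_block_hash(data, block_size)
        if block_size > MIN_BLOCK_SIZE and len(h1) < TARGET_LEN // 2:
            block_size //= 2
            continue
        break
    h2 = toy_block_hash(data, block_size * 2)
    return f"{block_size}:{h1}:{h2}"
```

Note that the output is a variable-length string whose characters depend on where the content-defined boundaries fall, so there is no fixed set of buckets to read out as a vector. A local edit changes only the characters of the pieces near it, which is why comparison is done on the strings themselves.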

See Jesse's original paper and the documentation of my personal RIIR (Rewrite It In Rust) project of ssdeep, which describes some of the ssdeep internals:
https://docs.rs/ffuzzy/0.3.8/ssdeep/struct.FuzzyHashData.html#fuzzy-hash-internals