c3rb3ru5d3d53c/binlex

tlsh fuzzy hashing

c3rb3ru5d3d53c opened this issue · 14 comments

Include Fuzzy Hashing

This should allow easy comparison of byte strings and trait strings for similarity.

libtlsh-dev package should do it :)

Add method and code to common.c. Interface?

This is now merged, @herrcore and @pisco-sour, I'm going to swap this issue to you guys for implementation. 😄

raw.cpp, pe.cpp and blelf.cpp need the functions ReadBuffer and ReadFile.

They all need to produce the json keys file_sha256, file_tlsh.

Include tlsh in cmake build?

Including the file-based data in the traits output is a nightmare. The decompiler outputs the traits but it does not know anything about the file; only sections are known to it. Introducing another global variable is a recipe for disaster IMHO.
Solutions could be:

  1. Decompiler is initialised with a reference to a file, so it will always know where it started. Then, of course, a decompiler should not be reused for another file. Is this ever done?
  2. Make methods with which to set file information in the decompiler, like set_tlsh, set_sha256, etc. They could be used if they were provided.
  3. Call WriteTraits with a reference to the file and take the information from this reference.
  4. Reorganise code completely to have a file class containing the raw data, hashes, etc. The decompiler is initialised with this file class and can take its information from there. This could even be a "has a" relation.

I would favour the last solution, but this would introduce some major code changes and the decompilers would have to be rewritten in large parts. As we do not have that much time, I do not think this is feasible. Hence I would suggest solution 3: it can be done with little effort, does not break anything and kinda makes sense.
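To make solution 3 concrete, here is a minimal sketch of WriteTraits taking a reference to the file. `FileInfo`, `AddTrait`, `RenderTraits` and the `bytes` key are hypothetical names used only for illustration; the real binlex classes and trait layout may differ. Only the keys file_sha256 and file_tlsh come from this thread.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical file-level metadata; binlex's real classes may differ.
struct FileInfo {
    std::string sha256;  // hex digest computed when the file is read
    std::string tlsh;    // TLSH digest computed when the file is read
};

// Hypothetical stand-in for a decompiler that only knows about sections.
class Decompiler {
public:
    void AddTrait(const std::string &bytes) { traits_.push_back(bytes); }

    // Solution 3: the caller passes the file in, so no global state is needed.
    // Values are assumed to be hex strings, so no JSON escaping is done here.
    void WriteTraits(std::ostream &out, const FileInfo &file) const {
        for (const auto &bytes : traits_) {
            out << "{\"bytes\":\"" << bytes << "\","
                << "\"file_sha256\":\"" << file.sha256 << "\","
                << "\"file_tlsh\":\"" << file.tlsh << "\"}\n";
        }
    }

private:
    std::vector<std::string> traits_;
};

// Convenience wrapper that renders the traits into a string.
std::string RenderTraits(const Decompiler &d, const FileInfo &file) {
    std::ostringstream out;
    d.WriteTraits(out, file);
    return out.str();
}
```

The decompiler itself stays unaware of the file, so it could in principle be pointed at another file's sections without carrying stale hashes.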

Decompiler handles multiple types of files.
Yeah, I agree the decompiler should not be reused by another file.

One approach I was thinking about was to have only one extra function, used only by the Python binding: one that sets the command line arguments. Then the Python binding code just runs one function which contains what main() has, without the parameter processing. This way both the CLI and the Python binding have a similar way of interacting.
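A rough sketch of that idea, assuming a hypothetical `App` class with `SetArgs`, `ParseArgs` and `Run`; the `-i` and `-m` flags and the `Args` fields are illustrative and not taken from binlex's actual command line:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical options; these names are illustrative, not binlex's real flags.
struct Args {
    std::string input;  // path to the file to analyse
    std::string mode;   // e.g. "raw", "pe", "elf"
};

class App {
public:
    // Used by the Python binding in place of command-line parsing.
    void SetArgs(Args args) { args_ = std::move(args); }

    // CLI path: parse "-i <file> -m <mode>", then store the result.
    // Returns false on an unknown flag or a flag without a value.
    bool ParseArgs(const std::vector<std::string> &argv) {
        Args parsed;
        for (std::size_t i = 0; i < argv.size(); ++i) {
            if (i + 1 >= argv.size()) return false;
            if (argv[i] == "-i") parsed.input = argv[++i];
            else if (argv[i] == "-m") parsed.mode = argv[++i];
            else return false;
        }
        SetArgs(std::move(parsed));
        return true;
    }

    // Shared by both callers: everything main() does after parsing.
    // This sketch only validates the arguments and returns an exit code.
    int Run() const { return args_.input.empty() ? 1 : 0; }

    const Args &args() const { return args_; }

private:
    Args args_;
};
```

With this split, main() would call ParseArgs and then Run, while the binding would call SetArgs and then Run.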

We should elaborate in the next standup.

Yeah, I think WriteTraits/GetTraits should be separate so we can execute these methods as well as set the appropriate data structures across multiple decompilers. The data structure for the traits will be the same across all decompilers anyway.
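One way the separation could look, as a sketch only: `Trait`, its fields and the `AddTrait` helper are hypothetical, and the signatures of GetTraits and WriteTraits shown here are assumptions rather than the ones in the code base.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical trait record shared by every decompiler.
struct Trait {
    std::string type;   // e.g. "function" or "block"
    std::string bytes;  // hex string of the trait
};

// Hypothetical decompiler; a real one would disassemble sections instead.
class Decompiler {
public:
    void AddTrait(Trait t) { traits_.push_back(std::move(t)); }

    // Collecting: returns data only and performs no I/O.
    const std::vector<Trait> &GetTraits() const { return traits_; }

private:
    std::vector<Trait> traits_;
};

// Writing: one routine serves any decompiler because the structure is shared.
void WriteTraits(std::ostream &out, const std::vector<Trait> &traits) {
    for (const Trait &t : traits) {
        out << "{\"type\":\"" << t.type << "\","
            << "\"bytes\":\"" << t.bytes << "\"}\n";
    }
}
```

Because GetTraits only returns data, the same vector could be handed to a writer, a Python binding or a test without touching the output code.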

OK, the changes seem to be working so far. But I had to put the usage of the file hashes in the WriteTraits function, as this is the only member function; the GetTraits... functions are static. Do they really need to be?

Closing for now 😄 we can still continue improvements!