Choice of fuzzy hash function
wiene opened this issue · 5 comments
To address #170 you kindly added support for TLSH. The present implementation chooses the fuzzy hash function based on the available packages:
- If ssdeep is available, it is chosen, else
- if ppdeep is available, it is chosen, else
- if tlsh is available, it is chosen, else
- fuzzy hashing is switched off.
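The fallback order above can be sketched roughly as follows. This is only an illustration of the described behaviour, not the actual dnstwist code; the helper name pick_fuzzy_hash and the PREFERENCE tuple are made up for this example.

```python
import importlib

# Preference order as described in the list above (importable module names).
PREFERENCE = ("ssdeep", "ppdeep", "tlsh")


def pick_fuzzy_hash(preference=PREFERENCE):
    """Return (name, module) for the first importable backend.

    Returns (None, None) if none of the modules can be imported, which
    corresponds to fuzzy hashing being switched off.
    """
    for name in preference:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None


name, module = pick_fuzzy_hash()
print(name or "fuzzy hashing switched off")
```

The result depends entirely on which packages happen to be installed, which is the source of the concern raised below.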
While I have no particular knowledge about fuzzy hashing, a quick internet search seems to suggest that ssdeep typically performs worse than other functions (see e.g. this paper). Therefore I was considering switching to TLSH for the Debian dnstwist package. Do you think this is a good idea?
The reason for opening #170 was the imminent removal of ssdeep from Debian. In the meantime a new Debian maintainer for ssdeep took over and fixed a build failure, so the package is kept in Debian for the time being. This leaves me with the following situation:
If I switch from python3-ssdeep to python3-tlsh as the recommended dnstwist package dependency in Debian, the fuzzy hash function actually used depends on whether the user happens to have python3-ssdeep installed. I think this is undesirable, since it might lead to confusion when people compare results obtained on different computers. Therefore I wonder whether a switch that allows explicitly requesting a particular fuzzy hash function would be a helpful feature.
If you have other ideas how to address this issue, suggestions are welcome. :-)
This is work in progress now. As you suggested, I'm going to introduce a new command line argument, --lsh [LSH], and deprecate --ssdeep, keeping it as a hidden option for backward compatibility.
It's done. Please pull the latest code.
The new --lsh [LSH] switch allows choosing between ssdeep and tlsh, and selects ssdeep by default if no value is provided. The --ssdeep argument is deprecated and hidden from the help screen, but still available.
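For readers who want to see how such a switch could be wired up, here is a minimal argparse sketch of the behaviour described above. It is an assumption about one possible implementation, not the code from the commit; build_parser, selected_lsh and the prog name are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with an optional-value --lsh and a hidden --ssdeep."""
    parser = argparse.ArgumentParser(prog="dnstwist-sketch")
    parser.add_argument(
        "--lsh",
        nargs="?",                 # the value after --lsh is optional
        const="ssdeep",            # used when --lsh is given without a value
        choices=["ssdeep", "tlsh"],
        metavar="LSH",
        help="fuzzy hash function to use (default: ssdeep)",
    )
    # Deprecated alias: still accepted, but suppressed in the help output.
    parser.add_argument("--ssdeep", action="store_true", help=argparse.SUPPRESS)
    return parser


def selected_lsh(argv):
    """Return the requested algorithm name, or None if fuzzy hashing is off."""
    args = build_parser().parse_args(argv)
    if args.ssdeep and args.lsh is None:
        return "ssdeep"
    return args.lsh


print(selected_lsh(["--lsh"]))          # ssdeep
print(selected_lsh(["--lsh", "tlsh"]))  # tlsh
```

With an explicit value the result no longer depends on which optional packages are installed, which is what makes results comparable across machines.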
Thanks a lot for your work. To perform a quick test I cherry-picked commit bf192bd and applied it on top of release 20221213. Unfortunately, when testing TLSH with this setup, I occasionally ended up with the following error message:
Exception in thread Thread-14:
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3/dist-packages/dnstwist.py", line 855, in run
    task['tlsh'] = int(100 - (min(tlsh.diff(self.lsh_init, lsh_curr), 300)/3))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: argument is not a TLSH hex string
Sadly this issue does not seem to be 100% reproducible.
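The failing line maps a TLSH distance onto a 0 to 100 score by capping the distance at 300 and dividing by 3. As a possible workaround (not necessarily the right upstream fix), the comparison could be guarded so that an unusable digest does not kill the worker thread. The sketch below assumes that the ValueError means one of the two digests is not a valid TLSH string; I have not confirmed the actual cause. The helper name tlsh_similarity is hypothetical, and the diff function is passed in so the example runs without the tlsh package installed.

```python
def tlsh_similarity(diff, lsh_init, lsh_curr):
    """Map a TLSH distance onto a 0..100 similarity score.

    `diff` is a callable such as tlsh.diff. If it raises ValueError
    (assumed here to mean that one digest is unusable), None is returned
    instead of letting the exception propagate.
    """
    try:
        distance = diff(lsh_init, lsh_curr)
    except ValueError:
        return None
    # Same formula as in the traceback: cap at 300, scale to 0..100.
    return int(100 - (min(distance, 300) / 3))


def fake_diff(a, b):
    """Stand-in for tlsh.diff, used for illustration only."""
    if not a or not b:
        raise ValueError("argument is not a TLSH hex string")
    return 150


print(tlsh_similarity(fake_diff, "T1AAAA", "T1BBBB"))  # 50
print(tlsh_similarity(fake_diff, "T1AAAA", ""))        # None
```

Returning None rather than 0 keeps "no comparison possible" distinguishable from "completely different".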