elceef/dnstwist

Choice of fuzzy hash function

wiene opened this issue · 5 comments

wiene commented

To address #170 you kindly added support for TLSH. The present implementation chooses the fuzzy hash function based on which packages are available:

  • If ssdeep is available, it is chosen, else
  • if ppdeep is available, it is chosen, else
  • if tlsh is available, it is chosen, else
  • fuzzy hashing is switched off.
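
The fallback order above can be sketched as follows. This is only an illustration of the described behaviour, not dnstwist's actual code; the function name `pick_fuzzy_hash` and its `importer` parameter are hypothetical and exist so the logic can be exercised without the packages installed.

```python
import importlib


def pick_fuzzy_hash(candidates=("ssdeep", "ppdeep", "tlsh"),
                    importer=importlib.import_module):
    """Return the name of the first importable module, or None.

    Hypothetical helper: mirrors the fallback chain described in this
    issue (ssdeep, then ppdeep, then tlsh, else fuzzy hashing is off).
    """
    for name in candidates:
        try:
            importer(name)
        except ImportError:
            # Package not installed: fall through to the next candidate.
            continue
        return name
    # None of the packages is available: fuzzy hashing is switched off.
    return None


if __name__ == "__main__":
    print(pick_fuzzy_hash())
```

With this scheme the result depends purely on what happens to be installed, which is the source of the reproducibility concern raised below.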

While I have no particular knowledge about fuzzy hashing, a quick internet search suggests that ssdeep typically performs worse than other functions (see e.g. this paper). I am therefore considering switching to TLSH for the Debian dnstwist package. Do you think this is a good idea?

The reason for opening #170 was the imminent removal of ssdeep from Debian. In the meantime, a new Debian maintainer took over ssdeep and fixed a build failure, so the package stays in Debian for the time being. This leaves me with the following situation:

If I switch the recommended dependency of the Debian dnstwist package from python3-ssdeep to python3-tlsh, the fuzzy hash function actually used depends on whether the user happens to have python3-ssdeep installed. I think this is undesirable, since it may cause confusion when people compare results obtained on different computers. I therefore wonder whether a switch that explicitly requests a particular fuzzy hash function would be a helpful feature.

If you have other ideas how to address this issue, suggestions are welcome. :-)

This is work in progress now. As you suggested, I'm going to introduce a new command-line argument, --lsh [LSH], and deprecate --ssdeep while keeping it as a hidden option for backward compatibility.

It's done. Please pull the latest code.
The new --lsh [LSH] switch lets you choose between ssdeep and tlsh, and selects ssdeep by default if no value is provided. The --ssdeep argument is deprecated and hidden from the help screen, but still available.
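
For readers who want to see how such a switch can be wired up, here is a minimal argparse sketch under stated assumptions. It is not copied from dnstwist; the `build_parser` and `resolve_lsh` names are hypothetical, and treating the hidden `--ssdeep` flag as an alias for `--lsh ssdeep` is an assumption about the intended backward compatibility.

```python
import argparse


def build_parser():
    """Hypothetical parser illustrating an optional-value --lsh switch."""
    parser = argparse.ArgumentParser(prog="dnstwist-sketch")
    # nargs='?' makes the value optional: a bare --lsh falls back to `const`.
    parser.add_argument("--lsh", nargs="?", const="ssdeep",
                        choices=("ssdeep", "tlsh"), metavar="LSH",
                        help="evaluate web page similarity with LSH algorithm")
    # Deprecated flag: still parsed, but suppressed in the help output.
    parser.add_argument("--ssdeep", action="store_true",
                        help=argparse.SUPPRESS)
    return parser


def resolve_lsh(args):
    """Return the selected algorithm name, or None if fuzzy hashing is off."""
    if args.lsh:
        return args.lsh
    # Assumption: the legacy flag behaves like `--lsh ssdeep`.
    return "ssdeep" if args.ssdeep else None


if __name__ == "__main__":
    print(resolve_lsh(build_parser().parse_args()))
```

An explicit switch like this makes the chosen algorithm independent of which optional packages are installed, so results from different machines stay comparable.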

wiene commented

Thanks a lot for your work. For a quick test, I cherry-picked commit bf192bd and applied it on top of release 20221213. Unfortunately, when testing TLSH with this setup, I occasionally ended up with the following error message:

Exception in thread Thread-14:
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3/dist-packages/dnstwist.py", line 855, in run
    task['tlsh'] = int(100 - (min(tlsh.diff(self.lsh_init, lsh_curr), 300)/3))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: argument  is not a TLSH hex string

Sadly, this issue does not seem to be 100% reproducible.

I reproduced this issue using:

>>> tlsh.__version
0.2.0

It's been fixed in commit 81896c3.
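
The contents of that commit are not reproduced in this thread. As a rough illustration of the kind of guard that avoids the ValueError above, the sketch below assumes the failure occurs when one of the two hashes is empty or otherwise not a valid TLSH string (the blank argument in the error message hints at this, but it is an assumption). The `tlsh_score` helper and its `diff` parameter are hypothetical; the scoring formula is the one from the traceback.

```python
def tlsh_score(hash_a, hash_b, diff):
    """Map a TLSH distance onto a 0-100 similarity score.

    `diff` is any callable returning the distance between two hashes,
    for example tlsh.diff. Returns None when no score can be computed.
    """
    # Assumption: an empty hash means the input could not be hashed.
    if not hash_a or not hash_b:
        return None
    try:
        distance = diff(hash_a, hash_b)
    except ValueError:
        # The hash library rejected one of the strings.
        return None
    # Same formula as in the traceback: distances of 300 or more map to 0.
    return int(100 - (min(distance, 300) / 3))


if __name__ == "__main__":
    # Stand-in distance function so the sketch runs without tlsh installed.
    print(tlsh_score("a", "b", lambda x, y: 150))
```

With the real library one would pass `tlsh.diff` as the `diff` argument and skip the similarity field whenever the helper returns None.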

wiene commented

Thanks a lot for your fix. I tried again including the changes from commit 81896c3. I can confirm that the issue reported yesterday has disappeared.