glaslos/ssdeep

Is it inconsistent with the C version implemented by the author?

BingqingXue1 opened this issue · 9 comments

When we compared your code with the author's C code, we found that the calculation results were inconsistent. We found that a function was not implemented. Is this a bug?

static int has_common_substring(const char *s1, size_t s1len, const char *s2, size_t s2len)
{
  size_t i, j;
  uint32_t hashes[SPAMSUM_LENGTH - (ROLLING_WINDOW - 1)];

  if (s1len < ROLLING_WINDOW)
    return 0;
  if (s2len < ROLLING_WINDOW)
    return 0;

  // there are many possible algorithms for common substring
  // detection. In this case I am re-using the rolling hash code
  // to act as a filter for possible substring matches

  memset(hashes, 0, sizeof(hashes));

  // first compute the windowed rolling hash at each offset in
  // the first string
  struct roll_state state;
  roll_init (&state);

  for (i = 0; i < ROLLING_WINDOW - 1; i++)
    roll_hash(&state, (unsigned char)s1[i]);
  for (i = ROLLING_WINDOW - 1; i < s1len; i++)
  {
    roll_hash(&state, (unsigned char)s1[i]);
    hashes[i - (ROLLING_WINDOW - 1)] = roll_sum(&state);
  }
  s1len -= (ROLLING_WINDOW - 1);

  roll_init(&state);

  // now for each offset in the second string compute the
  // rolling hash and compare it to all of the rolling hashes
  // for the first string. If one matches then we have a
  // candidate substring match. We then confirm that match with
  // a direct string comparison */
  for (j = 0; j < ROLLING_WINDOW - 1; j++)
    roll_hash(&state, (unsigned char)s2[j]);
  for (j = 0; j < s2len - (ROLLING_WINDOW - 1); j++)
  {
    roll_hash(&state, (unsigned char)s2[j + (ROLLING_WINDOW - 1)]);
    uint32_t h = roll_sum(&state);
    for (i = 0; i < s1len; i++)
    {
      // confirm the match after checking potential match
      if (hashes[i] == h && !memcmp(s1 + i, s2 + j, ROLLING_WINDOW))
	  return 1;
    }
  }

  return 0;
}

If you think this function needs to be implemented, I can implement it and submit a PR

Do you have an example of a hash mismatch?

The function you are mentioning is only used when comparing two hashes, so it will not have an impact on the calculated hash.

Yes, It only affects the distance of ssdeep hash.
For example , the one hash is 24:YDVLfvBD0D1b8Hlapg2MbMdI6hPZ3QCw4qat:YDbM8HlOUbMu6HACwVat. And
the other is 24:YDVLfzQM/QEzpg7RMdlgx6h4Z3QCw4qat:YDmM/QO6MI6uACwVat.
We calculate the score of ssdeep to be 65. But the value we calculated using the author's C program is 60. The reason is that go does not implement this function.

Then this is definitely missing. Thank you for raising it! You can either just add a test that fails for now or implement the extra condition. Otherwise I'll add it later this week.

No problem. I'll raise PR tomorrow or the day after tomorrow.

Here is PR #34

Awesome, thanks a lot! I will prepare a new release later today!

Released 0.4.0 with the fix by @BingqingXue1 Thank you for your contribution!