michielbuddingh/spamsum

Short SpamSum values not matching through Compare function


I'm trying to test spamsum, but I'm seeing the following results:

  sscompare -compare -string1 "the quick brown fox jumped over the lazy dog." -string2 "the quick brown fox jumped over the lazy dog."
  byte string 1:  [116 104 101 32 113 117 105 99 107 32 98 114 111 119 110 32 102 111 120 32 106 117 109 112 101 100 32 111 118 101 114 32 116 104 101 32 108 97 122 121 32 100 111 103 46]
  hash1:  3:UkLKKI6myFRc5:UAIp+o
  byte string 2:  [116 104 101 32 113 117 105 99 107 32 98 114 111 119 110 32 102 111 120 32 106 117 109 112 101 100 32 111 118 101 114 32 116 104 101 32 108 97 122 121 32 100 111 103 46]
  hash2:  3:UkLKKI6myFRc5:UAIp+o
  comparison result:  13

The code:

func compareStrings(byteval1 []byte, byteval2 []byte) {
	hash1 := hashString(byteval1)
	hash2 := hashString(byteval2)
	fmt.Println("comparison result: ", hash1.Compare(*hash2))
}

Other short strings seem to be failing with inconsistent behaviour.

Spamsum was originally designed as a checksum for spam. In that use case you want to avoid false positives, and I think that's why the compare function has a built-in limit that drops as the block size gets smaller.

For very small block sizes like the ones here (3 and 6, as shown in the hash), it's simply not possible to get a similarity score higher than 13, even if the strings match 100%.
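To show where the 13 comes from, here is a rough sketch of the cap as I understand it from the original spamsum C source. The constants (rolling window of 7, minimum block size of 3) and the helper name `capScore` are assumptions for illustration; this is not code from this package, and I have not checked it line by line against the Go port.

```go
package main

import "fmt"

// Values assumed from the original spamsum C source, not from this Go package.
const (
	rollingWindow = 7
	minBlockSize  = 3
)

// capScore is a hypothetical helper: it applies the small-block-size limit
// to a raw 0-100 similarity score. len1 and len2 are the lengths of the two
// signature strings being compared.
func capScore(score, blockSize, len1, len2 int) int {
	// Block sizes at or above this threshold (45 with the values above)
	// are left uncapped.
	threshold := (99 + rollingWindow) / rollingWindow * minBlockSize
	if blockSize >= threshold {
		return score
	}
	shorter := len1
	if len2 < shorter {
		shorter = len2
	}
	// The limit scales with the block size and the shorter signature length.
	limit := blockSize / minBlockSize * shorter
	if score > limit {
		return limit
	}
	return score
}

func main() {
	// First signature in the issue, "UkLKKI6myFRc5": block size 3, 13 characters.
	fmt.Println(capScore(100, 3, 13, 13)) // prints 13
	// Second signature, "UAIp+o": block size 6, 6 characters.
	fmt.Println(capScore(100, 6, 6, 6)) // prints 12
}
```

Under these assumptions the two signature halves are capped at 13 and 12, and the larger of the two gives the 13 reported above. The limit depends on the signature length, so other short inputs will top out at different values.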

Whether this is a sensible decision or not is, for the purposes of this implementation, irrelevant, as it aims to be a 100% reimplementation of spamsum.

Again, due to my negligence in answering this question promptly, I'll close this issue and request that you open a new one if this answer is unsatisfactory.