jjmontesl/codenamize

Duplicate word(s)

Opened this issue · 8 comments

"dusty" appears twice, maybe others? ¯_(ツ)_/¯

Absolutely right...

There is more than one non-unique adjectives:

  • thirsty (2 times)
  • tense (2 times)
  • sore (2 times)
  • silent (2 times)
  • petite (2 times)
  • lively (2 times)
  • dusty (2 times)
  • cute (2 times)
  • bright (2 times)

The nouns list seems to be ok.

I confirmed that removing these duplicates will cause a different random result (tested on concode).

So, not sure what is the best course of action:

  1. Keep as is.
  2. Change and break the persistence of previous codes.

It definitely will.

I think it's reasonable to introduce a backwards-compatibility breaking change. My proposal:

  • Instead of removing the adjectives, replace them by new adjectives and document this. This wouldn't break current hashes backwards, except for the 50% of the affected words.
  • Optionally, we can use a comparative/superlative version of the adjectives so it's easy to realize that they were formerly the duplicated ones (/ej thirsty/thristier, tense/tenser, sore/... I am unsure about some of them). I vote yes to this.
  • Optionally, keep the current list and offer a compatibility mode (I don't think this is necessarily good). I vote no to this.

What do you think?

What do you think?

Well. I am always in favor of simple solutions that do not leave you with a "bad taste" and definitely ones that do not create additional or alternative technical debt.

So, my opinion is to do one of these:

  1. Option 1: Avoid the breaking change and leave as is.
  2. Option 2: Accept the breaking change and simply remove the duplicates.
  3. Option 3: Embrace the breaking change and remove the duplicates, and use this opportunity to add or update the dictionaries even further.

Any further movement on this? I've ported Codenamize to Kotlin and while dealing with a bug caused by the white space in the 'ad hoc' adjective, I found a duplicate for dusty. I am not sure I can see how using the comparative/superlative would work in this case, since it means using additional characters which breaks part of the algorithm, no?

I feel so bad for this bug.

After considering this again I'll vote for @DannyBen 's option 1 above.

This effectively reduces the hashing space and increases the chance of collisions a bit, but I think it's the least breaking change.

If this approach caused a limitation to any client, I propose to start working on option 3 and please open a new Issue in order to update the list of adjectives for a future codenamize hash version (removing duplicates and adding a few adjectives).

In any case, for compatibility with current hashes, we need to keep the current list as-is at least for the current version.

(I'll keep this issue open for anyone hitting this until we settle on a new future codenamize hashing format).

Please let me know of your Kotlin port when ready to link it from the home page. I'll also add notes to README and code to inform about the reduced hashing spaces and non-linear collision chances. Again, being a hashing library, though, I think that maintaining backwards compatibility is more important than reducing collision chances.

My 0.02: I use it primarily for human-readable identifiers for what is backed by full hashes elsewhere. The risk of collisions on short prefixes is decently high to begin with - I don't expect it to be collision resistant even on human scales.

I definitely think they should be changed at some point.

I don't really like the idea of keeping them in the same semantic-space, I'd pick wholly different ones. In fact you might want to check with a tool like editdistance for near-collisions.

Regardless of path taken, changing the word list is definitely a semver-major-breaking-change.