wooorm/nspell

Unexpected capital letters returned for certain capitalized misspellings

Closed this issue · 3 comments

For steps to quickly reproduce with a related gist, see the original issue I've opened on retext-spell.

TLDR: see PR 38 and PR 39 that I've opened against this repo.

Background

I have found over 200 5-letter dictionary words for which retext-spell produces suggestions such as the following:

  • It suggests "ThreE" when given "Threa", or in general /^Thre[f-ln-racuvxyz]$/
  • It suggests "MaJor" when given "Maxor" or in general /^Ma[d-ho-raluwxz]or$/
  • It suggests "ChYme" when given "Cheme" or in general /^Ch[d-hj-np-xabz]me$/

There is nothing to suggest this problem is limited to 5-letter dictionary words, as I've stumbled across the following:

  • It suggests "TinpOt" when given "Tinph" or in general /^Tinp[bcdfhlmqvx]$/

One example I've tested

  • assume the input word is "Twinz"
  • assume character ais "t" at index 0
  1. "T" is uppercase in "Twinz".
  2. The first group is "qwertzuop".
    1. We know the position of "t" in the group is 4.
    2. We split the word "Twinz" at position 4 into before="Twin" and after = ""
    3. We iterate thrice to group character b="e"
      • We select an uppercase "E" for index 4 because "T" was uppercase at index 0

This is how we arrive at the confusing suggestion of "TwinE".

Problem

When creating the initial array of edits based on keyboard groups, the current logic of nspell is as follows:

  • For a given character in an input word, we convert it to lowercase to obtain a
  1. Check whether character a is uppercase in the input word
  2. Iterate through all the keyboard groups containing character a
    1. Store the position of character a within the group
    2. Split the input word at character a's position within the group (note, this defaults to splitting at the very end of the word when the character's position in the group exceeds the length of the input word)
    3. Iterate through each group character b != a

This results in uppercase characters inserted at seemingly random positions in the text because the position of character a within the group is not the index of character a within the input word.

Note, the notion of "Keyboard groups" doesn't really make sense in the current approach in main. The suggest function inserts each replacement character at a position (of a in group) that is unrelated to the original position (of a in word). This "root problem" is ignored in PR 38 and addressed in PR 39.

Awesome, thanks for digging in and doing work on these two PRs!

I see that both PRs are WIP: what can we do to make them actionable? How can I help?

I see that both PRs are WIP: what can we do to make them actionable? How can I help?

The most helpful feedback you could give right now would be how I should approach the test cases:

  • Is it OK if the tests that fail are essentially the same except that the suggestions are in a different order?

After that, I'd like to hear your thoughts on what seems to be broken keyboard groups logic. For example, the input Ghandi leads to an initial "edit" of Grandi even though h and r are not in the same keyboard group. Is it even important that the keyboard groups do not work as expected? Essentially, if that issue were fixed (like in PR #39), we would not need a separate fix for this capitalization issue.

I've only set the WIP status because the tests are failing. If backwards compatibility for certain arbitrary test cases is not an issue, then we would just need to update the tests. PR #38 is failing the tests less severely than PR #39, though I think PR #39 makes more logical sense.

I've spent some time today looking into how to make PR #39 match the tests more closely, and it really seems that one test case in particular (test #46) is really reliant on replacements made by what seems like a buggy implementation of replacements based on keyboard groups.

I see that both PRs are WIP: what can we do to make them actionable? How can I help?

In relation to a new issue I've raised, check out updates I've made on PR #39.

I think it's almost ready!