atom/node-spellchecker

Handling of Unicode suggestions

Opened this issue · 8 comments

I'm looking at a couple of problems with Atom's spell-check module, and I think I've traced one of them down to this library. Specifically, passing words to GetCorrectionsForMisspelling doesn't seem to handle the encoding correctly.

In this case, I'm using the German word möchten, which is spelled correctly. If I run the test via the command line (Ubuntu with a de-DE dictionary installed), I get this:

$ hunspell -d /usr/share/hunspell/de_DE
Hunspell 1.4.0
möchten
*

möchte 
+ möchten

$

This is pretty much the output I'm expecting to see. If I do the same with a short JS script, however:

// Create the spelling wrapper and add the German dictionary to it.
var spell = require("./lib/spellchecker");
spell.setDictionary("de-DE", '/usr/share/hunspell');

// Test some function calls.
console.log("isMisspelled:", spell.isMisspelled("möchten"));
console.log("checkSpelling:", spell.checkSpelling("möchten"));
console.log("getCorrectionsForMisspelling:", spell.getCorrectionsForMisspelling("möchten"));

I get the following:

$ node test.js
isMisspelled: true
checkSpelling: [ ]
getCorrectionsForMisspelling: [ 'm�chten' ]
$

According to the CLI, this is not a misspelled word. checkSpelling agrees and reports no errors, but the suggestions don't come through intact. I noticed that checkSpelling has a slightly more involved signature that converts from UTF-16 to UTF-8, while the other methods don't.

isMisspelled only returns a boolean, but the fact that it reports true for a correctly spelled word suggests the Unicode string going in isn't being converted to the format hunspell expects. Since "möchten" is correct, I suspect the same problem exists on the input side of getCorrectionsForMisspelling, and its output may need an encoding conversion as well.

I looked at the hex output from getCorrectionsForMisspelling and got <Buffer 6d ef bf bd 63 68 74 65 6e>. Where the umlaut should be (c3 b6 in UTF-8), there is ef bf bd, which is the UTF-8 encoding of U+FFFD, the replacement character. That suggests the suggestion text is produced in a different encoding and mangled somewhere during the conversion back to a JavaScript string.
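For reference, here is a tiny standalone check of what those byte sequences mean (this has nothing to do with the module's code, it just confirms the interpretation above):

// byte_dump.cc - quick sanity check of the bytes involved, not library code.
#include <cstdio>
#include <string>

static void dump(const char *label, const std::string &s) {
  std::printf("%s:", label);
  for (unsigned char c : s) std::printf(" %02x", c);
  std::printf("\n");
}

int main() {
  // "möchten" with the umlaut spelled out as its UTF-8 bytes (c3 b6).
  dump("expected", "m\xc3\xb6" "chten");
  // What the module actually returns: ef bf bd is the UTF-8 encoding of
  // U+FFFD REPLACEMENT CHARACTER, the value substituted for invalid input.
  dump("observed", "m\xef\xbf\xbd" "chten");
  return 0;
}

Running it prints 6d c3 b6 63 68 74 65 6e for the expected string and 6d ef bf bd 63 68 74 65 6e for the observed one, matching the Buffer above.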

Sadly, I couldn't figure it out. I spent a few hours messing with the gyp file and the C++ but haven't worked out the specifics to get it working. I tried converting getCorrectionsForMisspelling to use the UTF-16 form, but I don't think I did it correctly, since I'm mostly just monkey-bashing code to see if it works.

Any help would be appreciated.

It's interesting to see how much work is done in the hunspell CheckSpelling() wrapper vs. the other method wrappers:

std::vector<MisspelledRange> HunspellSpellchecker::CheckSpelling(const uint16_t *utf16_text, size_t utf16_length) {

CheckSpelling() takes its parameters as UTF-16, whereas the others just take const std::string& word. And according to this write-up, all strings in the JavaScript environment start out as UTF-16: https://kev.inburke.com/kevin/node-js-string-encoding/

So I suppose the solution would be to treat all incoming strings in spellchecker_hunspell.cc as UTF-16 and convert them to UTF-8 before sending them to hunspell?
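Something along these lines, as a rough standalone sketch (the ToUtf8 helper is hypothetical, not the module's actual code, and it uses the std::codecvt_utf8_utf16 facet, which is deprecated in C++17 but still available):

// utf16_to_utf8.cc - standalone sketch of the proposed conversion.
// JS hands the native layer UTF-16 code units; hunspell wants bytes in the
// dictionary's encoding (assumed to be UTF-8 here).
#include <codecvt>   // std::codecvt_utf8_utf16 (deprecated in C++17, still usable)
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <locale>
#include <string>

// Hypothetical helper: convert a UTF-16 buffer (as received from V8) to UTF-8.
static std::string ToUtf8(const uint16_t *utf16_text, size_t utf16_length) {
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
  const char16_t *begin = reinterpret_cast<const char16_t *>(utf16_text);
  return converter.to_bytes(begin, begin + utf16_length);
}

int main() {
  // "möchten" as UTF-16 code units.
  const uint16_t word[] = { 'm', 0x00f6, 'c', 'h', 't', 'e', 'n' };
  std::string utf8 = ToUtf8(word, sizeof(word) / sizeof(word[0]));

  // Prints 6d c3 b6 63 68 74 65 6e, which is what hunspell's spell() and
  // suggest() would need to see for a UTF-8 dictionary.
  for (unsigned char c : utf8) std::printf("%02x ", c);
  std::printf("\n");
  return 0;
}

The reverse conversion would be needed on the suggestions coming back out.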

I'm looking at this again because a number of issues are being opened against atom/spell-check. I'm not sure the UTF-16 to UTF-8 conversion is the entire answer. I've added a few new tests, mostly using the German dictionary, trying to get failing tests that help isolate the problem. I think I have the tests correct, but a bit of validation wouldn't hurt there. :)

I noticed that the tests use de_DE but my Linux machine needed de_DE_frami to work properly. Not sure if that is right.

I ended up focusing on a single test, using jasmine-focused and fit inside the spec, that checks "Kein Kine möchten möchte" (Kein and möchten are correct, Kine and möchte are not). While running it, I added a number of std::cout statements to spellchecker_hunspell.cc to try to identify the problem. (All of my debugging output is prefixed with DREM.)

So far, what I've noticed is that the loop doesn't see the UTF-16 words: it hits an "unknown" state and stops processing the string. If I add setlocale(LC_ALL, "en_US.utf8");, it picks the words up, but it still doesn't correctly identify them.
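That locale dependency is easy to reproduce outside the module: the C library's wide-character classification only recognizes non-ASCII letters once a suitable LC_CTYPE is in effect. A minimal standalone demonstration (this is not the library's actual word-breaking loop):

// locale_demo.cc - shows how setlocale changes wide-character classification,
// which is one way a word-breaking loop can stop at the first non-ASCII letter.
#include <clocale>
#include <cstdio>
#include <cwctype>

int main() {
  const wchar_t o_umlaut = L'\u00f6';  // U+00F6 LATIN SMALL LETTER O WITH DIAERESIS

  // In the default "C" locale, iswalpha() typically recognizes only ASCII letters.
  std::printf("C locale:   iswalpha(U+00F6) = %d\n", std::iswalpha(o_umlaut) != 0);

  // With a UTF-8 locale selected, the same character classifies as alphabetic
  // (assuming en_US.utf8 is actually installed on the machine).
  if (std::setlocale(LC_ALL, "en_US.utf8")) {
    std::printf("en_US.utf8: iswalpha(U+00F6) = %d\n", std::iswalpha(o_umlaut) != 0);
  } else {
    std::printf("en_US.utf8 locale is not available on this system\n");
  }
  return 0;
}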

At this point, I'm still not sure how to fix this.

My current WIP is at https://github.com/dmoonfire/node-spellchecker/tree/unicode-checking. You can grab it and run the tests and see my DREM output and a breakdown.

@dmoonfire I might have found something interesting...

Curiously enough, this also happens when I use the Hunspell library directly from C++ (built with GCC).

I'm working on an Electron application, and one of the requirements is spell checking. I couldn't get the application to build with electron-spellchecker for various reasons, so I decided to look for another library and found nodehun. Nodehun has its own native module implementation, but it only took a few unit tests to find that I get exactly the same errors you do when I verify accented Brazilian Portuguese words.

I decided to check whether the same errors happen with the C++ library directly, and they do (my .cxx file is encoded as UTF-8, and I tested the library with both literal accented characters and \u escapes).

Then I remembered that I should check the encoding of my local hunspell .aff and .dic files (at /usr/share/hunspell), and I was surprised to see that both are encoded in ISO-8859 (probably -1).

So I figured I could test two other things:

  1. I could re-encode the .cxx file to ISO-8859-1 and test my program again. Unsurprisingly, all results were correct after I did this (except for the printed output, since my terminal is configured for UTF-8).

  2. Better than that, I could re-encode the .aff and .dic files to UTF-8 and leave my .cxx file in UTF-8. This resulted in multiple console errors of the form error: line 1176: multiple definitions of an affix flag. A quick search took me to a very interesting bug report at hunspell, so I followed Caolán McNamara's advice and added a FLAG UTF-8 directive to my UTF-8-encoded .aff file, and guess what... everything worked as expected, both in my C++ program and in my Electron/Node.js program. (A sketch of the re-encoding is included after this list.)
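For anyone who wants to try the same thing, the re-encoding itself is mechanical, since every ISO-8859-1 byte maps directly to a Unicode code point. Here is a rough sketch for the .aff file (the file handling and the forced FLAG UTF-8 line are assumptions; check your dictionary's existing FLAG scheme before using it):

// reencode_aff.cc - sketch of re-encoding an ISO-8859-1 .aff file to UTF-8.
// Not production code: it assumes the input really is ISO-8859-1 and that the
// dictionary uses the default single-character flag scheme.
#include <fstream>
#include <iostream>
#include <string>

// Every ISO-8859-1 byte value equals its Unicode code point, so the conversion
// is a direct byte-to-UTF-8 expansion.
static std::string Latin1ToUtf8(const std::string &in) {
  std::string out;
  for (unsigned char c : in) {
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}

int main(int argc, char **argv) {
  if (argc != 3) {
    std::cerr << "usage: reencode_aff <in.aff> <out.aff>\n";
    return 1;
  }
  std::ifstream in(argv[1], std::ios::binary);
  std::ofstream out(argv[2], std::ios::binary);
  std::string line;
  while (std::getline(in, line)) {
    // Replace the declared encoding; hunspell reads it from the SET directive.
    if (line.rfind("SET ", 0) == 0) {
      out << "SET UTF-8\n";
      // Per the hunspell bug report mentioned above, flags may need to be parsed
      // as UTF-8 once the file is re-encoded. Skip this line if the .aff already
      // declares FLAG long or FLAG num.
      out << "FLAG UTF-8\n";
      continue;
    }
    out << Latin1ToUtf8(line) << "\n";
  }
  return 0;
}

The .dic file only needs the Latin1ToUtf8 pass, since it has no SET directive.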

It seems all I have to do now is distribute this "fixed" UTF-8-encoded .aff file with my UTF-8-encoded .dic file, and I'll have correct spell checking in my Electron application... :)

By the way, if this actually fixes things for other people too, maybe there is another thing to consider: these UTF-8 hunspell dictionaries do not have FLAG UTF-8 in their .aff files; at least the Brazilian Portuguese one doesn't. Considering that hunspell bug report, though, I think maybe they should add it, and the encoding instructions might need to be updated too.

I don't know if this information is useful for you, but I hope this helps somehow.

s-m-e commented

@dmoonfire

Kein and möchten are correct, Kine and möchte are not

"möchte" is in fact correct ("Ich/er möchte ..." == "I/he want[s] ..."). It's first and third person singular of "möchten" ("to want"). If you're looking for something "nearby" that actually does not exist, try "möchter" or "möcht" for instance.

Please also have a look at issue 161 against atom/spell-check. It confirms @rmarianni's findings. Most Linux distributions tend to ship Hunspell dictionaries encoded as ISO8859-1. I am running openSUSE Leap 42.3, for instance, and its *.aff dictionary files explicitly have a "SET ISO8859-1" line at the top. If re-encoding those *.aff files as UTF-8 solves the issue, as it does for a number of people (see the above-mentioned issue), I'd almost guess that the encoding declarations in the *.aff files are simply not honored by node-spellchecker ... (?)
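For what it's worth, Hunspell itself exposes the declared encoding, so the binding could in principle convert to and from it instead of assuming UTF-8 or the process locale. A minimal check of what a given dictionary declares (the paths here are just an example; adjust them to whatever is installed locally):

// dic_encoding.cc - prints the encoding a dictionary declares, as reported by
// Hunspell itself. Link against your installed hunspell library (version varies).
#include <hunspell/hunspell.hxx>
#include <iostream>

int main() {
  Hunspell dict("/usr/share/hunspell/de_DE.aff", "/usr/share/hunspell/de_DE.dic");

  // get_dic_encoding() returns the charset named by the .aff SET directive,
  // e.g. "ISO8859-1" or "UTF-8". Words passed to spell()/suggest() and the
  // suggestions coming back are in this encoding, not necessarily UTF-8.
  std::cout << "Dictionary encoding: " << dict.get_dic_encoding() << std::endl;
  return 0;
}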

@s-m-e: Yeah, I realized I was using the wrong word. My latest efforts in #95 use a different one. I had a problem with the setlocale call inside the Hunspell code in relation to the dictionaries. If both the file and the locale are UTF-8, it works just fine, but I'm not sure how to fix it properly in a way that doesn't require re-encoding the dictionaries.

Also, those fixes don't work when using the built-in spell checkers on Windows 10 or macOS. I'm plugging away at it, but C++ isn't my native language anymore and it's taking me a while. I apologize.

s-m-e commented

@dmoonfire What's the current status with respect to this bug?

@s-m-e: Mostly at a point of frustration, to be honest. I have #95, but it seems to be failing on the Mac side when "prefer Hunspell" is off. I'm not sure how to investigate that, mainly because I don't have access to a Mac to work out what is going wrong and, as I said, C++ is not my native language anymore. I'm not sure how to move forward with this.

@maxbrunsfeld do you have any bandwidth to help out with this? I have a sneaking suspicion that this might be an issue around text-buffer.