Invalid space characters

Question

Closed this issue 6 years ago · 1 comments

The current scanner treats U+200C (ZERO WIDTH NON-JOINER) and U+200D (ZERO WIDTH JOINER) as spaces, but they are joiners and should not be handled as spaces.

The Unicode standard specifically states that U+2060 (WORD JOINER) must be ignored for "word segmentation".

The same applies to U+180E, the Mongolian Vowel Separator (MVS). The standard states that "MVS is not a suffix but an integral part of the word stem".
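
To illustrate the behaviour being asked for, here is a minimal Python sketch. It is not the project's scanner: `NOT_SPACES`, `is_space` and `split_words` are hypothetical names made up for this example, and treating tab plus the Unicode `Zs` category as "space" is an assumption rather than what the EBNF actually says.

```python
import unicodedata

# Code points named in this issue that must NOT be treated as spaces.
# (Hypothetical constant, for illustration only.)
NOT_SPACES = frozenset({
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\u180E",  # MONGOLIAN VOWEL SEPARATOR
})


def is_space(ch: str) -> bool:
    """Return True if the single character ch separates words.

    Assumption: a space is a tab or any character in the Unicode
    "Zs" (space separator) category, minus the explicit exclusions.
    """
    if ch in NOT_SPACES:
        return False
    return ch == "\t" or unicodedata.category(ch) == "Zs"


def split_words(text: str) -> list:
    """Split text on spaces, keeping joiners inside the words."""
    words, current = [], []
    for ch in text:
        if is_space(ch):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```

With this, `split_words("a\u200Cb c\u2060d")` yields two words, each keeping its joiner. The explicit exclusion set keeps the result stable even if the scanner is built against an older Unicode database in which U+180E was still classified as a space separator.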

Answer 1 · 2018-07-22T20:51:00.000Z

Updated the EBNF, but nothing has changed in the code yet.