honungsburk/kombo

Support parsing at byte, codepoint and grapheme cluster level.

honungsburk opened this issue · 0 comments

JavaScript strings are UTF-16 encoded, which means a string has three different units of length, each useful in different situations.

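  • bytes: The number of bytes the string actually occupies. In UTF-16 this is twice the number of 16-bit code units, and the code unit count is what `length` reports.
  • codepoints: UTF-16 is a variable-length encoding, so 1 codepoint takes either 2 or 4 bytes (one or two code units).
  • grapheme clusters: 1 or more codepoints. This is what users of the library think of as "characters".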
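To make the difference concrete, here is a minimal sketch that measures one string in each unit. The helper name `lengths` and the sample string are made up for illustration and are not part of kombo. It assumes a runtime with `Intl.Segmenter` (recent Node.js or browsers) and a TypeScript `lib` setting that includes its types.

```typescript
// Hypothetical helper, not part of kombo: measure a string in each unit.
interface Lengths {
  bytes: number;      // UTF-16 bytes (2 per code unit)
  codeUnits: number;  // what String.prototype.length reports
  codePoints: number; // Unicode scalar values / lone surrogates
  graphemes: number;  // user-perceived characters
}

function lengths(s: string): Lengths {
  // Iterating a string with the spread operator yields one entry per codepoint.
  const codePoints = [...s].length;

  // Intl.Segmenter with "grapheme" granularity splits on grapheme cluster boundaries.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let graphemes = 0;
  for (const _segment of segmenter.segment(s)) {
    graphemes++;
  }

  return {
    bytes: s.length * 2,
    codeUnits: s.length,
    codePoints,
    graphemes,
  };
}

// "e" + combining acute accent (U+0301), followed by U+1F600 as a surrogate pair.
const sample = "e\u0301\uD83D\uDE00";
const result = lengths(sample);
console.log(result);
// { bytes: 8, codeUnits: 4, codePoints: 3, graphemes: 2 }
```

The accented "e" is two codepoints but one grapheme cluster, and the emoji is one codepoint but two code units, so all of the counts disagree for this input.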

Right now, I believe functions such as chompIf look at codepoints. This should be documented clearly, and we should add new combinators so that a string can be parsed at each level of fidelity: bytes, codepoints, or grapheme clusters.
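As a rough sketch of what "one step" could mean at each level, the functions below advance an offset by a single unit and apply a predicate to it. Everything here (`Level`, `unitEnd`, `chompIfAt`) is hypothetical and is not kombo's API. It also assumes the byte level is approximated by 16-bit code units, since that is the smallest unit a JavaScript string exposes, and that `Intl.Segmenter` is available.

```typescript
// Hypothetical sketch, not kombo's API.
type Level = "codeUnit" | "codePoint" | "grapheme";

const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Returns the offset just past the unit starting at `offset`, or -1 at end of input.
function unitEnd(src: string, offset: number, level: Level): number {
  if (offset < 0 || offset >= src.length) return -1;
  switch (level) {
    case "codeUnit":
      return offset + 1;
    case "codePoint": {
      // Codepoints above U+FFFF are stored as a surrogate pair (two code units).
      const cp = src.codePointAt(offset)!;
      return offset + (cp > 0xffff ? 2 : 1);
    }
    case "grapheme": {
      // `containing` returns the grapheme cluster that covers the given code unit index.
      const seg = graphemeSegmenter.segment(src).containing(offset);
      return seg.index + seg.segment.length;
    }
  }
}

// Consumes one unit if the predicate accepts it; returns the new offset or -1.
function chompIfAt(
  src: string,
  offset: number,
  level: Level,
  isGood: (unit: string) => boolean
): number {
  const end = unitEnd(src, offset, level);
  if (end === -1) return -1;
  return isGood(src.slice(offset, end)) ? end : -1;
}

// "e" + combining acute accent, then U+1F600 as a surrogate pair, then "x".
const input = "e\u0301\uD83D\uDE00x";
console.log(unitEnd(input, 0, "codePoint")); // 1
console.log(unitEnd(input, 0, "grapheme"));  // 2
console.log(unitEnd(input, 2, "codeUnit"));  // 3
console.log(unitEnd(input, 2, "codePoint")); // 4
```

With a shape like this, the predicate passed to a grapheme-level combinator would receive the whole cluster (for example "e" plus its accent) instead of only its first codepoint.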