/prose

Pure-Haskell proper Unicode string handling

Primary language: Haskell. License: BSD-3-Clause (BSD 3-Clause "New" or "Revised" License).

Pure-Haskell proper Unicode strings

λ> graphemes "བོད་ཀྱི་སྐད་ཡིག།"
["བོ","ད","་","ཀྱི","་","སྐ","ད","་","ཡི","ག","།"]

See prose-lens for a lens interface.

segmentation:
  ✓⃞ grapheme
  ✓⃞ words  ⃞ tailored
  ⃞ sentences  ⃞ tailored
  ⃞ line-breaking  ⃞ tailored

normalization:
  ✓⃞ NFD ✓⃞ NFKD ✓⃞ NFC  ⃞ NFKC

collating  ⃞ …
transformation  ⃞ …
character properties  ⃞ …
other cldr  ⃞ …


HCAR entry:

Many programming languages offer nonexistent or very poor support for Unicode. While many think that Haskell is not one of them, this is not completely true. Text, the go-to library for Haskell strings, provides only codepoint-level operations. A small and very elementary example: take two “Haskell café” strings, the first written with the ‘é’ character and the second with the ‘e’ character followed by a combining acute accent. In many real-world situations they obviously ought to be treated as the same string, yet for Text and its operations they are entirely different and unconnected things.
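The “Haskell café” case can be sketched with nothing but the text package. This is a minimal illustration, not Prose's API: `precomposed`, `decomposed` and `toyDecompose` are names made up for this example, and `toyDecompose` is a hand-written stand-in that handles only U+00E9, whereas real canonical decomposition (NFD) covers the whole Unicode character database.

```haskell
import qualified Data.Text as T

-- "Haskell café" written with the precomposed é (U+00E9).
precomposed :: T.Text
precomposed = T.pack "Haskell caf\233"

-- The same text written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed :: T.Text
decomposed = T.pack "Haskell cafe\769"

-- Hypothetical toy decomposition: expands U+00E9 only and leaves every
-- other character untouched. It is NOT a real NFD implementation.
toyDecompose :: T.Text -> T.Text
toyDecompose = T.concatMap expand
  where
    expand '\233' = T.pack "e\769"
    expand c      = T.singleton c

main :: IO ()
main = do
  -- Text compares codepoint by codepoint, so the two spellings differ.
  print (precomposed == decomposed)                            -- False
  -- T.length counts codepoints, not user-perceived characters.
  print (T.length precomposed, T.length decomposed)            -- (12,13)
  -- Once both sides are brought to the same form, they compare equal.
  print (toyDecompose precomposed == toyDecompose decomposed)  -- True
```

Both strings look identical to a reader, but they agree only after being brought to a common normalization form, and counting them as 12 characters either way requires grapheme segmentation rather than codepoint counting.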

And even though there is the text-icu library offering proper Unicode functions, it takes the form of FFI bindings to a C library (which is painful, especially for Windows users). Moreover, its API is very low-level and incomplete.

Prose is a (work-in-progress) pure Haskell implementation of Unicode strings. Right now it is completely unoptimized. The parts implemented so far are the normalization algorithms and segmentation by graphemes and words.

Numerals is a pure Haskell implementation of CLDR (Common Locale Data Repository, Unicode’s locale data) numeral formatting.

Further reading

http://lelf.lu/prose
https://github.com/llelf/prose
https://github.com/llelf/numerals


optimizations: none

                                      Prose/𝘚    ICU
segmentation/graphemes one-lang text  1.60ms     0.47ms
segmentation/graphemes chars sample   15.84ms    16.30ms