Add robust unicode support (probably via ICU bindings)
brson opened this issue · 10 comments
There's been lots of talk about unicode over the years. We have little support in the core libs, but need to provide something better for serious use. Best idea now is to wrestle libicu into a Rust crate. Start out-of-tree.
libicu has had security vulnerabilities in the past, is UTF-16 based, and provides a lot more than what we would need. https://github.com/lifthrasiir/rust-encoding is pretty mature at this point, and all of the data it uses is publicly available. It might be a better idea to just have all unicode support outside of std in a libunicode, and implement the algorithms from the Unicode spec in pure Rust.
Some vulnerabilities:
http://www.redhat.com/archives/rhsa-announce/2011-December/msg00037.html
http://www.redhat.com/archives/rhsa-announce/2009-June/msg00016.html
http://www.debian.org/security/2008/dsa-1511
Needless to say, the exact sort of thing Rust is great at preventing! This seems like a good place where we could even provide the same API as libicu and get Rust into the wider world.
I think this may end up depending on a) how much Unicode support we want to include (initially) b) how much interest there is in implementing it.
I'd like to note that lib{core, std} already contain some interesting things that are written almost completely in safe Rust code, e.g. case-folding, NF(K)D.
Personally if a non-libicu libunicode comes into existence I'd be more than willing to rewrite PR #12792 to be included in that, and time permitting implement some other algorithms.
As a Rust-outsider, the level of Unicode support really depends on
- how much space you want to spend on Unicode support (some of the tables needed to do it properly are fairly large)?
- how much pain you want to endure while implementing it (some of the algorithms are pretty painful to write (at least without using existing code (haven't checked the ICU license lately, though)))?
So whatever you do, please ...
- use UTF-8. I'm sick of seeing other encodings leaking all over the place (especially in cases where people claimed "yes, we use encoding X internally, but we will make sure that we don't leak it").
- remember that the "length" of a string, in terms of human perception (not in terms of "amount of storage required"), depends on many things, including the locale. This also applies to most other Unicode operations as well. Design the API with that in mind, not as an afterthought.
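To make the "length" point concrete, here is a minimal sketch using only the standard library. The helper name `lengths` is hypothetical, made up for illustration. It reports two of the possible lengths of a string; neither is the user-perceived one, which needs grapheme cluster segmentation that std does not provide.

```rust
/// Returns (bytes of UTF-8 storage, number of Unicode scalar values).
/// Neither value is the length a human would perceive.
fn lengths(s: &str) -> (usize, usize) {
    // `len` counts bytes; `chars().count()` counts scalar values.
    (s.len(), s.chars().count())
}
```

For example, `lengths("e\u{301}")` (an "e" followed by a combining acute accent) gives `(3, 2)`, while a reader sees a single character.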
Re 'how much unicode to support', we've long taken a stance that std should provide "some", and a separate crate provides "a lot" (whatever ICU does), and can be opted into. Where the line is drawn of what exactly goes into std is a matter of continual debate, but I want this issue to focus on the Unicode kitchen sink crate, and how to integrate it into the distribution.
To get an idea of what ICU actually provides I went over the documentation and made some notes.
Here they are in case anyone else finds them useful:
Character Properties
- look up character properties
- lots of them
- useful mostly to build other algorithms
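For comparison, a small sketch of the handful of property lookups that today's standard library exposes on `char`. The helper `describe` is hypothetical, and this is only a tiny subset of what ICU offers.

```rust
/// Collects the names of a few std-provided properties that hold for `c`.
fn describe(c: char) -> Vec<&'static str> {
    let mut props = Vec::new();
    if c.is_alphabetic() { props.push("alphabetic"); }
    if c.is_numeric() { props.push("numeric"); }
    if c.is_uppercase() { props.push("uppercase"); }
    if c.is_lowercase() { props.push("lowercase"); }
    if c.is_whitespace() { props.push("whitespace"); }
    if c.is_control() { props.push("control"); }
    props
}
```

Properties such as script, general category, or East Asian width are not covered by these methods and would have to come from a separate crate.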
StringPrep
- RFC 3454 https://tools.ietf.org/html/rfc3454
- locked to Unicode 3.2 (possibly requires separate tables)
- about to be replaced by PRECIS (PREparation and Comparison of Internationalized Strings) http://tools.ietf.org/wg/precis/
- framework with different profiles
- used in:
  - XMPP (switching to PRECIS)
  - IDNA2003 (replaced by IDNA2008)
  - NFS4 (current bis-draft instead describes behaviour of implementations, no mention of StringPrep)
- still useful (for now), but I'd rather have it in a separate crate
Conversion
- converts between different encodings
- many non-Unicode
- Unicode: UTF-8, UTF-16{,LE,BE}, UTF-32{,LE,BE}, SCSU, BOCU-1, UTF-7, UTF-EBCDIC, CESU-8
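As a point of reference, the UTF-8/UTF-16 part of this is already possible with the standard library alone. A minimal sketch (the helper names `to_utf16` and `from_utf16` are hypothetical wrappers around real std calls):

```rust
/// Encodes a UTF-8 string slice as UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Decodes UTF-16 code units, returning None on invalid input
/// such as an unpaired surrogate.
fn from_utf16(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

Characters outside the Basic Multilingual Plane come out as surrogate pairs, e.g. U+1F600 becomes `0xD83D 0xDE00`. The other encodings in the list above are not covered by std.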
Locales & Resources
- built-in concept of locales and locale-specific resources
- partially better suited for a separate crate, but to some extent needed for proper case-folding and collation
Date/Time
- IMHO belongs in a different crate
Formatting
- locale specific formatting (currency, dates, time, …)
- IMHO belongs in a different crate
Transforms
- case-mapping: lower-, upper-, title-case
- Full/Halfwidth conversion
- normalization (NFD, NFC, NFKD, NFKC)
- BiDi Algorithm
- custom
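To illustrate where locale comes in for case-mapping, a minimal sketch with today's standard library (the helper `upper` is a hypothetical wrapper). std applies the locale-independent mappings, including ones that change the string length, but has no way to ask for language-specific tailoring.

```rust
/// Upper-cases a string using std's locale-independent mapping.
fn upper(s: &str) -> String {
    s.to_uppercase()
}
```

`upper("straße")` yields `"STRASSE"` (one character becomes two), and `upper("i")` always yields `"I"`, whereas Turkish would expect a dotted capital İ. That tailoring is the kind of thing a locale-aware crate would have to provide.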
Collation
- sorting according to locale rules
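A quick sketch of why this needs more than std (the helper `sorted` is hypothetical): the default ordering on `&str` compares UTF-8 bytes, which matches code point order but not what any locale expects.

```rust
/// Sorts words with the default `Ord` for `&str`, i.e. by UTF-8 bytes.
fn sorted(mut words: Vec<&str>) -> Vec<&str> {
    words.sort();
    words
}
```

With the input `["zebra", "éclair", "Apple"]` this gives `["Apple", "zebra", "éclair"]`; a French or English collation would place "éclair" before "zebra".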
Boundary Analysis
- find character (actually grapheme cluster), word, line-break, sentence boundaries
Layout Engine
- support for text rendering
- can read fonts
- does not actually render since that is platform specific, but provides a base class
- IMHO very specific and should be a separate crate/its own project
cc @lifthrasiir
Is anyone working on ICU bindings?
I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.
This issue has been moved to the RFCs repo: rust-lang/rfcs#797