Add robust unicode support (probably via ICU bindings)

Question

brson opened this issue 11 years ago · 10 comments

There's been lots of talk about Unicode over the years. We have little support in the core libs, but need to provide something better for serious use. The best idea right now is to wrestle libicu into a Rust crate, starting out-of-tree.

Answer 1 · 2014-06-04T23:15:27.000Z

libicu has had security vulnerabilities in the past, is UTF-16 based, and provides a lot more than what we would need. https://github.com/lifthrasiir/rust-encoding is pretty mature at this point, and all of the data it uses is publicly available. It might be a better idea to just have all Unicode support outside of std in a libunicode, and implement the algorithms from the Unicode spec in pure Rust.

Answer 2 · 2014-06-04T23:17:18.000Z

Some vulnerabilities:

http://www.redhat.com/archives/rhsa-announce/2011-December/msg00037.html
http://www.redhat.com/archives/rhsa-announce/2009-June/msg00016.html
http://www.debian.org/security/2008/dsa-1511

Needless to say, the exact sort of thing Rust is great at preventing! This seems like a good place where we could even provide the same API as libicu and get Rust into the wider world.

Answer 3 · 2014-06-05T15:23:58.000Z

I think this may end up depending on a) how much Unicode support we want to include (initially) and b) how much interest there is in implementing it.
I'd like to note that lib{core, std} already contain some interesting things that are written almost completely in safe Rust code, e.g. case-folding, NF(K)D.
Personally, if a non-libicu libunicode comes into existence, I'd be more than willing to rewrite PR #12792 to be included in that and, time permitting, implement some other algorithms.

Answer 4 · 2014-06-05T19:55:24.000Z

Speaking as a Rust outsider: the level of Unicode support really depends on

how much space you want to spend on Unicode support (some of the tables needed to do it properly are fairly large)
how much pain you want to endure while implementing it (some of the algorithms are pretty painful to write, at least without using existing code; I haven't checked the ICU license lately, though)

So whatever you do, please ...

use UTF-8. I'm sick of seeing other encodings leaking all over the place (especially in cases where people claimed "yes, we use encoding X internally, but we will make sure that we don't leak it").
remember that the "length" of a string, in terms of human perception (not in terms of "amount of storage required"), depends on many things, including the locale. The same applies to most other Unicode operations. Design the API with that in mind, not as an afterthought.
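
To make the point about "length" concrete, here is a minimal sketch using only the standard library. The helper name `lengths` is hypothetical (it is not an existing API); it simply reports two of the possible answers to "how long is this string?", neither of which is the length a human would perceive.

```rust
/// Hypothetical helper: returns (UTF-8 byte count, Unicode scalar value count).
/// Neither number is the user-perceived length; that would need
/// grapheme cluster segmentation, which std does not provide.
fn lengths(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For example, `lengths("e\u{301}")` ("e" followed by a combining acute accent) gives 3 bytes and 2 scalar values, while the precomposed `"\u{e9}"` gives 2 bytes and 1 scalar value, even though both render as a single "é".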

Answer 5 · 2014-06-05T22:26:28.000Z

Re 'how much unicode to support', we've long taken a stance that std should provide "some", and a separate crate provides "a lot" (whatever ICU does), and can be opted into. Where the line is drawn of what exactly goes into std is a matter of continual debate, but I want this issue to focus on the Unicode kitchen sink crate, and how to integrate it into the distribution.

Answer 6 · 2014-06-07T05:27:46.000Z

To get an idea of what ICU actually provides I went over the documentation and made some notes.
Here they are in case anyone else finds them useful:

Character Properties

look up character properties
lots of them
useful mostly to build other algorithms
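
As a rough illustration of property lookup, the sketch below uses the handful of properties that Rust's standard library exposes on `char`. The `Props` struct and `props` function are hypothetical names for this example; ICU covers far more properties than these four.

```rust
/// Hypothetical bundle of a few properties that std can answer today.
#[derive(Debug, PartialEq)]
struct Props {
    alphabetic: bool,
    numeric: bool,
    whitespace: bool,
    uppercase: bool,
}

/// Looks up the properties of a single Unicode scalar value.
fn props(c: char) -> Props {
    Props {
        alphabetic: c.is_alphabetic(),
        numeric: c.is_numeric(),
        whitespace: c.is_whitespace(),
        uppercase: c.is_uppercase(),
    }
}
```

Other algorithms (segmentation, normalization, collation) need properties beyond these, which is where dedicated tables come in.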

StringPrep

RFC 3454 https://tools.ietf.org/html/rfc3454
locked to Unicode 3.2 (possibly requires separate tables)
about to be replaced by PRECIS (PREparation and Comparison of Internationalized Strings) http://tools.ietf.org/wg/precis/
framework with different profiles
used in:
- XMPP (switching to PRECIS)
- IDNA2003 (replaced by IDNA2008)
- NFS4 (current bis-draft instead describes behaviour of implementations, no mention of StringPrep)
still useful (for now), but I'd rather have it in a separate crate

Conversion

converts between different encodings
many non-Unicode
Unicode: UTF-8, UTF-16{,LE,BE}, UTF-32{,LE,BE}, SCSU, BOCU-1, UTF-7, UTF-EBCDIC, CESU-8
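
For the simplest case in that list, UTF-8 to UTF-16 and back, the standard library is already enough. The following is a minimal sketch; the function names `utf8_to_utf16` and `utf16_to_utf8` are made up for this example, and none of the other encodings listed above are covered.

```rust
use std::string::FromUtf16Error;

/// Hypothetical helper: re-encodes a Rust string (always UTF-8) as UTF-16 code units.
fn utf8_to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Hypothetical helper: decodes UTF-16 code units, failing on unpaired surrogates.
fn utf16_to_utf8(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}
```

Characters outside the Basic Multilingual Plane become surrogate pairs, so `utf8_to_utf16("a\u{1F600}")` yields three code units, and a lone surrogate such as `[0xD800]` is rejected on the way back.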

Locales & Resources

built-in concept of locales and locale-specific resources
partially better suited for a separate crate, but to some extent needed for proper case-folding and collation

Date/Time

IMHO belongs in a different crate

Formatting

locale specific formatting (currency, dates, time, …)
IMHO belongs in a different crate

Transforms

case-mapping: lower-, upper-, title-case
Full/Halfwidth conversion
normalization (NFD, NFC, NFKD, NFKC)
BiDi Algorithm
custom
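
A small sketch of the case-mapping item, using what std offers today. The wrapper names `upper` and `lower` are hypothetical; the underlying `str::to_uppercase` and `str::to_lowercase` are locale-independent, so locale-specific rules (for example Turkish dotted and dotless i) are out of scope here.

```rust
/// Hypothetical wrapper around std's locale-independent upper-case mapping.
/// The result may have a different number of characters than the input.
fn upper(s: &str) -> String {
    s.to_uppercase()
}

/// Hypothetical wrapper around std's locale-independent lower-case mapping.
fn lower(s: &str) -> String {
    s.to_lowercase()
}
```

Note that `upper("straße")` produces `"STRASSE"`, one character longer than the input, which is why case mapping has to work on strings rather than character by character.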

Collation

sorting according to locale rules
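
To show why locale-aware collation is a separate feature, here is a sketch of what plain sorting does in Rust. The function name `sort_by_code_point` is hypothetical; it relies only on the fact that `str` comparison is byte-wise, which for UTF-8 matches code point order.

```rust
/// Hypothetical helper: sorts by code point order (std's default for `&str`).
/// This is NOT locale-aware collation; it is the baseline that collation improves on.
fn sort_by_code_point(mut words: Vec<&str>) -> Vec<&str> {
    words.sort();
    words
}
```

Sorting `["zebra", "Äpfel", "apple", "Zoo"]` this way gives `["Zoo", "apple", "zebra", "Äpfel"]`: upper-case ASCII sorts before lower-case, and "Ä" lands after "z", which is not what a German or English reader would expect.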

Boundary Analysis

find character (actually grapheme cluster), word, line-break, sentence boundaries
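
The sketch below hints at what "character" boundaries involve. It is a deliberate simplification and not the segmentation algorithm ICU implements: it only attaches code points from the Combining Diacritical Marks block (U+0300 to U+036F) to the preceding character. Both function names are hypothetical.

```rust
/// Simplifying assumption: only U+0300..=U+036F counts as "combining".
/// Real grapheme segmentation uses many more properties and rules.
fn is_combining_diacritic(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Hypothetical helper: splits a string into naive clusters of a base
/// character followed by any combining diacritics.
fn naive_clusters(s: &str) -> Vec<String> {
    let mut clusters: Vec<String> = Vec::new();
    for c in s.chars() {
        match clusters.last_mut() {
            // A combining mark extends the previous cluster, if there is one.
            Some(last) if is_combining_diacritic(c) => last.push(c),
            // Anything else starts a new cluster.
            _ => clusters.push(c.to_string()),
        }
    }
    clusters
}
```

With this, `naive_clusters("e\u{301}a")` returns two clusters, `"e\u{301}"` and `"a"`. Emoji sequences, Hangul syllables, and other scripts would need the full rule set.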

Layout Engine

support for text rendering
can read fonts
does not actually render since that is platform specific, but provides a base class
IMHO very specific and should be a separate crate/its own project

Answer 7 · 2014-06-07T05:40:31.000Z

cc @lifthrasiir

Answer 8 · 2014-10-03T17:48:46.000Z

Is anyone working on ICU bindings?

Answer 9 · 2015-01-25T21:10:02.000Z

@Jurily, https://gist.github.com/ArtemGr/91e88de7e17fbc571926

Answer 10 · 2015-02-02T10:58:02.000Z

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#797