open-i18n/rust-unic

Higher level APIs

zbraniecki opened this issue · 12 comments

Currently UNIC is focused on character-level internationalization, but ICU/CLDR support many higher-level APIs, several of which are very useful for localization.

I'm working on https://github.com/projectfluent/fluent-rs which is a port of http://projectfluent.org/ to Rust, and the main missing items are:

  • Plural Rules
  • Date Time Formatting
  • Number Formatting

Is there any plan for UNIC to extend to that level or do you think it's outside of the scope of this package?

CAD97 commented

I'll quote the README here: (I trust behnam's word more than mine)

The goal for UNIC is to provide access to all levels of Unicode and Internationalization functionalities, starting from Unicode character properties, to Unicode algorithms for processing text, and more advanced (locale-based) processes based on Unicode Common Locale Data Repository (CLDR).

So yes, eventually these algorithms should enter the UNIC family of crates. Right now, we're more focused on getting the remaining UCD properties mapped into crates (and neither behnam nor I have much time to give this project right now), but we'd welcome any PR / design work towards implementing further i18n algorithms that people will find useful.

What would it look like, code-structure-wise?

I think a good first step would be an API similar to the proposed JS Intl.Locale - https://github.com/tc39/proposal-intl-locale/

which would allow us to build, parse and operate on language tags. Since we're around Unicode, it would be later extended to support ICU/CLDR getLikelySubtags which is useful for language negotiation.
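To make the idea concrete, here is a minimal sketch of what building, parsing and serializing a language tag could look like. The `Locale` type and its methods are hypothetical names made up for illustration, not the API of Intl.Locale, fluent-langneg or any UNIC crate. It handles only `language[-script][-region]` with a 2-3 letter language subtag, and rejects anything with variants or extensions.

```rust
use std::fmt;

/// Hypothetical minimal locale type, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Locale {
    pub language: String,
    pub script: Option<String>,
    pub region: Option<String>,
}

fn is_alpha(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphabetic())
}

fn is_digit(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}

/// Canonical script casing: first letter upper, rest lower (e.g. "Cyrl").
fn title_case(s: &str) -> String {
    let mut out = s.to_ascii_lowercase();
    if let Some(first) = out.get_mut(0..1) {
        first.make_ascii_uppercase();
    }
    out
}

impl Locale {
    /// Parses `language[-script][-region]`. Simplification: variants,
    /// extensions and longer language subtags are rejected.
    pub fn parse(tag: &str) -> Option<Locale> {
        let mut parts = tag.split('-').peekable();

        let language = parts.next()?;
        if !((2..=3).contains(&language.len()) && is_alpha(language)) {
            return None;
        }

        // Script: exactly four letters.
        let mut script = None;
        let next = parts.peek().copied();
        if let Some(s) = next {
            if s.len() == 4 && is_alpha(s) {
                script = Some(title_case(s));
                parts.next();
            }
        }

        // Region: two letters or three digits.
        let mut region = None;
        let next = parts.peek().copied();
        if let Some(s) = next {
            if (s.len() == 2 && is_alpha(s)) || (s.len() == 3 && is_digit(s)) {
                region = Some(s.to_ascii_uppercase());
                parts.next();
            }
        }

        // Anything left over is outside the scope of this sketch.
        if parts.next().is_some() {
            return None;
        }

        Some(Locale { language: language.to_ascii_lowercase(), script, region })
    }
}

impl fmt::Display for Locale {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.language)?;
        if let Some(script) = &self.script {
            write!(f, "-{}", script)?;
        }
        if let Some(region) = &self.region {
            write!(f, "-{}", region)?;
        }
        Ok(())
    }
}
```

With this sketch, `Locale::parse("sr-cyrl-rs")` yields a value that serializes back as `sr-Cyrl-RS`, so casing is normalized on the way in. Something like getLikelySubtags would then be a separate, data-backed layer on top of this type.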

I'm currently maintaining a Rust crate, fluent-langneg, which does a lot of that, and I'd like to extract the Locale class parsing/serializing to live outside and let fluent-langneg use it for a specific language negotiation approach.

Would it make sense to try to get the Locale API into UNIC?

Hey, Zibi! Sorry for the delay here! This is definitely a priority now and would be great to have you help with the API design.

One of the things we're trying to do here in UNIC is to keep the data model and API separate from the data itself. For example, for character properties, all the utilities and abstractions for Character Property live under unic-char, and then unic-ucd-* implement those abstractions. So, the goal is to not only make UCD data available to everyone, but also share the tooling used along the way, so everyone else can also define their own UCD-like properties as needed.
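As a rough illustration of that split between model and data, the sketch below puts an abstraction on one side and a table-backed implementation on the other. All names here (`CharProperty`, `ToyScript`, `TOY_SCRIPT_TABLE`, `describe`) are invented for this example and are not the actual unic-char or unic-ucd-* API; the table is toy data, not UCD.

```rust
use std::cmp::Ordering;
use std::fmt::Debug;

/// "Model" side: a hypothetical abstraction any property can implement.
pub trait CharProperty: Sized {
    /// Human-readable property name.
    const NAME: &'static str;
    /// Looks up the property value for a character.
    fn of(ch: char) -> Self;
}

/// "Data" side: a made-up property backed by a small range table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyScript {
    Latin,
    Greek,
    Unknown,
}

/// Sorted, non-overlapping (start, end, value) ranges. Stands in for a
/// generated data table; the contents are toy data.
const TOY_SCRIPT_TABLE: &[(char, char, ToyScript)] = &[
    ('A', 'Z', ToyScript::Latin),
    ('a', 'z', ToyScript::Latin),
    ('\u{0391}', '\u{03A9}', ToyScript::Greek),
    ('\u{03B1}', '\u{03C9}', ToyScript::Greek),
];

impl CharProperty for ToyScript {
    const NAME: &'static str = "ToyScript";

    fn of(ch: char) -> Self {
        // Binary search over the ranges; the comparator reports where each
        // range sits relative to `ch`.
        TOY_SCRIPT_TABLE
            .binary_search_by(|&(lo, hi, _)| {
                if ch < lo {
                    Ordering::Greater
                } else if ch > hi {
                    Ordering::Less
                } else {
                    Ordering::Equal
                }
            })
            .map(|idx| TOY_SCRIPT_TABLE[idx].2)
            .unwrap_or(ToyScript::Unknown)
    }
}

/// Tooling written once against the abstraction works for every property,
/// including user-defined ones.
pub fn describe<P: CharProperty + Debug>(ch: char) -> String {
    format!("{}={:?}", P::NAME, P::of(ch))
}
```

The point of the design is that `describe` never mentions `ToyScript`: a downstream crate could define its own UCD-like property, implement the trait, and reuse the same tooling.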

So, thinking about how that applies to Locale data, here's my first try on the abstractions...

We can start with unic-locale, providing data model for Locale identifiers.

  • unic-locale-lang, data model for ISO-639 and Unicode/BCP-47 rules.
  • unic-locale-region, data model for ISO-3166 and Unicode/BCP-47 rules.
  • unic-locale-script, data model for ISO-15924 and Unicode/BCP-47 rules.
  • etc.

Similar to other components, the bottom ones are standalone, and the higher-level ones build on top. I'm suggesting micro-size components, so that they can be re-used when only one little piece is needed. Also, we learned that it helps with having a good API.

And another area would be making Locale data available, which would be mostly from CLDR. This would include the name translations, data formatters/parsers, calendar, etc. I guess we can name this unic-localedata or something like that.

Another thing about localedata is that we probably want to enable builds limited to only one or a few locales. I think Rust/Cargo features are the only tool available for that, but I'd love to see a more advanced structure for it.
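For reference, the feature-based approach could look roughly like the sketch below. The feature names `locale-en` and `locale-pl` and the function `first_month` are hypothetical; a real crate would declare the features in the `[features]` table of its Cargo.toml and would gate generated CLDR data rather than a hand-written string.

```rust
/// Returns the name of the first month for a locale, but only if the data
/// for that locale was compiled in via a Cargo feature.
/// Feature names are assumptions made for this sketch.
pub fn first_month(locale: &str) -> Option<&'static str> {
    match locale {
        // Each arm exists only when the matching feature is enabled, so
        // disabled locales add no data to the binary.
        #[cfg(feature = "locale-en")]
        "en" => Some("January"),
        #[cfg(feature = "locale-pl")]
        "pl" => Some("styczeń"),
        _ => None,
    }
}

/// Lists the locales compiled into this build.
pub fn compiled_locales() -> Vec<&'static str> {
    let mut out = Vec::new();
    if cfg!(feature = "locale-en") {
        out.push("en");
    }
    if cfg!(feature = "locale-pl") {
        out.push("pl");
    }
    out
}
```

The downside this makes visible is that every locale needs its own feature and its own gated arm, which is why a more advanced structure would be welcome once the number of locales grows.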

What do you think?

That sounds good!

I'm still fairly new to the Rust world, especially the land of monorepos and multi-package projects. Could you draft a POC of what such a unic-locale-lang PR might look like for rust-unic? I'd be happy to take it from there, fill in the correct algorithm, and build unic-locale on top.

@behnam - I didn't hear back from you, and we started working on the intl-pluralrules crate, which we need for fluent-rs.

I'd still be interested in getting an intl-locale crate analogous to https://github.com/tc39/proposal-intl-locale or to what we use in Gecko as MozLocale - https://searchfox.org/mozilla-central/source/intl/locale/MozLocale.h

In your comment above you suggested separate crates for lang/region/script - would those only check that a subtag is well-formed, or would you also want to do validation inside?

Should the result be four crates - one for each main subtag, plus one for the full locale?
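To spell out the distinction I mean, here is a small sketch for the region subtag. The `Region` type is a hypothetical name, and the `KNOWN` list is a tiny hand-picked stand-in for real registry data. Well-formedness only looks at the shape of the subtag, while validation needs data.

```rust
/// Hypothetical region subtag type, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region(String);

impl Region {
    /// Well-formedness: checks shape only (two ASCII letters or three
    /// ASCII digits) and normalizes to upper case. Needs no data.
    pub fn parse(s: &str) -> Option<Region> {
        let alpha2 = s.len() == 2 && s.chars().all(|c| c.is_ascii_alphabetic());
        let digit3 = s.len() == 3 && s.chars().all(|c| c.is_ascii_digit());
        if alpha2 || digit3 {
            Some(Region(s.to_ascii_uppercase()))
        } else {
            None
        }
    }

    /// Validation: membership in a set of known codes. The list below is
    /// a toy subset; a real crate would ship generated data.
    pub fn is_valid(&self) -> bool {
        const KNOWN: &[&str] = &["DE", "MX", "PL", "US", "419"];
        KNOWN.iter().any(|&code| code == self.0.as_str())
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

In this sketch `Region::parse("qq")` succeeds because the shape is right, but `is_valid()` returns false because `QQ` is not in the toy list. Keeping validation out would let the per-subtag crates stay data-free.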

@zbraniecki @behnam I'd be super happy to contribute to a set of locale crates. I tend to work on minority language projects and repeatedly come across the lack of locales in Rust as an issue.

Being able to go from BCP-47 tag to a Locale object of some kind is super important for all kinds of applications.

It seems to me next steps would be to have a cldr crate and add support to unic-gen to generate .rsv files?

CAD97 commented

That sounds reasonable. Feel free to ping me on the Gitter or on a GitHub issue/PR, I'll be happy to mentor you through the unic-gen structure. I can't promise a swift response but I'm usually on in some capacity.

@unclenachoduh is working on the plural rules: https://github.com/unclenachoduh/pluralrules

There are three crates in it:

  1. CLDR parser
  2. Rule generation (takes CLDR AST and produces Rust functions)
  3. intl-pluralrules - the actual production crate for selecting plural rules

We'll be releasing the first versions this week.
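For readers unfamiliar with plural rules, the sketch below shows the kind of Rust function the second step is meant to produce and the third step exposes. It is hand-written for illustration, handles only non-negative integers, and is not the output of the generator or the API of intl-pluralrules; the names `PluralCategory` and `select` are assumptions.

```rust
/// The six CLDR plural categories.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PluralCategory {
    Zero,
    One,
    Two,
    Few,
    Many,
    Other,
}

/// English cardinal rule for integers: `one` for exactly 1, else `other`.
fn english(n: u64) -> PluralCategory {
    if n == 1 {
        PluralCategory::One
    } else {
        PluralCategory::Other
    }
}

/// Polish cardinal rules for integers: `one` for exactly 1, `few` when the
/// number ends in 2-4 but not 12-14, and `many` for every other integer
/// (`other` applies only to fractions, which this sketch does not handle).
fn polish(n: u64) -> PluralCategory {
    let (m10, m100) = (n % 10, n % 100);
    if n == 1 {
        PluralCategory::One
    } else if (2..=4).contains(&m10) && !(12..=14).contains(&m100) {
        PluralCategory::Few
    } else {
        PluralCategory::Many
    }
}

/// Selects the plural category for `n` in the given language, or `None`
/// if this sketch has no rule for that language.
pub fn select(lang: &str, n: u64) -> Option<PluralCategory> {
    match lang {
        "en" => Some(english(n)),
        "pl" => Some(polish(n)),
        _ => None,
    }
}
```

A localization system then uses the returned category to pick a message variant, for example choosing between "1 plik", "22 pliki" and "5 plików" in Polish.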

The crates mentioned before have now been released and have been in use by fluent-rs for almost half a year.

  1. https://crates.io/crates/cldr_pluralrules_parser
  2. https://crates.io/crates/make_pluralrules
  3. https://crates.io/crates/intl_pluralrules

The first one is just a UTS #35 (LDML) compatible parser for the rules, the second generates Rust code, and the third is the public-facing one. I keep them up to date with the latest CLDR release, and would be interested in upstreaming the work to unic.

Is that something that unic should incorporate? If so, what would be the process?

Thanks for the update, @zbraniecki. Glad to see these crates are published!

What I have in mind as the Locale model for UNIC is a superset of Unicode/CLDR Locales, so we can address some of the limitations of CLDR Locales. For example, CLDR Locales are not capable of expressing an English-language text with Mexican Pesos being rendered as "$".

And from the API perspective, I'm implementing the Territory model, as explained in #234, to be the basis of the Territory part of the Locale. (I've been preparing the source data for that for a couple of days now.)

I'm taking a bottom-up approach in this case, to fill in the gaps I think need attention. But I have no objection to merging what's already available into UNIC. We can always find a way to expose CLDR's plural rules for UNIC Locales, merge the functionality, and stabilize the API then.

Would having an unstable API for now be sufficient for fluent-rs? If not, I would recommend keeping the current pluralrules set separate for now until we have the UNIC Locale. (I know it's been taking a long time, but for now it's mostly me. I can write down more about the planning if anyone wants to get more involved.)

What do you think?


Btw, a couple of days ago I set up a release-based CLDR git repo: https://github.com/open-i18n/data-unicode-cldr

@bcmyers released the num-format crate, which is likely to fit right into the NumberFormat space - https://www.reddit.com/r/rust/comments/anaykb/announcement_new_crate_numformat/

@behnam - any progress on the Locale API?

I filed #266 to discuss unic-langid and unic-locale.