apertium/lttoolbox

Use ICU

Closed this issue · 6 comments

How about we switch all I/O and wide char use to ICU instead? That would get rid of all the locale irritations and make the code more portable.

We already require ICU, both directly and indirectly. We could even get rid of PCRE in downstream tools.

I think it's about time we started using ICU. I feel like we last discussed this some 10 years ago, and back then I was against it on the grounds that ICU is quite large by default and not installed as standard, or even easily installable. Since we already have experience using it, I think that's no longer an issue.

Please please please can we do this? 🥺

Is there a roadmap to fix this (i.e. do what ICU does) in C++ itself, e.g. in std?

Sort of. SG16: Unicode Direction is an overview of the work going into Standard C++ by SG16. The types are all there in C++20 (char8_t, char16_t, char32_t representing UTF-8/16/32), but none of the library is.

But they do conclude 2 important things: "wchar_t is a portability deadend" and "In practice this means that we’ll need to ensure that proposals for new Unicode features are implementable using ICU."

So, wchar_t is bad and even Standard C++'s handling of Unicode would likely just forward to ICU.

But that library work is all slated for C++23 or later, which, given the usual 5-year lag, means we can't widely use any of it until 2028 or 2031.
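
To make the "types are there, but none of the library is" point concrete, here is a minimal sketch of what decoding UTF-8 looks like with only the standard library. The function name `decode_utf8` is made up for illustration, and the input is a plain `std::string` rather than C++20's `std::u8string` so that it also compiles under older standards. It does not reject overlong forms or encoded surrogates, which is exactly the kind of detail ICU already handles.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points by hand.
// Malformed or truncated sequences become U+FFFD. Overlong encodings and
// encoded surrogates are NOT rejected, to keep the sketch short.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80) {
            cp = b;
        } else if ((b & 0xE0) == 0xC0) {
            cp = b & 0x1F; extra = 1;
        } else if ((b & 0xF0) == 0xE0) {
            cp = b & 0x0F; extra = 2;
        } else if ((b & 0xF8) == 0xF0) {
            cp = b & 0x07; extra = 3;
        } else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        if (i + extra >= in.size()) {
            // Sequence runs past the end of the input.
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) {
            out.push_back(cp);
            i += extra + 1;
        } else {
            // Bad continuation byte: emit a replacement and resync.
            out.push_back(0xFFFD);
            ++i;
        }
    }
    return out;
}
```

Calling `decode_utf8("a\xC3\xA9")` yields the two code points U+0061 and U+00E9. Everything beyond this (normalization, case folding, character classes, regexes) would still have to come from somewhere, which in practice means ICU.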

Thanks @TinoDidriksen for the excellent overview. In that case I think that moving to ICU is probably the right thing to do, even if it's really ugly. I agree with @flammie that the situation now is very different to what it was 10 years ago.

Re: ICU being ugly: I wrote a wrapper for it when we ported lexd; you might take a look at commit a5251bae0f935301ca9276e90c02e9f3262b9c0d for the port. The wrapper provides a C++ iterator interface instead of ICU's C-like iterator.

It's not complete, but it has worked pretty well for us.
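
For readers who have not seen this pattern, the following is a rough sketch of what a C++ iterator interface over UTF-16 text can look like. It is not the lexd wrapper from the commit above: the names `CodePointView` and `to_utf32` are invented here, and it iterates a `std::u16string` with hand-written surrogate handling as a stand-in for ICU's UTF-16 storage, so that the example has no ICU dependency. A real wrapper would presumably sit on top of `icu::UnicodeString` and ICU's own decoding macros such as `U16_NEXT`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical view that lets range-for walk a UTF-16 string by code point.
class CodePointView {
public:
    explicit CodePointView(const std::u16string& s) : s_(&s) {}

    class iterator {
    public:
        iterator(const std::u16string* s, std::size_t pos) : s_(s), pos_(pos) {}

        // Decode the code point at the current position.
        char32_t operator*() const {
            assert(pos_ < s_->size());
            char16_t hi = (*s_)[pos_];
            if (is_pair()) {
                char16_t lo = (*s_)[pos_ + 1];
                return 0x10000 + ((char32_t(hi) - 0xD800) << 10)
                               + (char32_t(lo) - 0xDC00);
            }
            return hi;  // BMP code point, or an unpaired surrogate passed through
        }

        // Advance by one code point (one or two UTF-16 code units).
        iterator& operator++() {
            pos_ += is_pair() ? 2 : 1;
            return *this;
        }

        bool operator==(const iterator& o) const { return pos_ == o.pos_; }
        bool operator!=(const iterator& o) const { return pos_ != o.pos_; }

    private:
        static bool is_lead(char16_t c)  { return c >= 0xD800 && c <= 0xDBFF; }
        static bool is_trail(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

        // True if the current position starts a valid surrogate pair.
        bool is_pair() const {
            return is_lead((*s_)[pos_]) && pos_ + 1 < s_->size()
                && is_trail((*s_)[pos_ + 1]);
        }

        const std::u16string* s_;
        std::size_t pos_;
    };

    iterator begin() const { return iterator(s_, 0); }
    iterator end() const { return iterator(s_, s_->size()); }

private:
    const std::u16string* s_;  // non-owning; the string must outlive the view
};

// Usage: collect all code points of a UTF-16 string.
std::u32string to_utf32(const std::u16string& s) {
    std::u32string out;
    for (char32_t cp : CodePointView(s)) out.push_back(cp);
    return out;
}
```

The point of the design is that callers write an ordinary range-for loop and never touch offsets or surrogate arithmetic themselves.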