respecting locale and unicode charsets

Question

atstp opened this issue 9 years ago · 9 comments

Just a central place to handle the stuff from #22 and #21: respecting/handling locales and charsets.
#21 settled on (at least) some prefixed functions with support for specific types.

This is about handling generic functions like lower, upper, word?, words, letters?, caseless=, etc.

It looks like there are 4 approaches:

- use naive host functions: simple and easy to build
- use the most capable host function: it'll grow as the host language evolves
- hard-code character ranges: too much work and too little payoff (at least for me, for now)
- set ^:dynamic defaults: people can get crazy if they want (or if they just know more than me)

My personal preference is to use the most capable host functions. Making the defaults dynamic would be okay too, but I'm not pushing for it.

Answer 1 · 2016-06-30T05:32:24.000Z

After thinking about this, my only concern with the "use the most capable host function" approach is that host functions use the host's locale, not one specified by the user.

I imagine a common situation where the server is in the USA and the client is in Turkey: the host uses the US locale, so the functions are limited to US conventions and can't be aware of the client's locale. But implementing all the locale handling in cuerdas is, I think, too much (maybe I'm wrong and it can be done easily, but I doubt it). So I don't clearly see the advantage of using locale-aware host functions when they solve only a small portion of the problem.
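To make the concern concrete, here is a small sketch for JVM Clojure. The `turkish` var and the sample strings are made up for illustration; only the `java.util.Locale` and `String` methods come from the host.

```clojure
(import 'java.util.Locale)

;; Hypothetical locale for this sketch: Turkish, built from a language tag.
(def turkish (Locale/forLanguageTag "tr"))

;; Locale-neutral mapping: the result does not depend on the server's settings.
(.toUpperCase "i" Locale/ROOT)   ; "I"
(.toLowerCase "I" Locale/ROOT)   ; "i"

;; Turkish mapping: dotted and dotless i are distinct letters.
(.toUpperCase "i" turkish)       ; "İ" (U+0130)
(.toLowerCase "I" turkish)       ; "ı" (U+0131)

;; With no Locale argument the JVM falls back to its default locale,
;; i.e. whatever the server happens to be configured with.
(.toUpperCase "i")
```

The no-argument call at the end is the case described above: its result follows the server's configuration, not the client's language.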

Am I wrong?

Answer 2 · 2016-07-04T18:25:41.000Z

What would you think about having a ^:dynamic var named something like *use-locale* (false by default)?

With this, there would be "safe" behaviour that we could build quickly now, and in the future it could be rebound to either true or the host's specific locale, like Locale.US for JVM users.

This would be easy to implement right away, and it would set things up for a smooth(ish) transition to better Unicode support, which should improve over the next few years.
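A minimal sketch of how such a var might behave on the JVM. Everything here is an assumption made for illustration, not the cuerdas implementation: the `resolve-locale` helper, the `upper` function, and the choice of `Locale/ROOT` as the "safe" default when the var is false.

```clojure
(import 'java.util.Locale)

;; Hypothetical var following the proposal:
;;   false    -> locale-neutral behaviour (assumed here to mean Locale/ROOT)
;;   true     -> the host's default locale
;;   a Locale -> that specific locale
(def ^:dynamic *use-locale* false)

(defn- resolve-locale
  "Translates the current value of *use-locale* into a java.util.Locale."
  []
  (cond
    (instance? Locale *use-locale*) *use-locale*
    (true? *use-locale*)            (Locale/getDefault)
    :else                           Locale/ROOT))

(defn upper
  "Upper-cases s according to *use-locale*. Illustrative only."
  [^String s]
  (.toUpperCase s ^Locale (resolve-locale)))

(upper "title")                                   ; "TITLE"

(binding [*use-locale* (Locale/forLanguageTag "tr")]
  (upper "title"))                                ; "TİTLE"
```

Because the var is dynamic, callers who need locale-specific behaviour can opt in per call site with `binding`, while everyone else keeps the default.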

Answer 3 · 2016-07-04T19:18:24.000Z

Ok for the proposal ;)

Answer 4 · 2016-07-04T23:02:26.000Z

JavaScript's Unicode regex support isn't great, but there's an MIT-licensed library named xregexp that seems to be the standard for this kind of thing, and it keeps the Unicode character ranges in a single file.

For now, cuerdas targeting JS will be lacking compared to its JVM counterpart, but this could even the two out.
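For reference, this is roughly the gap in question, shown on the JVM side only. The sample string is made up for this sketch; plain JavaScript regexes at the time of this thread had no equivalent of `\p{L}`, which is what xregexp would supply.

```clojure
;; Sample text with non-ASCII letters, written with escapes so the
;; source file encoding does not matter: "añadir çay 123"
(def sample "a\u00f1adir \u00e7ay 123")

;; java.util.regex understands Unicode property classes such as \p{L}.
(re-seq #"\p{L}+" sample)      ; ("añadir" "çay")

;; An ASCII-only range silently splits words at accented letters.
(re-seq #"[a-zA-Z]+" sample)   ; ("a" "adir" "ay")
```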

Thoughts?

Answer 5 · 2016-07-07T16:08:10.000Z

Hmm, I'm not opposed to including it; it seems like a very well-made library.
In addition to internal use, we can expose its functionality to the user, because many of the things it provides look pretty cool and are not available in plain regexes...

Answer 6 · 2016-07-09T02:19:06.000Z

Yeah, some of its features in cuerdas would be great!

Just as a heads up, I won't be putting in any work on this issue for at least a few days. Some time next week, I plan to pull in some Unicode support from xregexp to help cuerdas target both the JVM and JS equally. After that (and #20) I'll be free to give some attention to fancy regexes.

Answer 7 · 2016-07-26T04:32:01.000Z

For bringing in xregexp, would you prefer it be through cljsjs, :foreign-libs, or something else?
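For context, the :foreign-libs route would look roughly like the compiler-options fragment below. The file paths and the "xregexp" namespace name are placeholders chosen for this sketch, not what cuerdas ended up using.

```clojure
;; ClojureScript compiler options (fragment only).
;; Paths and the provided namespace name are hypothetical.
{:foreign-libs [{:file     "libs/xregexp-all.js"
                 :file-min "libs/xregexp-all.min.js"
                 :provides ["xregexp"]}]}
```

The cljsjs route would instead pull a prepackaged jar in as a regular dependency, so no such entry would be needed in the project itself.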

Answer 8 · 2016-07-26T11:30:50.000Z

I'm already working on including xregexp.

Answer 9 · 2016-08-29T18:38:14.000Z

XRegExp is included, and many fixes in this area are done in master. This issue can be considered done. If something is missing, we can just open a PR or a smaller issue to fix the particular case.