funcool/cuerdas

respecting locale and unicode charsets

atstp opened this issue · 9 comments

atstp commented

Just a central place to handle the stuff from #22 and #21: respecting/handling locales and charsets.
#21 settled on (at least) some prefixed functions with support for specific types.

This is about handling generic functions like lower, upper, word?, words, letters?, caseless=, etc.

It looks like there are 4 approaches:

  • use naive host functions - it's simple and easy to build
  • use the most capable host function - it'll grow as the host language evolves
  • hard-code character ranges - too much work and too little payoff (at least for me for now)
  • set ^:dynamic defaults - people can get crazy if they want (or if they just know more than me)

My personal preference is using the most capable host functions. Making the defaults dynamic would be okay too, but I'm not pushing for it.

After thinking about this, my only concern with the "use the most capable host function" approach is that host functions use the host's locale, not one specified by the user.

I imagine a common situation where the server is in the USA and the client is in Turkey: the host uses the US locale, so the functions are limited to US rules and can't be aware of a specific locale. But implementing all the locale handling in cuerdas is, I think, too much (maybe I'm wrong and it can be done easily, but I doubt it). So I don't clearly see the advantage of using host locale-aware functions when they solve only a small portion of the problem.
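To make the concern concrete, here is a minimal JVM-only sketch of how the result of lower-casing depends on which locale is applied. The helper `lower-in` and the var `turkish` are hypothetical names used for illustration; they are not part of cuerdas.

```clojure
(import 'java.util.Locale)

;; Hypothetical helper (not cuerdas API): lower-case with an explicit locale
;; instead of whatever Locale/getDefault returns on the host.
(defn lower-in [^String s ^Locale locale]
  (.toLowerCase s locale))

(def turkish (Locale/forLanguageTag "tr"))

(lower-in "TITLE" Locale/ROOT) ;; => "title"
(lower-in "TITLE" turkish)     ;; => "tıtle" (dotless ı, U+0131)
```

Calling `.toLowerCase` with no locale argument uses the host's default locale, so a server running with a US default returns "title" no matter where the client is.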

Am I wrong?

atstp commented

What would you think about having a ^:dynamic var named something like *use-locale* (false by default)?

With this, there would be "safe" behaviour that we could build quickly now, and in the future it could be rebound to either true or the host's specific locale, like Locale.US for JVM users.

This would be easy to implement right away, and it would set things up for a smooth(ish) transition to better unicode support, which will get better in the next few years.
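A rough JVM-side sketch of how such a var might behave, assuming `false` means "ignore the host locale", `true` means "use the host default", and a `Locale` instance means "use exactly this one". The name `*use-locale*` comes from the proposal above; `resolve-locale` and this `lower` are illustrative assumptions, not the cuerdas implementation.

```clojure
(import 'java.util.Locale)

;; false by default, as proposed: locale-insensitive ("safe") behaviour.
(def ^:dynamic *use-locale* false)

;; Hypothetical helper: turn the current binding into a concrete Locale.
(defn- resolve-locale ^Locale []
  (cond
    (instance? Locale *use-locale*) *use-locale*
    (true? *use-locale*)            (Locale/getDefault)
    :else                           Locale/ROOT))

;; Illustrative lower-casing function that honours the binding.
(defn lower [s]
  (when (some? s)
    (let [locale (resolve-locale)]
      (.toLowerCase ^String s locale))))

(lower "TITLE")                                   ;; => "title"
(binding [*use-locale* (Locale/forLanguageTag "tr")]
  (lower "TITLE"))                                ;; => "tıtle"
```

Because the var is dynamic, callers can opt in per request with `binding` rather than changing a global setting, which is one way to address the "server in one country, client in another" case.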

OK with the proposal ;)

atstp commented

JavaScript unicode-regex support isn't great, but there's an MIT licensed library named xregexp that seems to be the standard for this kind of thing, and a single file with the unicode character ranges.

For now, cuerdas targeting JS will be lacking compared to its JVM counterpart, but this could even the two out.
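For reference, this is roughly what the JVM side already offers through `java.util.regex` Unicode property classes, and what an ASCII-only range misses. It is a small sketch; `ascii-words` and `unicode-words` are hypothetical names, not cuerdas functions.

```clojure
;; Hypothetical helpers for comparison only.
(defn ascii-words [s]
  (re-seq #"[a-zA-Z]+" s))

(defn unicode-words [s]
  ;; \p{L} matches any Unicode letter on the JVM.
  (re-seq #"\p{L}+" s))

;; "héllo wörld", written with escapes so the source encoding doesn't matter.
(def text "h\u00e9llo w\u00f6rld")

(ascii-words text)   ;; => ("h" "llo" "w" "rld")
(unicode-words text) ;; => ("héllo" "wörld")
```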

Thoughts?

Hmm, I'm not against including it; it seems like a very well-made library.
In addition to internal use, we can expose its functionality to the user, because many of the things it offers look pretty cool and aren't available in plain regexes...

atstp commented

Yeah, some of its features in cuerdas would be great!

Just as a heads up, I won't be putting in any work on this issue for at least a few days. Some time next week, I plan to pull in some unicode support from xregexp to help cuerdas target both the JVM and JS equally. After that (and #20) I'll be free to give some attention to fancy regexes.

atstp commented

For bringing in xregexp, would you prefer it be through cljsjs, :foreign-libs, or something else?
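For context, the `:foreign-libs` route would look something like the compiler-options fragment below. This is only a sketch: the file paths and the `"xregexp"` namespace name are placeholders, not the actual cuerdas build configuration.

```clojure
;; ClojureScript compiler options (fragment).
;; Paths and the provided namespace name are placeholders.
{:foreign-libs [{:file     "path/to/xregexp.js"
                 :file-min "path/to/xregexp.min.js"
                 :provides ["xregexp"]}]}
```

With this approach the library file ships alongside cuerdas, whereas a cljsjs package would pull it in as a regular dependency.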

I'm already working on including xregexp.

XRegExp is included and many fixes in this area are done in master. This issue can be considered done. If something is missing, we can just open a PR or a smaller issue to fix a particular case.