funcool/cuerdas

string identification

atstp opened this issue · 6 comments

atstp commented

use case

When I was putting together #20, I reached for numeric?, only to find out that it matches just digits (granted, that's clear in the docs).

tiny fix

When it comes to identifying strings, a number? function would be a nice tool: one that would match "55" as well as "-55.0", with the idea that this would mostly* work:

(= (cuerdas.core/number? "-55.0")
   (clojure.core/number? (clojure.edn/read-string "-55.0")))

(*) I don't have a strong stance on 1N, 1M, and 1/3
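A minimal sketch of what such a predicate could look like, assuming a plain regex is enough. The name number-string? is hypothetical (it is not an existing cuerdas function), and the pattern deliberately ignores 1N, 1M, 1/3, exponents, and leading-zero edge cases:

```clojure
(require '[clojure.edn :as edn])

(defn number-string?
  "Hypothetical sketch: true when s is a string holding an optionally
  negative integer or decimal, such as \"55\" or \"-55.0\"."
  [s]
  (boolean (and (string? s)
                ;; optional minus sign, digits, optional fractional part
                (re-matches #"-?\d+(\.\d+)?" s))))

(number-string? "55")    ;; => true
(number-string? "-55.0") ;; => true
(number-string? "abc")   ;; => false

;; The comparison from the proposal, for one input where both sides agree:
(= (number-string? "-55.0")
   (number? (edn/read-string "-55.0")))
;; => true
```

Inputs with leading zeros (for example "08") pass this regex but are not handled the same way by the EDN reader, which is one reason the equivalence only "mostly" holds.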

larger changes

This introduces awkwardness with numeric?, alpha-numeric?, and, by relation, alpha?. (Breaking change suggestion:) In my opinion, they could be scrapped in favor of digits?, letters?, and letters-and-digits?, though the latter seems less useful and could probably just be dropped entirely.

At least based on a gut reaction, the familiar alpha and numeric names would leave a hole if they weren't around. To fill that gap, posix- prefixed functions would serve well: posix-alphas?, posix-alnums?, posix-blanks?, etc. would all provide the stable, well-known role that they exist for, leaving cuerdas to provide concisely named, modern equivalents like it already does with words, which breaks tradition by including "-".

A bonus with posix- prefixed functions is that the explicit "old" naming makes cuerdas' modern equivalents expected.

Here are some tests that would pass with what I'm suggesting:

;; each row is a proposed predicate and a string it should accept
(are [tst val] (tst val)
  number?  "-99.8"
  digits?  "99"
  letters? "abcde"
  word?    "this-that"
  word?    "This_that"

  posix-alphas? "aBc"
  posix-digits? "12345"
  posix-alnums? "abc123"
  posix-word?   "This_That")

I'm up for putting most (all) of the work in on this if it's likely to get accepted. Thanks!

I agree with almost all of the proposed changes, including the breaking ones (though functions that are not in conflict could be marked as deprecated first).

This is my list of doubts:

  • Why the posix- prefixed functions? I don't clearly understand their purpose.
  • I miss a function that replaces alpha-numeric?, because it is pretty useful.

Feel free to work on it; I'd be glad to have new and better predicates for identifying strings.

atstp commented

the posix- prefix

You're right, posix is a bad prefix. Sorry to cloud the idea with a bad name. Generally, <prefix>- functions would provide matchers for ASCII/Unicode Latin characters:

  • <prefix>-<just alpha>?: [a-zA-Z], in comparison to generic Unicode "letters"
  • <prefix>-<alpha and numeric>?: characters like [a-zA-Z0-9]/[:alnum:] for product numbers and such

and

  • <prefix>-word?: the unfortunately common ([a-zA-Z0-9_], \w, or [:word:]) meaning of "word"

While they aren't going to change the world, they fill a common need. simple-, latin-, or traditional- could work as well. Generally, they serve enough purpose to be useful, but not enough to nudge out the more useful word? and letter?.

This would allow letter? and word? to adapt as Java and JavaScript support Unicode better. Perhaps this is for another issue, but letter? and word? could match based on the letters of the locale or of any Unicode language.
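To make the distinction concrete, here is a rough sketch of the two families on the JVM. All four function names are hypothetical placeholders, not cuerdas functions, and the \p{L} class assumes Java's regex engine (ClojureScript support may differ):

```clojure
(defn- matches?
  "True when s is a string that the whole pattern re matches."
  [re s]
  (boolean (and (string? s) (re-matches re s))))

;; "Traditional" ASCII-only matchers (the <prefix>- family)
(defn ascii-alpha? [s] (matches? #"[a-zA-Z]+" s))
(defn ascii-alnum? [s] (matches? #"[a-zA-Z0-9]+" s))
(defn ascii-word?  [s] (matches? #"[a-zA-Z0-9_]+" s))

;; Unicode-aware counterpart: any character in the Letter category
(defn unicode-letters? [s] (matches? #"\p{L}+" s))

(ascii-alpha? "aBc")          ;; => true
(ascii-alpha? "привет")       ;; => false
(unicode-letters? "привет")   ;; => true
(ascii-alnum? "abc123")       ;; => true
(ascii-word? "This_That")     ;; => true
(ascii-word? "this-that")     ;; => false
```

The ASCII versions stay stable regardless of platform, while the unprefixed, Unicode-aware versions are the ones that could evolve with better platform support.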

alpha-numeric? replacement

yeah, would letters-and-numbers? work? should it respect locale?

Aha, it seems I'm starting to understand. If I understand correctly, your proposal is to have the "standard" or "traditional" behavior prefixed with posix or traditional, and to have the unprefixed functions behave better than the traditional approach.

I'm pretty convinced by the approach. Feel free to make a PR with that and we will make the final adjustments on the working code. Thank you very much for taking care of this!

We only need to choose the final prefix. For now, use posix; now that I understand the motivation behind the name, it makes sense to me.

atstp commented

great, glad to hear it!

After analyzing the situation, I have taken a slightly different approach. I haven't done the renames because most of the names are obvious:

alpha? is alpha independently of Unicode, so it can be kept as is. alnum? is always alpha + num, so it does not need to be renamed. Other functions, such as letters? and word?, are Unicode-aware without any specific prefix.

In fact, word? is a Unicode-aware alternative to alnum?.

With those changes I can consider this issue fixed ;)