Case-insensitive parsing and validation.
Opened this issue · 30 comments
As pointed out by @stigrj it would be good to not force case-sensitivity upon the users. I see some options:
1. Do it before we parse. This would mean the programmer using the library has to normalize the case of the input file, and the validation template has to adhere to this choice. That is, if normalization is uppercase (lowercase), then the validation template uses uppercase (lowercase).
2. As point 1, but we add an option to `api.lex` (and expose it in the CLI). Something like `--case upper` (`--case lower`).
3. We let the grammar take care of case normalization, by adding a parse action to the various tokens. If we change the parsing library (from pyparsing to Lark, for example) this will not carry over.
4. We do case normalization at the validation level. I think this is the most invasive option of all.
I agree with 2.
Noting this here, as it might be helpful: https://docs.python.org/3/library/stdtypes.html#str.casefold
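For context, `str.casefold()` is a more aggressive variant of `str.lower()`, designed for caseless matching:

```python
# lower() and casefold() agree for ASCII, but casefold() also
# handles caseless matching beyond ASCII, e.g. the German sharp s.
print("B3LYP".lower())      # b3lyp
print("B3LYP".casefold())   # b3lyp
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```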
Does it make sense to you that the case of quoted strings should be preserved? The way Getkw works, you can have quoted and unquoted strings. The latter can only look like valid C user tokens (the first character can be an underscore or an alphabetic character, followed by more underscores and alphanumeric characters), whereas the former can be almost anything (Unicode too, I think). Most importantly, quoted strings can contain slashes, so this is the format one should use to specify file paths. I'd expect file paths to always be intended to be case-sensitive, hence my question.
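To illustrate the distinction, a minimal sketch of the unquoted-token rule described above (hypothetical helper; the actual grammar lives in the pyparsing code):

```python
import re

# First character: underscore or alphabetic; rest: underscores and alphanumerics.
UNQUOTED_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

def is_unquoted_string(value: str) -> bool:
    """Return True if value could be given without quotes."""
    return UNQUOTED_TOKEN.match(value) is not None

print(is_unquoted_string("doNotNormalizeMe"))             # True
print(is_unquoted_string("/home/user/doNotNormalizeMe"))  # False: slashes need quoting
```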
I agree that changing case of paths is dangerous. Also until now I did not realize that we wanted to possibly normalize everything - somehow I imagined that we only normalize keys but not values.
Yeah, it's thorny. Are all of these acceptable?
```
Functional = B3LYP
functional = b3lyp
FunCtioNAL = b3LYP
```
Every time I saw case insensitivity as a user (e.g. Fortran, the old OS X file system (?)), I thought: "wow, what a terrible idea!" - but I guess we can offer the possibility.
How will this work:
```
datapath1 = /home/user/doNotNormalizeMe
datapath2 = "/home/user/doNotNormalizeMe"
```
The current grammar will not be able to parse the former.
How about we only offer case-insensitive keys, but preserve the case of values?
How about this:
```
a_bool = True
another_bool = false
```
Is it defined in the grammar what to accept?
@stigrj All of these are valid for booleans as caseless literals already:
```python
truthy = ["TRUE", "ON", "YES", "Y"]
"""List[str]: List of true-like values."""
falsey = ["FALSE", "OFF", "NO", "N"]
"""List[str]: List of false-like values."""
```
@bast I think I would still like `b3lyp` to be as valid as `B3LYP` though. Regarding floating point numbers, only a caseless `E` is recognized now. `D` can be arranged too, but I'd rather not encourage using `D`...
Good point about the booleans; this is a frequent papercut. Also, floats in scientific notation should allow a case-insensitive "D" and "E".
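A minimal sketch of how both exponent markers could be accepted, case-insensitively (hypothetical helper, not the library's parser; Python's `float` only understands `e`/`E`):

```python
def parse_float(token: str) -> float:
    """Accept Fortran-style 'D' exponents as well as 'e'/'E'."""
    return float(token.upper().replace("D", "E"))

print(parse_float("1.0e-3"), parse_float("1.0E-3"))  # 0.001 0.001
print(parse_float("1.0d-3"), parse_float("1.0D-3"))  # 0.001 0.001
```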
Yes, the `b3lyp` example is convincing. I would also like to be able to type `cc-pvdz`.
Some half-baked ideas:
- we allow ignoring case in the template for some sensitive nodes (doesn't sound elegant)
- we add "path" to the set of allowed types and we do not touch paths
Yes, I think the most realistic use case is to allow both:
```
functional = b3lyp
functional = B3LYP
```
The keywords can always be set to follow a convention, like in MRChem:
- sections: capitalized camel-case
- keywords: lowercase with underscores
It's not that easy with values.
About the "NO" interpreted as falsey - we had a nasty bug in a web thingy where we were parsing country codes for Nordic projects and NO (Norway) was interpreted as false :-)
I think paths are the only case where we would never want to modify the case of the string. Are there any other examples? I feel adding a `Path` and a `List[Path]` type could be the elegant and easy way out of this:
- everything case-sensitive by default (except booleans and floats in scientific notation)
- case-sensitive with case normalization selected by the user. Paths must be given as quoted strings, and types must be declared accordingly. Case normalization is only done on values that are strings. At what level, I still have to figure out. It probably needs two passes: one before going into the lexer proper, where we read the file and normalize while skipping all singly and doubly quoted strings, and another when fixing defaults, to make sure all strings are normalized (see the sketch after this list).
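A minimal sketch of that first pass, under the assumption that quoted chunks are simply passed through untouched (hypothetical helper, not the actual lexer):

```python
import re

# Split into singly/doubly quoted chunks and everything else;
# only the unquoted chunks get case-normalized.
QUOTED = re.compile(r"""("[^"]*"|'[^']*')""")

def normalize_outside_quotes(text: str, to_upper: bool = False) -> str:
    """Lower- or uppercase everything except quoted strings."""
    def fix(chunk: str) -> str:
        if chunk and chunk[0] in "\"'":
            return chunk  # preserve quoted strings (paths, checksums, ...)
        return chunk.upper() if to_upper else chunk.lower()
    return "".join(fix(chunk) for chunk in QUOTED.split(text))

line = 'DataPath = "/home/user/doNotNormalizeMe"'
print(normalize_outside_quotes(line))  # datapath = "/home/user/doNotNormalizeMe"
```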
@bast I was bitten by the same two days ago (in the randomized testing of lists of unquoted strings). Should we only allow `true` and `false`? I was trying to reproduce the original grammar: https://github.com/dev-cafe/libgetkw/blob/master/Python/getkw.py#L757
Bonus of having the `Path` type: predicates to check that a file already exists are trivial: `from pathlib import Path; Path(value).exists()`
How about adding actions to the keywords, similar to predicates, so that you can get normalized values where you want them, like functionals and basis sets:
```yaml
keyword:
  - name: functional
    type: str
    predicates:
      - 'len(value) < 20'
    actions:
      - 'value = value.lower()'
```
A `Path` type still sounds like a good idea!
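For concreteness, a minimal sketch of how such an action string could be applied to a value, assuming (hypothetically) that actions are Python statements executed with the current value bound to `value`:

```python
def apply_actions(value, actions):
    """Run each action with the current value bound to `value` and return the result."""
    for action in actions:
        namespace = {"value": value}
        exec(action, {}, namespace)   # e.g. 'value = value.lower()'
        value = namespace["value"]
    return value

print(apply_actions("B3LYP", ["value = value.lower()"]))  # b3lyp
```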
So much power... I like the idea, but I don't think it scales well (I would need to do it for all keywords). I need to think about it more.
After some more thinking I still like the idea of a `Path` type. Strings and paths are really distinct types.
As to "NO": I would not miss yes/no. I am still unsure whether I would miss on/off.
Another thing that I would not like to case-normalize is checksums/hashes.
Let's split the `str` type into two: case-sensitive and case-insensitive. The latter would have the case normalization action on its value "embedded". The `Path` type is a valuable addition regardless of this discussion. I'll open a new issue (and possibly have a PR ready later this evening).
I like the suggestion of having two `str` types. This makes everything less surprising.
Name suggestions for the types? `istr` and `sstr`?
How about `str` and `str_ignorecase`?
I have written some preliminary design docs. Let me know if you agree. See also my comment on #55 as to why, contrary to my initial enthusiasm, we do not need/want a path type.
Case-sensitivity is a sensitive issue. It is desirable to have the following two examples be equally valid:
```
$ functional = b3lyp
$ Functional = B3LYP
```
We can ensure case-insensitive comparisons by normalizing the case of the input beforehand. The programmer would have the choice of normalizing to uppercase or lowercase. Hence, the above examples would both decay to either:
```
$ functional = b3lyp
```
when normalizing to lowercase, or:
```
$ FUNCTIONAL = B3LYP
```
when normalizing to uppercase. Note how normalization happens on both the left-hand side (section and keyword labels) and the right-hand side (values).
This means that the validation specifications in the `template.yml` file need to conform to whatever the chosen normalization is, both for keyword/section names and possible string defaults. Note that for booleans case normalization is not problematic: their type will be coerced upon reading in, be it through tokenization of the input file or loading of the `template.yml`.
There are, however, cases where normalizing the case is not desirable. Two notable examples are filesystem paths and checksums. For these cases, the case from input needs to be respected. Once again, note that the case of default values in the `template.yml` is completely up to the programmer. No normalization will ever be performed by the library.
We offer two string types:
- Case-sensitive: `str`. No case normalization will be performed for values of this type.
- Case-insensitive: `str_ignorecase`. For values of this type, the case will be normalized, either upper or lower, according to the request of the programmer. If no case normalization is requested, this string type decays to `str`.
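To illustrate the intended behaviour, a minimal sketch of the dispatch on the declared string type (hypothetical helper names; the normalization direction follows the programmer's choice):

```python
def coerce_string(value: str, declared_type: str, to_upper: bool = False) -> str:
    """Apply the declared string type: 'str' keeps the case, 'str_ignorecase' normalizes it."""
    if declared_type == "str_ignorecase":
        return value.upper() if to_upper else value.lower()
    return value  # plain 'str': case preserved

print(coerce_string("B3LYP", "str_ignorecase"))  # b3lyp
print(coerce_string("DeadBeefCAFE", "str"))      # DeadBeefCAFE (e.g. a checksum)
```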
Do we really need/want case-insensitive keywords and sections? Wouldn't it be more sensible to only normalize the `str_ignorecase` values and keep everything else as-is, instead of normalizing everything and keeping the case-sensitive `str`? It feels a bit strange that if you want to accept case-insensitive strings from the user you have to declare them as `str_ignorecase` and at the same time make all section and keyword names case-insensitive.
How about three types:
- `str`: case-sensitive
- `str_lower`: value will be normalized to lowercase
- `str_upper`: value will be normalized to uppercase
Thanks for the convincing arguments about `path`. Also, I agree that we should perhaps only allow ignorecase/lower/upper on the values but not the keys. The reason is that later the code wants to fish out the values using well-defined keys, and perhaps we don't want another degree of freedom there.

I like the suggestion with `str`, `str_lower`, `str_upper` - this way the code knows what it gets and does not need to normalize again inside just to be sure.
Discussion with Stig and Roberto:
- keys (in other words, keywords and section names) are normalized to lowercase and the dictionary is then accessed with lowercase keys (see the sketch below)
- values: we postpone this decision until we decide about actions later, since actions would allow `lower()` or `upper()`
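A minimal sketch of the agreed key normalization, with values left untouched for now (hypothetical helper, not the library's actual implementation):

```python
def lowercase_keys(tree):
    """Recursively lowercase section and keyword names; leave values untouched."""
    return {
        key.lower(): lowercase_keys(value) if isinstance(value, dict) else value
        for key, value in tree.items()
    }

parsed = {"Molecule": {"Functional": "B3LYP", "Basis_Set": "cc-pVDZ"}}
print(lowercase_keys(parsed))
# {'molecule': {'functional': 'B3LYP', 'basis_set': 'cc-pVDZ'}}
```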