Case-insensitive parsing and validation.
Opened this issue · 30 comments
As pointed out by @stigrj it would be good to not force case-sensitivity upon the users. I see some options:
1. Do it before we parse. This would mean the programmer using the library has to normalize the case of the input file, and the validation template has to adhere to this choice. That is, if normalization is uppercase (lowercase), then the validation template uses uppercase (lowercase).
2. As point 1, but we add an option to `api.lex` (and expose it in the CLI). Something like `--case upper` (`--case lower`).
3. We let the grammar take care of case normalization, by adding a parse action to the various tokens. If we change the parsing library (from pyparsing to Lark, for example) this will not carry over.
4. We do case normalization at the validation level. I think this is the most invasive option of all.
I agree with 2.
Noting this here, as it might be helpful: https://docs.python.org/3/library/stdtypes.html#str.casefold
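For context, `str.casefold()` is a more aggressive variant of `str.lower()`, designed for caseless matching:

```python
# lower() and casefold() agree for ASCII, but casefold() also
# handles caseless matching beyond ASCII, e.g. the German sharp s.
print("B3LYP".lower())      # b3lyp
print("B3LYP".casefold())   # b3lyp
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```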
Does it make sense to you that the case of quoted strings should be preserved? The way Getkw works, you can have quoted and unquoted strings. The latter can only look like valid C user tokens (the first character can be an underscore or an alphabetic character, followed by more underscores and alphanumeric characters), whereas the former can be almost anything (Unicode too, I think). Most importantly, quoted strings can contain slashes, so this is the format one should use to specify file paths. I'd expect file paths to always be intended to be case-sensitive, hence my question.
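To illustrate the distinction, a minimal sketch of the unquoted-token rule described above (hypothetical helper; the actual grammar lives in the pyparsing code):

```python
import re

# First character: underscore or alphabetic; rest: underscores and alphanumerics.
UNQUOTED_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

def is_unquoted_string(value: str) -> bool:
    """Return True if value could be given without quotes."""
    return UNQUOTED_TOKEN.match(value) is not None

print(is_unquoted_string("doNotNormalizeMe"))             # True
print(is_unquoted_string("/home/user/doNotNormalizeMe"))  # False: slashes need quoting
```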
I agree that changing case of paths is dangerous. Also until now I did not realize that we wanted to possibly normalize everything - somehow I imagined that we only normalize keys but not values.
Yeah, it's thorny. Are all of these acceptable?
```
Functional = B3LYP
functional = b3lyp
FunCtioNAL = b3LYP
```
Every time I saw case insensitivity as a user (e.g. Fortran, the old OS X file system (?)), I thought: "wow, what a terrible idea!" - but I guess we can offer the possibility.
How will this work:
```
datapath1 = /home/user/doNotNormalizeMe
datapath2 = "/home/user/doNotNormalizeMe"
```
The current grammar will not be able to parse the former.
How about we only offer case-insensitive keys, but preserve the case of values?
How about this:
```
a_bool = True
another_bool = false
```
Is it defined in the grammar what to accept?
@stigrj All of these are valid for booleans as caseless literals already:
```python
truthy = ["TRUE", "ON", "YES", "Y"]
"""List[str]: List of true-like values."""
falsey = ["FALSE", "OFF", "NO", "N"]
"""List[str]: List of false-like values."""
```
@bast I think I would still like `b3lyp` to be as valid as `B3LYP` though. Regarding floating point numbers, only a caseless `E` is recognized now. `D` can be arranged too, but I'd rather not encourage using `D`...
Good point about the booleans; this is a frequent papercut. Also, floats in scientific notation should allow a case-insensitive "D" and "E".
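A minimal sketch of how both exponent markers could be accepted, case-insensitively (hypothetical helper, not the library's parser; Python's `float` only understands `e`/`E`):

```python
def parse_float(token: str) -> float:
    """Accept Fortran-style 'D' exponents as well as 'e'/'E'."""
    return float(token.upper().replace("D", "E"))

print(parse_float("1.0e-3"), parse_float("1.0E-3"))  # 0.001 0.001
print(parse_float("1.0d-3"), parse_float("1.0D-3"))  # 0.001 0.001
```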
Yes, the `b3lyp` example is convincing. I would also like to be able to type `cc-pvdz`.
Some half-baked ideas:
- we allow ignoring case in the template for some sensitive nodes (doesn't sound elegant)
- we add "path" to the set of allowed types and we do not touch paths
Yes, I think the most realistic use case is to allow both:
```
functional = b3lyp
functional = B3LYP
```
The keywords can always be set to follow a convention, like in MRChem:
- sections: capitalized camel-case
- keywords: lowercase with underscores
It's not that easy with values.
About the "NO" interpreted as falsey - we had a nasty bug in a web thingy where we were parsing country codes for Nordic projects and NO (Norway) was interpreted as false :-)
I think paths are the only case where we would never want to modify the case of the string. Are there any other examples? I feel adding a `Path` and a `List[Path]` type could be the elegant and easy way out of this:
- everything case-sensitive by default (except booleans and floats in scientific notation)
- case-sensitive with case normalization selected by the user. Paths must be given as quoted strings, and types must be declared accordingly. Case normalization is only done on values that are strings. At what level, I still have to figure out. It probably needs two passes: one before going into the lexer proper, where we read the file and normalize while skipping all singly and doubly quoted strings, and another when fixing defaults, to make sure all strings are normalized (see the sketch after this list).
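A minimal sketch of that first pass, under the assumption that quoted chunks are simply passed through untouched (hypothetical helper, not the actual lexer):

```python
import re

# Split into singly/doubly quoted chunks and everything else;
# only the unquoted chunks get case-normalized.
QUOTED = re.compile(r"""("[^"]*"|'[^']*')""")

def normalize_outside_quotes(text: str, to_upper: bool = False) -> str:
    """Lower- or uppercase everything except quoted strings."""
    def fix(chunk: str) -> str:
        if chunk and chunk[0] in "\"'":
            return chunk  # preserve quoted strings (paths, checksums, ...)
        return chunk.upper() if to_upper else chunk.lower()
    return "".join(fix(chunk) for chunk in QUOTED.split(text))

line = 'DataPath = "/home/user/doNotNormalizeMe"'
print(normalize_outside_quotes(line))  # datapath = "/home/user/doNotNormalizeMe"
```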
@bast I was bitten by the same two days ago (in the randomized testing of lists of unquoted strings). Should we only allow `true` and `false`? I was trying to reproduce the original grammar: https://github.com/dev-cafe/libgetkw/blob/master/Python/getkw.py#L757
Bonus of having the `Path` type: predicates to check that a file already exists are trivial: `from pathlib import Path; Path(value).exists()`
How about adding actions to the keywords, similar to predicates, so that you can get normalized values where you want them, like functionals and basis sets:
```yaml
keyword:
  - name: functional
    type: str
    predicates:
      - 'len(value) < 20'
    actions:
      - 'value = value.lower()'
```
A `Path` type still sounds like a good idea!
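For concreteness, a minimal sketch of how such an action string could be applied to a value, assuming (hypothetically) that actions are Python statements executed with the current value bound to `value`:

```python
def apply_actions(value, actions):
    """Run each action with the current value bound to `value` and return the result."""
    for action in actions:
        namespace = {"value": value}
        exec(action, {}, namespace)   # e.g. 'value = value.lower()'
        value = namespace["value"]
    return value

print(apply_actions("B3LYP", ["value = value.lower()"]))  # b3lyp
```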
So much power... I like the idea, but I don't think it scales well (I would need to do it for all keywords). I need to think about it more.
After some more thinking I still like the idea of a `Path` type. Strings and paths are really distinct types.
As to "NO": I would not miss yes/no. I am still unsure whether I would miss on/off.
Another thing that I would not like to case-normalize is checksums/hashes.
Let's split the `str` type into two: case-sensitive and case-insensitive. The latter would have the case normalization action on its value "embedded". The `Path` type is a valuable addition regardless of this discussion. I'll open a new issue (and possibly have a PR ready later this evening).
I like the suggestion of having two `str` types. This makes everything less surprising.
Name suggestions for the types? `istr` and `sstr`?
How about `str` and `str_ignorecase`?
I have written some preliminary design docs. Let me know if you agree. See also my comment on #55 as to why, contrary to my initial enthusiasm, we do not need/want a path type.
Case-sensitivity is a sensitive issue. It is desirable to have the following two examples be equally valid:
```
$ functional = b3lyp
$ Functional = B3LYP
```
We can ensure case-insensitive comparisons by normalizing the case of the input beforehand. The programmer would have the choice of normalizing to uppercase or lowercase. Hence, the above examples would both decay to either:
```
$ functional = b3lyp
```
when normalizing to lowercase, or:
```
$ FUNCTIONAL = B3LYP
```
when normalizing to uppercase. Note how normalization happens on both the left-hand side (section and keyword labels) and the right-hand side (values).
This means that the validation specifications in the `template.yml` file need to conform to whatever the chosen normalization is, both for keyword/section names and possible string defaults. Note that for booleans case normalization is not problematic: their type will be coerced upon reading in, be it through tokenization of the input file or loading of the `template.yml`.
There are, however, cases where normalizing the case is not desirable. Two notable examples are filesystem paths and checksums. For these cases, the case from input needs to be respected. Once again, note that the case of default values in the `template.yml` is completely up to the programmer. No normalization will ever be performed by the library.
We offer two string types:
- Case-sensitive: `str`. No case normalization will be performed for values of this type.
- Case-insensitive: `str_ignorecase`. For values of this type, the case will be normalized, either upper or lower, according to the request of the programmer. If no case normalization is requested, this string type decays to `str`.
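To illustrate the intended behaviour, a minimal sketch of the dispatch on the declared string type (hypothetical helper names; the normalization direction follows the programmer's choice):

```python
def coerce_string(value: str, declared_type: str, to_upper: bool = False) -> str:
    """Apply the declared string type: 'str' keeps the case, 'str_ignorecase' normalizes it."""
    if declared_type == "str_ignorecase":
        return value.upper() if to_upper else value.lower()
    return value  # plain 'str': case preserved

print(coerce_string("B3LYP", "str_ignorecase"))  # b3lyp
print(coerce_string("DeadBeefCAFE", "str"))      # DeadBeefCAFE (e.g. a checksum)
```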
Do we really need/want case-insensitive keywords and sections? Wouldn't it be more sensible to only normalize the `str_ignorecase` values and keep everything else as-is, instead of normalizing everything and keeping the case-sensitive `str`? It feels a bit strange that if you want to accept case-insensitive strings from the user you have to declare them as `str_ignorecase` and at the same time make all section and keyword names case-insensitive.
How about three types:
- `str`: case-sensitive
- `str_lower`: value will be normalized to lowercase
- `str_upper`: value will be normalized to uppercase
Thanks for the convincing arguments about `path`. Also, I agree that we should perhaps only allow ignorecase/lower/upper on the values but not the keys. The reason is that later the code wants to fish out the values using well-defined keys, and perhaps we don't want another degree of freedom there.

I like the suggestion with `str`, `str_lower`, `str_upper` - this way the code knows what it gets and does not need to normalize again inside just to be sure.
Discussion with Stig and Roberto:
- keys (in other words, keywords and section names) are normalized to lowercase and the dictionary is then accessed with lowercase keys (see the sketch below)
- values: we postpone this decision until we decide about actions later, since actions would allow `lower()` or `upper()`
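A minimal sketch of the agreed key normalization, with values left untouched for now (hypothetical helper, not the library's actual implementation):

```python
def lowercase_keys(tree):
    """Recursively lowercase section and keyword names; leave values untouched."""
    return {
        key.lower(): lowercase_keys(value) if isinstance(value, dict) else value
        for key, value in tree.items()
    }

parsed = {"Molecule": {"Functional": "B3LYP", "Basis_Set": "cc-pVDZ"}}
print(lowercase_keys(parsed))
# {'molecule': {'functional': 'B3LYP', 'basis_set': 'cc-pVDZ'}}
```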