PCMDI/cmip6-cmor-tables

Regex in CMIP6_CV.json to test `*_index` attributes

neumannd opened this issue · 1 comments

The CMIP6_CV.json contains regular expressions to test the global attributes physics_index, initialization_index, forcing_index and realization_index for correctness. These global attributes should be integers (CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CV’s). Therefore, the CMOR PrePARE.py script just checks the type of these attributes and does not use the regexes in CMIP6_CV.json.

However, the regular expression provided in CMIP6_CV.json seems to allow an arbitrary number of `[` in front of and `]` behind the integer. I don't understand why this is done. This seems to contradict CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CV’s.

Evaluation of the regular expression

In the CMIP6_CV.json the regex for testing the *_index attributes is written as:

^\\[\\{0,\\}[[:digit:]]\\{1,\\}\\]\\{0,\\}$

The first \ of each \\ escapes the second \. That's clear. Without escapes we have

^\[\{0,\}[[:digit:]]\{1,\}\]\{0,\}$
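The unescaping step can be checked with a short Python sketch. The one-key JSON document and the key layout below are assumptions made for illustration; the real CMIP6_CV.json may nest the pattern differently.

```python
import json

# Hypothetical one-key JSON document; only the pattern string itself is
# taken from the issue text, the surrounding structure is an assumption.
raw = r'{"realization_index": "^\\[\\{0,\\}[[:digit:]]\\{1,\\}\\]\\{0,\\}$"}'

# json.loads turns every JSON-escaped "\\" into a single backslash.
pattern = json.loads(raw)["realization_index"]
print(pattern)
```

The printed string is the pattern with single backslashes, which is what a regex engine would actually receive.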

I assume that we have a POSIX Basic Regular Expression. That means that \[ and \] are taken literally. \{n,\} is interpreted as: "the character or bracket expression to the left of this expression may appear n to infinite times". The ^ and $ are the beginning and end of a line, respectively. Thus, we have

^                 : beginning of the line
\[\{0,\}          : `[` appears zero to infinite times
[[:digit:]]\{1,\} : a digit between `0` and `9` appears one to infinite times
\]\{0,\}          : `]` appears zero to infinite times
$                 : end of the line

These values would be matched by the regular expression:

1
123
42
53253262

But these values would also be matched by the regular expression:

[1435]
[[123]]
[[123]
[123]]
[123]]]]]]]]]
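The behaviour described above can be reproduced with a small Python sketch. Python's `re` module supports neither POSIX BRE syntax nor `[[:digit:]]`, so the pattern below is a hand translation and therefore an assumption: `\[\{0,\}` becomes `\[*`, `[[:digit:]]\{1,\}` becomes `[0-9]+`, and `\]\{0,\}` becomes `\]*`. The function name is hypothetical.

```python
import re

# Hand-translated equivalent of the BRE from CMIP6_CV.json (assumption):
# any number of "[", at least one ASCII digit, any number of "]".
CV_PATTERN = re.compile(r"\[*[0-9]+\]*")


def matches_cv_pattern(value: str) -> bool:
    """Return True if the whole string matches the translated CV pattern."""
    return CV_PATTERN.fullmatch(value) is not None


plain = ["1", "123", "42", "53253262"]
bracketed = ["[1435]", "[[123]]", "[[123]", "[123]]", "[123]]]]]]]]]"]

for value in plain + bracketed:
    print(f"{value!r}: {matches_cv_pattern(value)}")
```

Every value in both lists is accepted, including the ones with unbalanced brackets.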

I would have expected this regular expression

^[[:digit:]]\\{1,\\}$

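or one of the following equivalents (the first is still a Basic Regular Expression, JSON-escaped as above; the two `+` forms need POSIX Extended Regular Expressions, because `+` is an ordinary character in a BRE):

^[0-9]\\{1,\\}$
^[[:digit:]]+$
^[0-9]+$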
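As a minimal sketch of the stricter check I would have expected, here is a digits-only test in Python. The pattern is written in Python's `re` syntax rather than POSIX syntax, and the function name is hypothetical; it does not reflect how CMOR or PrePARE.py actually validate these attributes.

```python
import re

# Digits-only pattern in Python syntax: one or more ASCII digits.
# [0-9] is used instead of \d, which would also accept non-ASCII digits.
STRICT_PATTERN = re.compile(r"[0-9]+")


def is_plain_index(value: str) -> bool:
    """Return True only if the whole string consists of ASCII digits."""
    return STRICT_PATTERN.fullmatch(value) is not None


for value in ["1", "53253262", "[1435]", "[[123]", "[123]]"]:
    print(f"{value!r}: {is_plain_index(value)}")
```

With this pattern the plain integers are accepted and every bracketed value is rejected.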

Or is this something that should be mentioned in https://github.com/PCMDI/cmor/issues/256?