Whitespace and newlines in coordinate sections between content and $end are not stripped
- parselglossy version: 0.7.0
- Python version: 3.9.2
- Operating System: MacOS
- Context: MRChem
Description
Upon parsing coordinate sections such as those used when specifying atomic coordinates
$coords
He 0.0 0.0 0.0
$end
or solvation cavity spheres
$spheres
0.0 0.0 0.0 4.0
$end
the parser ignores all whitespace and newlines between $start and the actual content, but does not ignore whitespace and newlines between the content and $end. As a result, the following sections are not parsed identically:
$coords
He 0.0 0.0 0.0
$end
$coords
He 0.0 0.0 0.0
 $end
$coords
He 0.0 0.0 0.0$end
These result in the following strings, respectively
"He 0.0 0.0 0.0\n"
"He 0.0 0.0 0.0\n "
"He 0.0 0.0 0.0"
The expected output for all is (at least to me) the last one. This could become a bit problematic when the user indents these sections (very common to do), and some type of sanity checking is performed on the data. Consider the following
lines = user_dict['Molecule']['coords'].splitlines()
print(lines)
which, for the three examples, results in
["He 0.0 0.0 0.0"]
["He 0.0 0.0 0.0", " "]
["He 0.0 0.0 0.0"]
The middle example has resulted in a whitespace-only list element. Calling strip() before splitting fixes the issue, but parselglossy should probably strip all extra whitespace under the hood.
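For reference, a minimal sketch of the user-side workaround, using only the three raw strings quoted above. The helper name clean_lines is hypothetical and not part of parselglossy; it strips each line and drops the ones that end up empty.

```python
def clean_lines(raw: str) -> list[str]:
    """Split a raw $coords/$end payload into stripped, non-empty lines.

    Hypothetical helper for illustration only; not a parselglossy function.
    """
    return [line.strip() for line in raw.splitlines() if line.strip()]


# The three strings reported in this issue.
raw_variants = [
    "He 0.0 0.0 0.0\n",
    "He 0.0 0.0 0.0\n ",
    "He 0.0 0.0 0.0",
]

for raw in raw_variants:
    print(clean_lines(raw))  # ['He 0.0 0.0 0.0'] in all three cases
```

Filtering per line rather than stripping the whole string once also covers blank or indented lines in the middle of a section, which may or may not be what a given sanity check wants.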
TL;DR: this is so by design. The "contract" between parselglossy and its users is that parselglossy won't touch what's between $<name>/$end.
This is where and how the parsing token is defined for those kinds of parameters: https://github.com/dev-cafe/parselglossy/blob/master/parselglossy/grammars/atoms.py#L89-L95
I cannot find the issue where we discussed this (it might be in a thread on some Zulip channel), but the $<name>/$end parameters are by design escape hatches to pass untyped information verbatim past the input parser and into the final dictionary. The idea was to keep the grammar simple and avoid type-checking for things that the developers using parselglossy know how to read better than we could. Preserving indentation might be one of the use cases for this: it is a weird requirement in the context of parsing molecular geometries, but it might be essential somewhere else.
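To make the "verbatim" contract concrete, here is a rough sketch that mimics the reported behaviour with a plain regular expression. It is not the parselglossy grammar (that lives in the atoms.py file linked above); the names RAW_BLOCK and extract_raw_blocks are made up for illustration. Whitespace after $<name> is skipped, and everything up to $end is captured untouched.

```python
import re

# Hypothetical stand-in for the real grammar: skip whitespace after $<name>,
# then lazily capture everything up to the next $end without modifying it.
RAW_BLOCK = re.compile(r"\$(?P<name>\w+)\s*(?P<body>.*?)\$end", re.DOTALL)


def extract_raw_blocks(text: str) -> dict[str, str]:
    """Return a {name: raw body} mapping for each $<name> ... $end block."""
    return {m["name"]: m["body"] for m in RAW_BLOCK.finditer(text)}


blocks = extract_raw_blocks("$coords\n  He 0.0 0.0 0.0\n  $end")
print(repr(blocks["coords"]))  # 'He 0.0 0.0 0.0\n  '
```

Under this reading, trailing whitespace before $end is part of the payload, so any cleanup belongs to the code that consumes the final dictionary.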