tobywf/xml_dataclasses

XML comments are reported as undeclared child elements

masak opened this issue · 3 comments

masak commented

With this example XML:

<?xml version="1.0"?>
<person version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:person">
    <name>George Washington</name>
    <!-- I am a comment -->
    <phone>18123456789</phone>

    <address
        street-address="19 Pennsylvania Avenue"
        city="Austin"
        state="Texas"
        country="USA"
    />
</person>

Reading it with the appropriate xml dataclass declarations and the following loading code:

parser = etree.XMLParser(remove_blank_text=True)
lxml_el_in = etree.parse("example.xml", parser).getroot()
example = load(Person, lxml_el_in, "person")

Produces this runtime error:

Traceback (most recent call last):
  File "example-test.py", line 73, in <module>
    example = load(Person, lxml_el_in, "person")
  File "C:\Users\m00537252\AppData\Local\Programs\Python\Python38-32\lib\site-pa
ckages\xml_dataclasses\serde.py", line 155, in load
    child_values = _load_children(cls, el)
  File "C:\Users\m00537252\AppData\Local\Programs\Python\Python38-32\lib\site-pa
ckages\xml_dataclasses\serde.py", line 126, in _load_children
    raise ValueError(f"Found undeclared child elements on '{el.tag}': {readable}
")
ValueError: Found undeclared child elements on '{urn:oasis:names:tc:opendocument
:xmlns:person}person': '<cyfunction Comment at 0x01B9EB88>'

Sneaking a peek at that place in the module code, it seems that it counts the comment among its child elements, and then flags an error because that child element was not processed. (I would expect a comment not to need processing.)

On trying to skim the etree documentation for an option to exclude XML comments from parsing, I was left with the impression that they shouldn't really have been included in the first place.

That is a result of the decision to make "deserialisation" strict (as detailed in the README), and not considering comments as special. Luckily, just as with whitespace (remove_blank_text), lxml can easily strip comments with the remove_comments option. I like that approach for both simplicity (lxml is good at parsing) and explicitness (making it obvious to readers of the code that comments are discarded). I hope that's an easy solution? Admittedly, I should probably call that out in the README...

masak commented

Thank you for quick feedback, @tobywf. 👍 I forgot to add that I'm a satisfied user of xml-dataclasses, and have gotten quite a lot of mileage out of it, even though it's still in early development.

Luckily, just as with whitespace (remove_blank_text), lxml can easily strip comments with the remove_comments option.

This worked. Even after actively looking for it, I somehow missed that option — should have thought to look in lxml.

Admittedly, I should probably call that out in the README...

I took a stab at it #5 — see if the phrasing works for you.

That is a result of the decision to make "deserialisation" strict (as detailed in the README), and not considering comments as special.

I think I understand this design rationale. That one is yours to make, and I won't argue with it.

One lingering concern is that the error message is technically wrong: it says "Found undeclared child elements", but comments are not elements. (They are Nodes, and part of the node tree, but Comment is not a subtype of Element.) Mis-classifying comments as elements makes it seem like a mistake that the code hiccups on comments (even taking the above rationale into account).

I'm not against lenient loading, as long as it's expected. I just started with strict, and it's a long outstanding issue (#3) - I haven't had much time or need to work on it. I would like to add options to load, since it could also unlock type conversions in the future (so not everything has to be a string).

One lingering concern is that the error message is technically wrong: it says "Found undeclared child elements", but comments are not elements. (They are Nodes, and part of the node tree, but Comment is not a subtype of Element.) Mis-classifying comments as elements makes it seem like a mistake that the code hiccups on comments (even taking the above rationale into account).

That's a fair point. It currently get swept up in the same logic by accident.