trailing space is stripped from CDATA
esnosy opened this issue · 2 comments
esnosy commented
For example, take this XML:
<parent>
<element><![CDATA[data ]]></element>
</parent>
Parse it with xmltodict:
import xmltodict
xml = """
<parent>
<element><![CDATA[data ]]></element>
</parent>
"""
parsed_xml = xmltodict.parse(xml)
print(repr(parsed_xml['parent']['element']))
Result:
'data'
Expected result:
'data '
The untangle library (https://pypi.org/project/untangle/) parses the same XML correctly:
import untangle
obj = untangle.parse(xml)
print(repr(obj.parent.element.cdata))
Result:
'data '
ibrahelsheikh commented
Yes, I have the same problem.
afbwilliam commented
I encountered the same problem. The solution is to pass strip_whitespace=False as an optional keyword argument to xmltodict.parse(). For the example above, this should do the trick:
import xmltodict
xml = """
<parent>
<element><![CDATA[data ]]></element>
</parent>
"""
parsed_xml = xmltodict.parse(xml, strip_whitespace=False)
print(repr(parsed_xml['parent']['element']))
I discovered this after turning on debugging mode and stepping through the code. It would be nice if xmltodict's user documentation were more thorough, so users don't have to dig into the code to figure this out.
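To make the difference easier to see, here is a minimal sketch that parses the same document twice, once with the defaults and once with strip_whitespace=False. The helper name element_text is hypothetical and exists only for this illustration; it is not part of xmltodict.

```python
import xmltodict

XML = """
<parent>
<element><![CDATA[data ]]></element>
</parent>
"""


def element_text(xml, **parse_kwargs):
    """Hypothetical helper: return the text of <parent><element>."""
    return xmltodict.parse(xml, **parse_kwargs)["parent"]["element"]


# Default behaviour: text is stripped, so the trailing space is lost.
stripped = element_text(XML)

# With stripping disabled, the CDATA content is kept as written.
preserved = element_text(XML, strip_whitespace=False)

print(repr(stripped))   # 'data'
print(repr(preserved))  # 'data '
```

One thing to watch for: the option applies to the whole document, not just CDATA sections, so whitespace-only text between elements (such as the newlines around <element> here) may also be kept in the parent's result instead of being dropped. If only a few fields need their whitespace preserved, it may be worth checking the rest of the parsed structure before switching the option on everywhere.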