martinblech/xmltodict

trailing space is stripped from CDATA

esnosy opened this issue · 2 comments

esnosy commented

For example, take this XML:

<parent>
        <element><![CDATA[data    ]]></element>  
</parent>

Let's parse it:

import xmltodict

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""
parsed_xml = xmltodict.parse(xml)
print(repr(parsed_xml['parent']['element']))

Result:
'data'

Expected result (the four trailing spaces inside the CDATA section preserved):
'data    '

The untangle library parses it correctly: https://pypi.org/project/untangle/

import untangle

obj = untangle.parse(xml)
print(repr(obj.parent.element.cdata))

Result:
'data    '
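
As an independent cross-check of the expected value, the standard library's xml.etree.ElementTree can be used on the same input. This is only a sketch to confirm what the XML itself contains; it is not part of xmltodict or untangle.

```python
import xml.etree.ElementTree as ET

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""

# ElementTree does not strip character data, so the four trailing
# spaces inside the CDATA section are kept.
root = ET.fromstring(xml)
element_text = root.find("element").text
print(repr(element_text))
```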

Yes, I have the same problem.

I ran into the same problem. The solution is to pass strip_whitespace=False as a keyword argument to xmltodict.parse(). For the example above, this should do the trick:

import xmltodict

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""
parsed_xml = xmltodict.parse(xml, strip_whitespace=False)
print(repr(parsed_xml['parent']['element']))

I discovered this after turning on debugging mode and stepping through the code. It would be nice if xmltodict's user documentation were more thorough, so users don't have to dig into the code to find this option.
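
For anyone landing here later, the effect of strip_whitespace can be seen side by side with a minimal sketch like the following. It uses only the xmltodict.parse() call and the strip_whitespace argument already mentioned in this thread; the variable names are just for illustration.

```python
import xmltodict

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""

# Default behaviour: text content is stripped of surrounding whitespace.
stripped = xmltodict.parse(xml)["parent"]["element"]

# With stripping disabled, the CDATA content is returned as written.
preserved = xmltodict.parse(xml, strip_whitespace=False)["parent"]["element"]

print(repr(stripped))
print(repr(preserved))
```

One caveat, stated as an assumption rather than documented behaviour: with stripping disabled, whitespace-only text between child elements (such as the newlines and indentation inside parent) may also be kept in the result, so it is worth inspecting the full parsed structure before relying on it.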