trailing space is stripped from CDATA

Question

esnosy opened this issue a year ago · 2 comments

For example, take this XML:

<parent>
        <element><![CDATA[data    ]]></element>  
</parent>

Let's parse it:

import xmltodict

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""
parsed_xml = xmltodict.parse(xml)
print(repr(parsed_xml['parent']['element']))

result:
'data'

expected result:
'data '

The untangle library (https://pypi.org/project/untangle/) parses it correctly:

import untangle

obj = untangle.parse(xml)
print(repr(obj.parent.element.cdata))

result:
'data '

Answer 1 · 2023-11-14T13:13:05.000Z

Yes, I have the same problem.

Answer 2 · 2024-08-01T13:39:05.000Z

I ran into the same problem. The fix is to pass strip_whitespace=False as an optional argument to xmltodict.parse(). For the example above, this should do the trick:

import xmltodict

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""
parsed_xml = xmltodict.parse(xml, strip_whitespace=False)
print(repr(parsed_xml['parent']['element']))

I discovered this after turning on debugging mode and stepping through the code. It would be nice if xmltodict's user documentation were more thorough, so that users don't have to dig into the code to find this option.
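
For anyone who would rather not step through the library to see what the option changes, here is a rough standard-library sketch of the behaviour. It uses xml.parsers.expat (the parser xmltodict builds on) and a hypothetical helper, element_texts, written only for this illustration. It is not xmltodict's actual code; it only assumes that stripping is applied to each element's accumulated text when the element closes.

```python
from xml.parsers import expat

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""


def element_texts(xml_text, strip_whitespace=True):
    """Return {element name: text} for every element that has text.

    Hypothetical helper for illustration only, not part of xmltodict.
    """
    texts = {}   # element name -> accumulated text
    stack = []   # one (name, chunks) entry per currently open element

    def start(name, attrs):
        stack.append((name, []))

    def chars(data):
        # expat may deliver a single text node in several chunks,
        # and CDATA content arrives through this same handler
        if stack:
            stack[-1][1].append(data)

    def end(name):
        _, chunks = stack.pop()
        text = "".join(chunks)
        if strip_whitespace:
            # this is the step that eats the trailing spaces in the CDATA
            text = text.strip()
        if text:
            texts[name] = text

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return texts


print(repr(element_texts(xml)["element"]))                          # 'data'
print(repr(element_texts(xml, strip_whitespace=False)["element"]))  # 'data    '
```

In this sketch, turning stripping off also keeps the whitespace-only text between <parent> and <element>, so it may be worth checking whether strip_whitespace=False adds similar whitespace-only entries to your xmltodict output.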