martinblech/xmltodict

Leading spaces in values are automatically stripped?

MerlijnWajer opened this issue · 3 comments

I ran into a problem parsing this file with xmltodict: https://archive.org/download/janus-34-scan-zapman/janus-34-scan-zapman_files.xml

The value of 'original' has its leading space stripped: it should be ' JANUS 34_Scan Zapman_chocr.html.gz', but it is turned into 'JANUS 34_Scan Zapman_chocr.html.gz'.

This is probably caused by the commit from this issue: #15

Given the above commit, it is not clear to me if there is any way to keep spaces inside an element in XML. Is there a way to disable this behaviour?

Here's the relevant part from the file linked above:

<file name=" JANUS 34_Scan Zapman_hocr.html" source="derivative">
<hocr_char_to_word_module_version>1.1.0</hocr_char_to_word_module_version>
<hocr_char_to_word_hocr_version>1.1.15</hocr_char_to_word_hocr_version>
<ocr_parameters>-l fra</ocr_parameters>
<ocr_module_version>0.0.18</ocr_module_version>
<ocr_detected_script>Latin</ocr_detected_script>
<ocr_detected_script_conf>0.4311</ocr_detected_script_conf>
<ocr_detected_lang>fr</ocr_detected_lang>
<ocr_detected_lang_conf>1.0000</ocr_detected_lang_conf>
<format>hOCR</format>
<original> JANUS 34_Scan Zapman_chocr.html.gz</original>
<mtime>1664638619</mtime>
<size>2140105</size>
<md5>1596964e7b6e5aee5e6faedc6d3cb47b</md5>
<crc32>b0c6226b</crc32>
<sha1>07eca05572e97b5abb66fcba4252956ada5f7b10</sha1>
</file>
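
For comparison, here is a minimal sketch of what I would expect to get back, using only the standard library's `xml.etree.ElementTree` instead of xmltodict. The one-line document is a trimmed stand-in for the excerpt above, not the full file:

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the excerpt above: only the affected element is kept.
xml = "<file><original> JANUS 34_Scan Zapman_chocr.html.gz</original></file>"

# ElementTree returns the element text as written, leading space included.
original = ET.fromstring(xml).find("original").text
print(repr(original))
```

This only shows that the leading space is part of the element's character data as another parser reports it; it says nothing about how xmltodict handles it internally.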

Ah, I think I figured out the solution: parse accepts a strip_whitespace keyword argument, and passing strip_whitespace=False keeps the leading space. It just wasn't documented.
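
A minimal sketch of the difference, assuming xmltodict is installed. The one-line document is a trimmed stand-in for the excerpt above, and the variable names are only for illustration:

```python
import xmltodict

# Trimmed, single-line stand-in for the excerpt from the report.
xml = "<file><original> JANUS 34_Scan Zapman_chocr.html.gz</original></file>"

# Default call: whitespace around the element text is stripped.
default = xmltodict.parse(xml)["file"]["original"]

# Same call with strip_whitespace=False: the text is kept as written.
preserved = xmltodict.parse(xml, strip_whitespace=False)["file"]["original"]

print(repr(default))
print(repr(preserved))
```

One thing to watch for: with strip_whitespace=False, the newlines and indentation between child elements of a pretty-printed file are no longer dropped either, so they may show up as extra text (under the '#text' key) on the parent element.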

Leaving this issue open for discussion about the default behaviour; silently truncating data seems problematic to me.

@MerlijnWajer Thank you for bringing this up.