Leading spaces in values are automatically stripped?
MerlijnWajer opened this issue · 3 comments
I ran into a problem parsing this file with xmltodict: https://archive.org/download/janus-34-scan-zapman/janus-34-scan-zapman_files.xml
The value of 'original' has its leading space stripped: it should be ' JANUS 34_Scan Zapman_chocr.html.gz', but it is turned into 'JANUS 34_Scan Zapman_chocr.html.gz'.
This is probably caused by the commit from this issue: #15
Given the above commit, it is not clear to me if there is any way to keep spaces inside an element in XML. Is there a way to disable this behaviour?
Here's the relevant part from the file linked above:
<file name=" JANUS 34_Scan Zapman_hocr.html" source="derivative">
<hocr_char_to_word_module_version>1.1.0</hocr_char_to_word_module_version>
<hocr_char_to_word_hocr_version>1.1.15</hocr_char_to_word_hocr_version>
<ocr_parameters>-l fra</ocr_parameters>
<ocr_module_version>0.0.18</ocr_module_version>
<ocr_detected_script>Latin</ocr_detected_script>
<ocr_detected_script_conf>0.4311</ocr_detected_script_conf>
<ocr_detected_lang>fr</ocr_detected_lang>
<ocr_detected_lang_conf>1.0000</ocr_detected_lang_conf>
<format>hOCR</format>
<original> JANUS 34_Scan Zapman_chocr.html.gz</original>
<mtime>1664638619</mtime>
<size>2140105</size>
<md5>1596964e7b6e5aee5e6faedc6d3cb47b</md5>
<crc32>b0c6226b</crc32>
<sha1>07eca05572e97b5abb66fcba4252956ada5f7b10</sha1>
</file>
Ah, I think I figured out the solution: there is a strip_whitespace=False argument that can be passed in the kwargs of parse; it just wasn't documented.
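For anyone else hitting this, here is a minimal sketch of the difference, assuming xmltodict is installed. The XML string is a cut-down, hand-written version of the snippet above (only the `<original>` element is kept, and the variable names are mine), written without whitespace between elements so that the leading space is the only whitespace in play.

```python
import xmltodict

# Cut-down version of the <file> element from the issue, kept on one line
# so there is no whitespace between elements to confuse the comparison.
xml = "<file><original> JANUS 34_Scan Zapman_chocr.html.gz</original></file>"

# Default behaviour: leading/trailing whitespace in text values is stripped.
default = xmltodict.parse(xml)

# With strip_whitespace=False the text value is kept as written.
preserved = xmltodict.parse(xml, strip_whitespace=False)

print(repr(default["file"]["original"]))
print(repr(preserved["file"]["original"]))
```

One thing to watch for: on a pretty-printed document, turning stripping off may also keep the indentation and newlines between elements, so it is worth checking the output on a real file before relying on it.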
Leaving this issue open for any discussion about the default behaviour; silently truncating data seems problematic to me.
@MerlijnWajer Thank you for bringing this up.