jsfenfen/990-xml-reader

xmltodict error on windows / anaconda

Closed this issue · 2 comments

On Windows, xmltodict fails to parse the XML files the IRS publishes. This has been reproduced on Windows with Anaconda; it is unclear how many versions are affected.

import xmltodict
 
filepath = r"c:\.....    anaconda3\lib\site-packages\irsx\XML\201533089349301428_public.xml"
fh = open(filepath, 'r')
raw_file = fh.read()
raw_irs_dict = xmltodict.parse(raw_file)


Traceback (most recent call last):
  File "xmltodict_test.py", line 6, in <module>
    raw_irs_dict = xmltodict.parse(raw_file)
  File "C:\Users\eharv\Anaconda3\lib\site-packages\xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0

This isn't an issue on Linux / Mac, where the first character of the file is read as \ufeff (the byte order mark). In an Anaconda terminal on Windows, the same character shows up as: ï»¿ . Presumably the file is being decoded with a different default codec / encoding on Windows.
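To illustrate the suspected cause, here is a minimal sketch of how the same leading bytes decode under different codecs. The `<Return/>` document is a made-up stand-in, not a real IRS filing, and cp1252 is only an assumption about what the Windows default encoding was on the affected machine.

```python
import codecs

# UTF-8 byte order mark followed by a tiny, made-up XML document.
raw_bytes = codecs.BOM_UTF8 + b"<Return/>"

# Decoded as UTF-8, the BOM survives as a single character, U+FEFF.
as_utf8 = raw_bytes.decode("utf-8")

# Decoded as cp1252 (assumed Windows default), the three BOM bytes
# turn into three ordinary characters in front of the "<".
as_cp1252 = raw_bytes.decode("cp1252")

# Decoded as utf-8-sig, the BOM is dropped entirely.
as_utf8_sig = raw_bytes.decode("utf-8-sig")

print(repr(as_utf8[0]))
print(repr(as_cp1252[:3]))
print(repr(as_utf8_sig))
```

If the file is decoded the cp1252 way, the parser sees stray text before the first tag, which would be consistent with a syntax error at line 1, column 0.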

It seems like this can be fixed by opening the file with encoding 'utf-8-sig', which strips a leading byte order mark if one is present: https://docs.python.org/3/library/codecs.html#encodings-and-unicode
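A minimal sketch of that workaround follows. The helper name `read_xml_text` is hypothetical, and this is not necessarily what the closing commit does.

```python
def read_xml_text(path):
    """Read an XML file as text, dropping a leading UTF-8 BOM if present.

    Hypothetical helper for illustration. 'utf-8-sig' decodes plain UTF-8
    as usual and only removes the BOM when the file starts with one.
    """
    with open(path, "r", encoding="utf-8-sig") as fh:
        return fh.read()


# Intended use with the snippet from the report (requires xmltodict):
#   raw_irs_dict = xmltodict.parse(read_xml_text(filepath))
```

Passing the encoding explicitly also means the result no longer depends on the platform's default codec, so the same call should behave the same on Windows, Linux and Mac.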

closed in 96fab62