ecederstrand/exchangelib

Attachment names with characters not supported by XML cause a SAXParseException

amshamah419 opened this issue · 2 comments

Describe the bug
When accessing the content attribute of a FileAttachment object, if the XML returned by EWS contains a character that XML does not support, such as the following:




Then upon loading the FileIO into memory as a stream, the XML fails to parse and throws an exception.

I believe the line causing the error is here in exchangelib/util.py:

    def feed(self, data, isFinal=0):
        """Yield the current content of the character buffer."""
        DefusedExpatParser.feed(self, data=data, isFinal=isFinal)
        return self._decode_buffer()

The thrown error is here:

    Error: Got an error entry for fetch incidents [Stderr: Process Process-1:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/xml/sax/expatreader.py", line 217, in feed
        self._parser.Parse(data, isFinal)
    xml.parsers.expat.ExpatError: reference to invalid character number: line 1, column 1062
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
        self.run()
      File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "<string>", line 14044, in process_main
      File "<string>", line 13931, in sub_main
      File "<string>", line 13733, in fetch_emails_as_incidents
      File "<string>", line 13566, in parse_incident_from_item
      File "/usr/local/lib/python3.10/site-packages/exchangelib/attachments.py", line 154, in content
        self._content = fp.read()
      File "/usr/local/lib/python3.10/site-packages/exchangelib/attachments.py", line 258, in readinto
        chunk = self._overflow or next(self._stream)
      File "/usr/local/lib/python3.10/site-packages/exchangelib/services/get_attachment.py", line 104, in stream_file_content
        yield from self._get_response_xml(payload=payload, stream_file_content=True)
      File "/usr/local/lib/python3.10/site-packages/exchangelib/util.py", line 366, in parse
        yield from self.feed(buffer)
      File "/usr/local/lib/python3.10/site-packages/exchangelib/util.py", line 377, in feed
        DefusedExpatParser.feed(self, data=data, isFinal=isFinal)
      File "/usr/local/lib/python3.10/xml/sax/expatreader.py", line 221, in feed
        self._err_handler.fatalError(exc)
      File "/usr/local/lib/python3.10/xml/sax/handler.py", line 38, in fatalError
        raise exception
    xml.sax._exceptions.SAXParseException: <unknown>:1:1062: reference to invalid character number
    ] (66)
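
For anyone who wants to see the same error message without an Exchange server, here is a minimal sketch using only the standard library. It assumes the offending content is a numeric character reference to a code point that XML 1.0 forbids; the `<Name>` payload and the choice of U+0002 are made up for illustration and are not taken from the real response.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical payload: U+0002 is not a legal XML 1.0 character,
# so expat rejects a numeric reference to it.
payload = b"<Name>report&#x2;.msg</Name>"

message = None
try:
    xml.sax.parseString(payload, ContentHandler())
except xml.sax.SAXParseException as exc:
    # Same exception type as at the bottom of the traceback above.
    message = exc.getMessage()
    print("Parse failed:", message)
```

This goes through the plain expat reader rather than the DefusedExpatParser subclass used in exchangelib/util.py, so it only illustrates the underlying parser behaviour.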

To Reproduce
We are an open source repo and the full code can be seen here if needed - https://github.com/demisto/content/blob/master/Packs/MicrosoftExchangeOnline/Integrations/EWSO365/EWSO365.py#L2092

This is more or less an abbreviated version of the code we are using to load the attachment:

    import xml.sax

    from exchangelib import FileAttachment

    # item is a message fetched earlier; get_attachment_name is defined
    # elsewhere in our integration
    if item.attachments:
        for attachment in item.attachments:
            if isinstance(attachment, FileAttachment):
                try:
                    # accessing .content is what triggers the download and XML parsing
                    if attachment.content:
                        # file attachment
                        label_attachment_type = "attachments"
                        label_attachment_id_type = "attachmentId"

                        # save the attachment
                        file_name = get_attachment_name(attachment.name)

                except TypeError as e:
                    if str(e) != "must be string or buffer, not None":
                        raise
                    continue
                except xml.sax.SAXParseException as e:
                    print("Error during XML parsing:")
                    print("Message:", e.getMessage())
                    continue

I cannot include the .msg file, but I can include a sample XML file which reproduces the issue -
data.xml.txt

Expected behavior
It really depends on how the library should handle issues like this. If it's something that Microsoft would normally correct when sending the XML (doubtful), then we should raise the issue there and handle the error better here. If not, then we could try stripping control characters from the XML before attempting to parse it.

Log output
I can provide debug logs if I am able to redact the sensitive info.

Additional context
Python 3.10
exchangelib==5.0.3, but it also reproduces in earlier versions.

Congratulations on breaking the XML parser! That's not an easy task😃

I think we should be able to handle this gracefully in exchangelib. I'll just need some time to write a test case and come up with a fix.

> Congratulations on breaking the XML parser! That's not an easy task😃


That's pretty much been my experience with Microsoft Exchange lately with them deprecating RPS.

> I think we should be able to handle this gracefully in exchangelib. I'll just need some time to write a test case and come up with a fix.

If there is anything you need, feel free to let me know. I have a replicating environment and can try sending a similar email if you need. In the meantime, I'll just catch and log the error 😄