convertAttachments command returning "org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared"
Description
When running the "convertAttachments" command with the dataloader, some files (usually .rtf files) fail with the error "org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared". The affected files open normally and are not corrupt.
The same error also occurs with loadAttachments, though less frequently.
Steps to Reproduce
- Convert attachments through the dataloader
Expected behavior:
A converted HTML version of the file is created in "convertedAttachments"
Actual behavior:
An error is returned
Reproduces how often:
Depends on the client. With our most recent client, about 15% of .rtf files fail.
Additional Information
- The dataloader.properties file (minus login info): dataloader (1).properties.txt
- The CSV input file (smallest possible file that reproduces the issue): candidatefiles-20210728w3.txt
- The results file(s): candidatefiles-20210728w3_convertAttachments_2021-07-28_13.48.49_failure.txt
- The log file: dataloader_2021-07-28_13.48.49.log
Sorry you're experiencing this. The convertAttachments command uses Apache Tika (https://tika.apache.org/) to parse the files and convert them to HTML. A newer version of Tika is available than the one Data Loader currently uses, and it may include features that work around the problem in these files. I will include a Tika upgrade in the next version of Data Loader; that may allow files with the missing XML info to be parsed.
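For anyone trying to make sense of the message itself: one plausible reading (an assumption, not something confirmed in this thread) is that the HTML serializer received an element in the XHTML namespace before any prefix mapping for that namespace had been declared. The sketch below is not Tika or Data Loader code. It uses Python's standard-library SAX serializer, and the helper name `serialize_html_element` is made up for illustration. Python reports the missing mapping as a `KeyError` rather than a `SAXException`, but the underlying rule is the same.

```python
import io
from xml.sax.saxutils import XMLGenerator

XHTML_NS = "http://www.w3.org/1999/xhtml"


def serialize_html_element(declare_namespace: bool) -> str:
    """Serialize one empty <html> element in the XHTML namespace via SAX events.

    Hypothetical helper for illustration only; not part of Tika or Data Loader.
    """
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    if declare_namespace:
        # Bind the XHTML namespace as the default namespace (no prefix).
        gen.startPrefixMapping(None, XHTML_NS)
    # Without the mapping above, the serializer cannot resolve the namespace
    # and raises KeyError here.
    gen.startElementNS((XHTML_NS, "html"), "html", {})
    gen.endElementNS((XHTML_NS, "html"), "html")
    if declare_namespace:
        gen.endPrefixMapping(None)
    gen.endDocument()
    return out.getvalue()


if __name__ == "__main__":
    print(serialize_html_element(True))
    try:
        serialize_html_element(False)
    except KeyError as exc:
        print("namespace not declared:", exc)
```

If this reading is right, the failing files are probably ones where the parser emits content before the document start event that would normally declare the namespace, which would fit the files themselves opening fine in other applications.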