bullhorn/dataloader

convertAttachments command returning "org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared"


Description

When running the "convertAttachments" command with the dataloader, some files (usually .rtf files) fail with the error "org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared". The affected files open normally and are not corrupt.

The same error occurs, less frequently, with loadAttachments.

Steps to Reproduce

  1. Convert attachments through the dataloader

Expected behavior:

A converted HTML version of the file is created in "convertedAttachments".

Actual behavior:

The SAXException quoted above is returned instead.

Reproduces how often:

Depends on the client. For .rtf files with our most recent client, about 15% fail.
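
To put a number on the failure rate per file type, something like the following could be used. This is only a minimal sketch: the `Converter` interface is a hypothetical stand-in for whatever performs the conversion (it is not part of Data Loader or Tika), and the paths in `main` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import org.xml.sax.SAXException;

public class ConversionFailureTally {

    /** Hypothetical stand-in for whatever converts one file to HTML. */
    @FunctionalInterface
    public interface Converter {
        void convert(String path) throws SAXException;
    }

    /** Attempt and failure counts for a single file extension. */
    public static final class Stats {
        public int attempted;
        public int failed;

        public double failureRate() {
            return attempted == 0 ? 0.0 : (double) failed / attempted;
        }
    }

    /** Returns the lower-cased extension of the file name, or "(none)". */
    static String extensionOf(String path) {
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        int dot = path.lastIndexOf('.');
        if (dot <= slash || dot == path.length() - 1) {
            return "(none)";
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Runs the converter on every path and counts SAXException failures per extension. */
    public static Map<String, Stats> tally(List<String> paths, Converter converter) {
        Map<String, Stats> byExtension = new TreeMap<>();
        for (String path : paths) {
            Stats stats = byExtension.computeIfAbsent(extensionOf(path), k -> new Stats());
            stats.attempted++;
            try {
                converter.convert(path);
            } catch (SAXException e) {
                // One bad file should not stop the rest of the batch.
                stats.failed++;
            }
        }
        return byExtension;
    }

    public static void main(String[] args) {
        // Made-up paths and a fake converter, purely to show the call shape.
        List<String> paths = Arrays.asList("resumes/a.rtf", "resumes/b.rtf", "resumes/c.docx");
        Converter fake = path -> {
            if (path.endsWith("b.rtf")) {
                throw new SAXException("Namespace http://www.w3.org/1999/xhtml not declared");
            }
        };
        for (Map.Entry<String, Stats> e : tally(paths, fake).entrySet()) {
            Stats s = e.getValue();
            System.out.printf("%s: %d of %d failed (%.0f%%)%n",
                    e.getKey(), s.failed, s.attempted, 100 * s.failureRate());
        }
    }
}
```

Catching the exception per file keeps one failing attachment from hiding the results for the others, so the per-extension rates can be compared across clients.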

Additional Information

  1. The dataloader.properties file (minus login info)
    dataloader (1).properties.txt

  2. The CSV input file (smallest possible file that reproduces the issue)
    candidatefiles-20210728w3.txt

  3. The results file(s)
    candidatefiles-20210728w3_convertAttachments_2021-07-28_13.48.49_failure.txt

  4. The log file
    dataloader_2021-07-28_13.48.49.log

Sorry you're experiencing this. The convertAttachments command uses Apache Tika (https://tika.apache.org/) to parse the files and convert them to HTML. There is a newer version of Tika available than the one Data Loader currently uses, and it may include changes that work around the problem in these files. I will include a Tika upgrade in the next version of Data Loader; that may allow files with the missing XML info to be parsed.