Azure/spark-cdm-connector

TextParsingException for large columns. Unable to set univocity parser settings.

Closed this issue · 6 comments

Attempting to load from a CDM entity containing embedded emails, which in turn contain embedded images. Setting the option
`.option("maxCharsPerCol", -1)`
or
`.option("maxCharsPerColumn", -1)`
or any large numeric value such as
`.option("maxCharsPerCol", 10000000)`
still fails with:
`com.univocity.parsers.common.TextParsingException: Length of parsed input (500001) exceeds the maximum number of characters defined in your parser settings (500000)`
Please expose an option to update the parser settings.
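For context, a read attempt of the kind described above might look like the following PySpark sketch. The storage account, manifest path, and entity name are hypothetical placeholders (not taken from this report), and it assumes a Spark session with the connector on the classpath:

```python
# Hypothetical CDM read illustrating where the parser-limit option is passed.
# Requires a running Spark session ("spark") with spark-cdm-connector installed;
# the storage/manifest/entity values below are made-up placeholders.
df = (spark.read
      .format("com.microsoft.cdm")
      .option("storage", "mystorageaccount.dfs.core.windows.net")   # placeholder
      .option("manifestPath", "container/default.manifest.cdm.json")  # placeholder
      .option("entity", "EmailActivity")                            # placeholder
      # The option this issue asks the connector to honor; univocity treats
      # -1 as "no per-column character limit".
      .option("maxCharsPerColumn", -1)
      .load())
```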

Thanks for finding the issue. This issue is now fixed and will be included in the next release.

@bissont Do you have a date for the next release? Thank you.

This issue is now fixed. Please use the latest release.

Hi @srichetar,
Was there meant to be a release yesterday? The latest release shown is still the March one. Are you suggesting I rebuild the library from the master branch?

We have pushed the Spark 3 source code to this repository (in the master branch), so you can now build the uber jar yourself. However, if you are using Databricks, only app-registration authentication works (we are working with Databricks on that), and the problem reported in this issue is resolved. For Synapse, the jars are already included in the VHD by default.

Assuming you are using Databricks, you can test it out with the jar under artifacts:
https://github.com/Azure/spark-cdm-connector/blob/master/artifacts/spark-cdm-connector-spark3-assembly-databricks-cred-passthrough-not-working-1.19.2.jar

We haven't released the jar officially because of the issue with credential passthrough.

Excellent. Thanks for the update @bissont