Azure/spark-cdm-connector

CDM read inference failing: The number of columns in CSV/parquet file is not equal to the number of fields in Spark StructType.


Microsoft released a change on Sep 26th, 2021 to support owning business unit security on the data, which introduced the security attribute "owningbusinessunitname" on all user-owned entities. This added a new null value at the end of each line in the CSV data output and caused the following issue across most entities:
"Caused by: java.lang.Exception: The number of columns in CSV/parquet file is not equal to the number of fields in Spark StructType. Either modify the attributes in manifest to make it equal to the number of columns in CSV/parquet files or modify the csv/parquet file"

This required a full resynchronization of the Dataverse output entities, but we'd like the reader to be more resilient to schema changes, whether introduced by Microsoft or not.

I'll provide more details shortly.

More details, relating to the Dataverse documentation topic "What happens when I add a column?":

This results in a "malformed csv", as only records that have been updated since the new columns were added will match the schema listed in the manifest. For example, a new column and single record update on a dataset that had five columns, would result in a dataset where most of the records will still only have five columns except the updated record which will have six.

This can be tested by reading the data with the native CSV reader in mode=FAILFAST, which will report the malformation (see the sketch below). If this mode is not set and the schema from the manifest is applied, a dataset of all NULL values is returned.
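
For illustration, a minimal sketch of that check with the plain Spark CSV reader; the path and schema below are made-up placeholders, not the actual Dataverse entity:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical manifest schema after the extra "owningbusinessunitname"
# attribute was appended (six columns instead of five).
manifest_schema = StructType([
    StructField("accountid", StringType()),
    StructField("name", StringType()),
    StructField("createdon", StringType()),
    StructField("modifiedon", StringType()),
    StructField("statecode", StringType()),
    StructField("owningbusinessunitname", StringType()),
])

# FAILFAST raises an exception on the first row whose column count does not
# match the schema, which is how the "malformed CSV" shows up. Without it,
# the read succeeds and the mismatch is silently absorbed.
df = (spark.read
      .schema(manifest_schema)
      .option("mode", "FAILFAST")
      .csv("abfss://container@account.dfs.core.windows.net/account/*.csv"))
df.show()
```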

This connector needs an update that allows ingestion of a CSV file whose records contain differing numbers of columns.

Thanks for the comment. Yes, we are looking into supporting schema drift when appending columns for new partitions.

This issue also seems to be resolved by setting the following option:
.option("mode", "permissive")

Records without the new column will be filled with null.
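
For reference, a sketch of a full read with that option in place; the storage account, manifest path, entity, and credential values below are placeholders, not the configuration from this thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder service-principal credentials.
app_id = "<application-id>"
app_key = "<application-secret>"
tenant_id = "<tenant-id>"

df = (spark.read.format("com.microsoft.cdm")
      .option("storage", "yourstorageaccount.dfs.core.windows.net")
      .option("manifestPath", "dataverse-container/model.json")
      .option("entity", "account")
      .option("appId", app_id)
      .option("appKey", app_key)
      .option("tenantId", tenant_id)
      # Tolerate partitions written before the new column existed;
      # rows that predate it come back with the column set to null.
      .option("mode", "permissive")
      .load())
```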

We ran into the same exception "The number of columns in CSV/parquet file is not equal to the number of fields in Spark StructType. Either modify the attributes in manifest to make it equal to the number of columns in CSV/parquet files or modify the csv/parquet file".

Using the option mode=permissive mentioned by @Nuglar solved our problem. The cause was a newly added column that led to schema drift in the Table and Changefeed files of an entity.

As mentioned, use permissive mode.

.option("mode", "permissive")

It will fill new columns with null. #84 (comment)