Azure/spark-cdm-connector

Spark-cdm-connector 0.19.1 - java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport

Closed this issue · 7 comments

I am following the example here and get the following error when I run this portion:

# Creates the CDM manifest and adds the entity to it with gzip'd parquet partitions
# with both physical and logical entity definitions
(df.write.format("com.microsoft.cdm")
  .option("storage", StorageAccount)
  .option("manifestPath", "/powerbi/adlsgen2isleghaz/covid19datasetmlDataset/default.manifest.cdm.json")
  .option("entity", "TestEntity")
  .option("format", "parquet")
  .option("compression", "gzip")
  .save())

java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport

Py4JJavaError Traceback (most recent call last)
in
1 # Creates the CDM manifest and adds the entity to it with gzip'd parquet partitions
2 # with both physical and logical entity definitions
----> 3 (df.write.format("com.microsoft.cdm")
4 .option("storage", StorageAccount)
5 .option("manifestPath", "/powerbi/adlsgen2isleghaz/covid19datasetmlDataset/default.manifest.cdm.json")

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
1132 self.format(format)
1133 if path is None:
-> 1134 self._jwrite.save()
1135 else:
1136 self._jwrite.save(path)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in call(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
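For reference, the slash-separated path in the error is just the JVM's internal form of a class name. A small sketch of the translation (the helper is illustrative, not part of the connector or Spark):

```python
def missing_class_from_error(message: str) -> str:
    """Convert the slash-separated class path in a
    NoClassDefFoundError message to a dotted class name."""
    return message.rsplit(":", 1)[-1].strip().replace("/", ".")

# The error above resolves to org.apache.spark.sql.sources.v2.ReadSupport,
# a Spark 2.4 DataSourceV2 interface that no longer exists in Spark 3.x,
# which is why the connector's class fails to load on newer runtimes.
missing_class_from_error(
    "java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport")
```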

After changing the cluster to use Spark 2.4, I still get an issue when trying to create a new manifest.cdm.json or read an existing manifest.cdm.json.
I now get this error: -1 error code: null error message: InvalidAbfsRestOperationException java.net.UnknownHostException: https
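The UnknownHostException: https suggests the full URL, scheme included, was passed as the storage option, so "https" gets resolved as if it were the hostname. A hedged sketch of the cleanup (the option name comes from the snippets in this thread; the helper itself is mine, not the connector's API):

```python
def normalize_storage(value: str) -> str:
    """Reduce a storage value to the bare ADLS Gen2 account endpoint,
    e.g. 'myaccount.dfs.core.windows.net', by stripping any scheme
    prefix and trailing slashes."""
    for scheme in ("https://", "http://", "abfss://", "abfs://"):
        if value.startswith(scheme):
            value = value[len(scheme):]
    return value.rstrip("/")

# .option("storage", normalize_storage("https://dlacopdemocomm02.dfs.core.windows.net"))
```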

I got rid of the https:// portion and I now get this issue:
HEAD https://dlacopdemocomm02.dfs.core.windows.net/power-bi-cdm/powerbi-dataflow/WideWorldImporters/model.json?timeout=90

Py4JJavaError Traceback (most recent call last)
in
2 .option("storage", storageAccountName)
3 .option("manifestPath", "power-bi-cdm/powerbi-dataflow/WideWorldImporters/model.json")
----> 4 .option("entity", "Sales Customers")
5 #.option("appId", appid)
6 #.option("appKey", appkey)

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
170 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
171 else:
--> 172 return self._df(self._jreader.load())
173
174 @since(1.4)

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258

Hi, you need to give Storage Blob Data Contributor access to the identity.

@srichetar The account already has storage blob data contributor access to the identity.
(screenshot of the role assignment attached)

Please email asksparkcdm@microsoft.com if you are still facing the issue.

I faced this issue when using spark-cdm-connector 0.19.1 with Databricks Runtime 8.x; the two are incompatible. Switching to Databricks Runtime 6.4 fixed it for me.
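A minimal sketch of the compatibility rule, under the assumption (consistent with the error in this thread) that connector 0.19.1 links against the Spark 2.4 DataSourceV2 API, which Spark 3.x removed; Databricks Runtime 6.4 ships Spark 2.4, Runtime 8.x ships Spark 3.1:

```python
def connector_0_19_1_compatible(spark_version: str) -> bool:
    """True if this Spark version can load spark-cdm-connector 0.19.1,
    i.e. a Spark 2.x version where org.apache.spark.sql.sources.v2
    still exists."""
    return spark_version.split(".")[0] == "2"

# Check your cluster before attaching the library, e.g.:
#   connector_0_19_1_compatible(spark.version)
```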
