Azure/spark-cdm-connector

Does this connector work for open source Spark?

Closed this issue · 7 comments

I am using PySpark 3.2 and have used the following code to pull in the dependencies. However, it doesn't work and I keep getting the error below.

CODE:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
.appName("NewApp")
.master("local[3]")
.config("spark.jars.packages","org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:spark-cdm-connector:0.19.0")
.config("fs.azure.account.auth.type.abfswales1.dfs.core.windows.net","SharedKey")
.config("fs.azure.account.key.abfswales1.dfs.core.windows.net","SharedKeyFromAzurePortal")
.getOrCreate())

readDf = (spark.read.format("com.microsoft.cdm")
.option("storage", storageAccountName)
.option("manifestPath", "Dataverse-storage" + "/model.json")
.option("entity", "account")
.load())

and I am getting the following error:

Py4JJavaError: An error occurred while calling o59.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Hi @RedwanAlkurdi. Are you using the connector on Azure Synapse or Azure Databricks?

Hi @srichetar. Nope, I am using an on-prem dev environment.

The code is open-sourced and works with Spark 3. Can you try building it locally and using that jar instead?
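Building locally can be sketched as follows. This is a hedged sketch: the build tool, branch, and jar name are assumptions; check the repository's README for the actual build instructions. The likely cause of the original `NoClassDefFoundError` is that the published 0.19.0 artifact references `org.apache.spark.sql.sources.v2.ReadSupport`, an interface that no longer exists under that package in Spark 3.x, so a jar built against your Spark version is needed.

```shell
# Hedged sketch (assumed Maven build and jar name; verify against the repo):
git clone https://github.com/Azure/spark-cdm-connector.git
cd spark-cdm-connector
mvn -DskipTests package

# Point Spark at the locally built jar instead of the Maven coordinate:
spark-submit --jars target/spark-cdm-connector-*.jar my_app.py
```

With `--jars`, the `spark.jars.packages` entry for the connector can be dropped from the session config.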

Sure, I’ll try it out and let you know

@srichetar

Well, it worked after I built it, but it still doesn't run in an on-prem dev environment, which is quite strange, because read-only access should not depend on managed identities. Moreover, it is reading from the snapshots anyway, so it will not interfere with any Dataverse dataflows/links.

Py4JJavaError: An error occurred while calling o60.load.
: java.lang.Exception: Managed identities only supported on Synapse or Databricks
at com.microsoft.cdm.utils.CDMOptions.<init>(CDMOptions.scala:49)
at com.microsoft.cdm.CDMIdentifier.<init>(CDMIdentifier.scala:10)

Is there any way to change this behaviour? It seems internal to the connector.

Please use credential-based authentication to work with an on-prem dev environment.
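For reference, credential-based (Azure AD app registration) authentication with this connector is driven by reader options rather than managed identity. The sketch below assumes the `appId`/`appKey`/`tenantId` option names from the connector's documentation; the values are placeholders, not real credentials.

```python
# Hedged sketch: credential-based auth options for spark-cdm-connector.
# Option names (appId, appKey, tenantId) are assumed from the connector's
# docs; all values here are hypothetical placeholders.
auth_options = {
    "storage": "mystorage.dfs.core.windows.net",   # hypothetical account
    "manifestPath": "Dataverse-storage/model.json",
    "entity": "account",
    "appId": "<service-principal-app-id>",
    "appKey": "<service-principal-secret>",
    "tenantId": "<aad-tenant-id>",
}

# With a live SparkSession this would be wired up as:
# df = (spark.read.format("com.microsoft.cdm")
#         .options(**auth_options)
#         .load())
```

Passing explicit service-principal credentials avoids the `Managed identities only supported on Synapse or Databricks` check entirely.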

Yeah, I figured that out, just forgot to close it, closing the issue now. Thank you :)