Azure/spark-cdm-connector

Databricks 10.5/Spark 3.2.1: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport

Closed · 1 comment

While running the code below on Databricks (Runtime Version 10.5, which includes Apache Spark 3.2.1 and Scala 2.12), the library we installed does not work with this cluster, although it works with runtime version 6.4.

Error:
java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
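
This error points at the Spark 2.x DataSource V2 API: the org.apache.spark.sql.sources.v2 package was removed in Spark 3.0, so a connector JAR built against it cannot load on Databricks 10.5 (Spark 3.2.1), which is consistent with the same JAR working on runtime 6.4 (Spark 2.4). As a minimal diagnostic sketch (assuming the usual notebook spark session), you can confirm the runtime version and whether the legacy interface is still on the classpath:

# Diagnostic sketch: print the runtime Spark version and check whether the
# legacy DataSource V2 interface named in the stack trace still exists.
print(spark.version)  # 3.2.1 on Databricks Runtime 10.5, 2.4.x on 6.4

try:
    spark._jvm.java.lang.Class.forName("org.apache.spark.sql.sources.v2.ReadSupport")
    print("Spark 2.x DataSource V2 API present - a Spark 2.x connector build can load")
except Exception:
    print("Legacy API missing (Spark 3.x runtime) - a Spark 3 build of the connector is required")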

Write a CDM entity with Parquet data files; the entity definition is derived from the dataframe schema:

from datetime import datetime
from decimal import Decimal
from pyspark.sql.types import *

d = datetime.strptime("2015-03-31", '%Y-%m-%d')
ts = datetime.now()
data = [
    ["a", 1, True, 12.34, 6, d, ts, Decimal(1.4337879), Decimal(999.00), Decimal(18.8)],
    ["b", 1, True, 12.34, 6, d, ts, Decimal(1.4337879), Decimal(999.00), Decimal(18.8)]
]

schema = (StructType()
    .add(StructField("name", StringType(), True))
    .add(StructField("id", IntegerType(), True))
    .add(StructField("flag", BooleanType(), True))
    .add(StructField("salary", DoubleType(), True))
    .add(StructField("phone", LongType(), True))
    .add(StructField("dob", DateType(), True))
    .add(StructField("time", TimestampType(), True))
    .add(StructField("decimal1", DecimalType(15, 3), True))
    .add(StructField("decimal2", DecimalType(38, 7), True))
    .add(StructField("decimal3", DecimalType(5, 2), True))
)

df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

Create the CDM manifest and add the entity to it with gzip'd Parquet partitions, including both physical and logical entity definitions:

(df.write.format("com.microsoft.cdm")
    .option("storage", storageAccountName)
    .option("manifestPath", container + "/implicitTest/default.manifest.cdm.json")
    .option("entity", "TestEntity")
    .option("format", "parquet")
    .option("compression", "gzip")
    .save())

Append the same dataframe content to the entity in the default CSV format:

(df.write.format("com.microsoft.cdm")
    .option("storage", storageAccountName)
    .option("manifestPath", container + "/implicitTest/default.manifest.cdm.json")
    .option("entity", "TestEntity")
    .mode("append")
    .save())

Read the entity back into a dataframe:

readDf = (spark.read.format("com.microsoft.cdm")
    .option("storage", storageAccountName)
    .option("manifestPath", container + "/implicitTest/default.manifest.cdm.json")
    .option("entity", "TestEntity")
    .load())

readDf.select("*").show()

We need your help pointing us to the right library so that we can create entity tables in Databricks.
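
One hedged sanity check after installing a connector build that targets Spark 3: Spark resolves the short name "com.microsoft.cdm" to a DefaultSource class in that package by convention, so inspecting the interfaces that class implements shows whether the installed JAR was compiled against the Spark 2.x or the Spark 3.x connector API (a diagnostic sketch only, not part of the connector's documented surface):

# Sketch: inspect the installed connector's DefaultSource class. Spark maps
# format("com.microsoft.cdm") to com.microsoft.cdm.DefaultSource by convention.
cls = spark._jvm.java.lang.Class.forName("com.microsoft.cdm.DefaultSource")
for iface in cls.getInterfaces():
    # A Spark 2.x build implements org.apache.spark.sql.sources.v2.* interfaces;
    # a Spark 3.x build implements org.apache.spark.sql.connector.* types instead.
    print(iface.getName())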

See issue #92.