Azure/spark-cdm-connector

Implicit Parquet format reading from ADLS Gen2 throws NullPointerException


Tried implicit write and read using the parquet format. The write to ADLS Gen2 succeeds and the schema can be read from the DataFrame, but reading the data throws a NullPointerException (NPE).
The same entity can be written and read without issue in csv format.

Implicit Writing Logic:
(df_supplies.write.format("com.microsoft.cdm")
.option("storage", storageAccountName)
.option("manifestPath", container + "/test8/default.manifest.cdm.json")
.option("entity", "supplies")
.option("appId", appID)
.option("appKey", appKey)
.option("tenantId", tenantID)
.option("format", "parquet")
.option("compression", "gzip")
.save())

Implicit Reading Logic:
readDf = (spark.read.format("com.microsoft.cdm")
.option("storage", storageAccountName)
.option("manifestPath", container + "/test8/default.manifest.cdm.json")
.option("entity", "supplies")
.option("appId", appID)
.option("appKey", appKey)
.option("tenantId", tenantID)
.load())

Error Log:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 47.0 failed 4 times, most recent failure: Lost task 0.3 in stage 47.0 (TID 1397, 10.139.64.8, executor 5): java.lang.NullPointerException
Caused by: java.lang.NullPointerException
at com.microsoft.cdm.read.ParquetReaderConnector.jsonToData(ParquetReaderConnector.scala:231)
at com.microsoft.cdm.read.CDMDataReader$$anonfun$get$3.apply(CDMDataReader.scala:85)
at com.microsoft.cdm.read.CDMDataReader$$anonfun$get$3.apply(CDMDataReader.scala:83)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at com.microsoft.cdm.read.CDMDataReader.get(CDMDataReader.scala:83)
at com.microsoft.cdm.read.CDMDataReader.get(CDMDataReader.scala:19)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.next(DataSourceRDD.scala:59)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
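The stack trace points at `ParquetReaderConnector.jsonToData`, which suggests a value or type token that the Parquet-to-Spark conversion does not expect. The connector itself is Scala; the snippet below is only a hypothetical Python sketch of that failure mode (the converter table, function names, and type tokens are all illustrative assumptions, not the connector's actual code). A conversion that dispatches on a type name and invokes the result without a null/unknown guard fails in exactly this way:

```python
# Hypothetical sketch of a jsonToData-style type dispatch.
# NOT the spark-cdm-connector's implementation -- only illustrates how an
# unguarded lookup produces an NPE-equivalent failure on an unexpected token.

CONVERTERS = {
    "string": str,
    "int64": int,
    "double": float,
}

def json_to_data_unsafe(type_token, raw_value):
    # No guard: an unknown token makes .get() return None, and calling
    # None(...) raises -- the Python analogue of the NPE in the trace.
    converter = CONVERTERS.get(type_token)
    return converter(raw_value)

def json_to_data_safe(type_token, raw_value):
    # Guarded version: fail loudly on unknown tokens, pass nulls through.
    converter = CONVERTERS.get(type_token)
    if converter is None:
        raise ValueError(f"unsupported type token: {type_token!r}")
    if raw_value is None:
        return None  # propagate nulls instead of dereferencing them
    return converter(raw_value)
```

If the gzip-compressed Parquet data written by the connector carries a type annotation the reader does not handle (while the csv path does handle it), that would match the symptom of a successful write plus an NPE only on read.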