GoogleCloudDataproc/spark-bigquery-connector

InvalidClassException with spark-bigquery-with-dependencies_2.12-0.34.0.jar, Scala version 2.12, Spark version 3.3.2

Closed this issue · 2 comments

lendle commented

I'm running on GCP Dataproc, image version 2.1.

Spark version: 3.3.2
Scala version: 2.12.18

I'm running in a Jupyter notebook with the Python 3 kernel rather than the PySpark kernel so that I can set spark.jars. I also found I had to set spark.executor.userClassPathFirst, which isn't mentioned anywhere in the docs, so I'm wondering if I'm doing something incorrectly.

from pyspark.sql import SparkSession

spark_bq_jar = "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.34.0.jar"

spark = (SparkSession.builder
         .config('spark.executor.userClassPathFirst', 'true')
         .config('spark.jars', spark_bq_jar)
         .getOrCreate())

df_wiki_pageviews = spark.read \
  .format("bigquery") \
  .option("table", "bigquery-public-data.wikipedia.pageviews_2020") \
  .option("filter", "datehour >= '2020-03-01' AND datehour < '2020-03-02'") \
  .load()

df_wiki_pageviews.show()
Stack trace:

23/11/17 22:34:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (<OMITTED>.internal executor 1): java.io.InvalidClassException: com.google.cloud.spark.bigquery.direct.PreScala213BigQueryRDD; local class incompatible: stream classdesc serialVersionUID = -5329329728832024890, local class serialVersionUID = 2615182009250327661
	at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:560)
	at java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2020)
	at java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1870)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2201)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
	at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496)
	at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390)
	at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
	at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489)
	at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447)
...

Using spark_bq_jar = "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar" works; that's the newest version I've found that does.

lendle commented

I just found that /usr/local/share/google/dataproc/lib/spark-bigquery-with-dependencies_2.12-0.27.1.jar is already included on the JVM classpath on Dataproc via /etc/hadoop/conf/yarn-site.xml, which explains the issue: the driver loads the bundled 0.27.1 classes while the executors, with spark.executor.userClassPathFirst set, load 0.34.0 from spark.jars, so the two sides disagree on the class's serialVersionUID.
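
For anyone who wants to confirm which copy of the connector actually won, here is a minimal sketch that asks the driver JVM where it loaded the connector's provider class from (assuming you run it in the same session on the driver; paths and the class name are from this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ask the driver JVM which jar the connector class came from. If this prints
# the /usr/local/share/google/dataproc/lib/... path instead of the gs:// jar,
# the cluster's bundled copy won on the driver side. (getCodeSource() can be
# null for JDK bootstrap classes, but not for a jar-loaded class like this.)
loader = spark._jvm.java.lang.Thread.currentThread().getContextClassLoader()
cls = loader.loadClass("com.google.cloud.spark.bigquery.BigQueryRelationProvider")
print(cls.getProtectionDomain().getCodeSource().getLocation().toString())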

The Dataproc docs say the jar needs to be made available, but apparently it's already on the cluster, which is what leads to the conflict. I'm reporting an issue with the docs there.

Dataproc has included the BigQuery connector by default since image version 2.1. The docs should be updated with 2.1-specific information. Thanks for the feedback!
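
Until the docs catch up, two workarounds follow from this thread: on a 2.1 image, simply use the preinstalled connector, or, if a newer connector version is required, set userClassPathFirst on the driver as well as the executors so both sides load the same jar (note that per the Spark docs, spark.driver.userClassPathFirst only applies in cluster mode). A minimal sketch of the first option, assuming a Dataproc 2.1 image with the bundled 0.27.1 jar:

from pyspark.sql import SparkSession

# No spark.jars or userClassPathFirst settings needed: the bundled connector
# (0.27.1 on this thread's image) is already on both driver and executor
# classpaths via yarn-site.xml.
spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("bigquery")
      .option("table", "bigquery-public-data.wikipedia.pageviews_2020")
      .option("filter", "datehour >= '2020-03-01' AND datehour < '2020-03-02'")
      .load())
df.show()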