InvalidClassException with spark-bigquery-with-dependencies_2.12-0.34.0.jar, scala version 2.12, spark version 3.3.2
I'm running on GCP dataproc, image version 2.1
Spark version: 3.3.2
Scala version 2.12.18
I'm running in a Jupyter notebook with the Python 3 kernel rather than the PySpark kernel so that I can set spark.jars. I also found I had to set spark.executor.userClassPathFirst, which I didn't find mentioned anywhere in the docs, so I'm wondering if I'm doing something incorrectly.
from pyspark.sql import SparkSession

spark_bq_jar = "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.34.0.jar"

spark = (SparkSession.builder
    .config('spark.executor.userClassPathFirst', 'true')
    .config('spark.jars', spark_bq_jar)
    .getOrCreate())

df_wiki_pageviews = spark.read \
    .format("bigquery") \
    .option("table", "bigquery-public-data.wikipedia.pageviews_2020") \
    .option("filter", "datehour >= '2020-03-01' AND datehour < '2020-03-02'") \
    .load()

df_wiki_pageviews.show()
Stacktrace
23/11/17 22:34:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (<OMITTED>.internal executor 1): java.io.InvalidClassException: com.google.cloud.spark.bigquery.direct.PreScala213BigQueryRDD; local class incompatible: stream classdesc serialVersionUID = -5329329728832024890, local class serialVersionUID = 2615182009250327661
at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:560)
at java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2020)
at java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1870)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2201)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447)
...
Using spark_bq_jar="gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar"
works. That's the newest version I found to work.
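For reference, the working setup is the same session builder as above with the connector pinned to 0.27.1 (a sketch, keeping the userClassPathFirst setting from the original session; whether it is still needed at this version is untested here):

```python
from pyspark.sql import SparkSession

# Pin the connector to 0.27.1 so the executors deserialize the same class
# definitions as the driver (newer versions trigger the InvalidClassException).
spark_bq_jar = "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar"

spark = (SparkSession.builder
    .config('spark.executor.userClassPathFirst', 'true')
    .config('spark.jars', spark_bq_jar)
    .getOrCreate())
```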
I just found that /usr/local/share/google/dataproc/lib/spark-bigquery-with-dependencies_2.12-0.27.1.jar is already included on the JVM classpath on Dataproc via /etc/hadoop/conf/yarn-site.xml, which explains the issue: the preinstalled 0.27.1 classes conflict with the 0.34.0 jar I supplied.
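One way to spot this conflict is to check whether a connector jar is already listed in the YARN classpath. A minimal sketch, parsing yarn-site.xml; the sample XML below stands in for /etc/hadoop/conf/yarn-site.xml on a Dataproc 2.1 node, and the property value shown is illustrative:

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for /etc/hadoop/conf/yarn-site.xml (hypothetical sample content).
SAMPLE_YARN_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CONF_DIR,/usr/local/share/google/dataproc/lib/spark-bigquery-with-dependencies_2.12-0.27.1.jar</value>
  </property>
</configuration>"""

def find_preinstalled_bq_jars(xml_text):
    """Return any spark-bigquery jars listed in yarn.application.classpath."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "yarn.application.classpath":
            entries = prop.findtext("value", "").split(",")
            return [e for e in entries if re.search(r"spark-bigquery", e)]
    return []

print(find_preinstalled_bq_jars(SAMPLE_YARN_SITE))
```

On a real cluster, read the file with `open("/etc/hadoop/conf/yarn-site.xml").read()` instead of the sample string; a non-empty result means a connector version is already on the classpath and a user-supplied jar of a different version can conflict.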
The Dataproc docs say the jar needs to be made available, but it is apparently already available, which causes the conflict. I'm reporting a docs issue there.
Dataproc has included the BigQuery connector by default since image version 2.1. The doc should be updated with 2.1-specific info. Thanks for the feedback!