Connectors hadoop3-2.2.8+ are incompatible with the latest version of spark-sql (3.3.1)
bsenyshyn opened this issue · 3 comments
bsenyshyn commented
Hello!
I tried to upgrade the gcs-connector from 2.2.7 to the latest version, and I ran into this issue after the bump to 2.2.9 (just trying to read a DataFrame from a GCS bucket):
[error] Exception in thread "main" java.lang.NoSuchMethodError: 'com.google.common.cache.CacheBuilder com.google.common.cache.CacheBuilder.expireAfterWrite(java.time.Duration)'
[error] at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.<init>(GoogleCloudStorageImpl.java:206)
[error] at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.<init>(GoogleCloudStorageImpl.java:306)
[error] at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.<init>(GoogleCloudStorageFileSystem.java:171)
[error] at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createGcsFs(GoogleHadoopFileSystemBase.java:1506)
[error] at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1480)
[error] at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:467)
[error] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
[error] at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
[error] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
[error] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
[error] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
[error] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:288)
[error] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:524)
[error] at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
[error] at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
[error] at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
[error] at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
[error] at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
[error] at scala.Option.getOrElse(Option.scala:189)
[error] at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
[error] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:562)
[error] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:547)
...
This problem occurs on both 2.2.8 and 2.2.9. I'm using:
- Java 17
- Scala 2.12.17
- Spark SQL 3.3.1
Looking forward to your answer!
arunkumarchacko commented
Thank you for reporting this.
Could you please provide some more information?
- Are you running this from a Dataproc cluster? If yes, can you please share the Dataproc image version? If you can provide all the relevant parameters you used to create the cluster, that will help me reproduce this issue.
- Can you please confirm that GCS connector version change is the only change?
- Can you please confirm that this works with version 2.2.7 of the GCS connector?
- How did you get hold of the connector? Did you get it from Maven, for example?
bsenyshyn commented
- The application runs locally, without a Dataproc cluster. The Spark configuration is basic:
spark.app.name = "Example application"
spark.master = "local[*]"
spark.hadoop.fs.defaultFS = "gs://YOUR_BUCKET_NAME"
spark.hadoop.fs.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
spark.hadoop.google.cloud.auth.service.account.enable = true
spark.hadoop.google.cloud.auth.service.account.json.keyfile = "path/to/gcp_key.json"
- Yes, GCS connector is the only change.
- Yes, this works with version 2.2.7.
- Yes, the connector comes from Maven, imported via sbt (the Scala build tool):
"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.9"
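For reference, the configuration above corresponds to a minimal driver like the following sketch (the bucket name, key path, and object path are placeholders, not the actual values used):

```scala
import org.apache.spark.sql.SparkSession

object Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Example application")
      .master("local[*]")
      // Hadoop-level settings for the GCS connector, mirroring the config above
      .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
      .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
      .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
      .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/gcp_key.json")
      .getOrCreate()

    // This read triggers filesystem initialization, where the
    // NoSuchMethodError from the stack trace above is thrown
    val df = spark.read.parquet("gs://YOUR_BUCKET_NAME/path/to/data.parquet")
    df.show()
  }
}
```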
medb commented
This is caused by the fact that Spark ships an old version of the Guava library, while you used the non-shaded GCS connector jar. To make it work, you just need to use the shaded GCS connector jar from Maven, for example: https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar
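In sbt, the shaded artifact can be selected with a classifier rather than a direct jar URL; a sketch, assuming the same coordinates as above:

```scala
// build.sbt: depend on the shaded jar, which relocates Guava and other
// transitive dependencies so they cannot clash with Spark's own versions
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.9" classifier "shaded"
```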