GoogleCloudDataproc/hadoop-connectors

Issue with cached credentials when attempting to use different keyfiles in the same Spark App

josecsotomorales opened this issue · 2 comments

Hey folks, I have a Spark application that reads from a source bucket and writes into a target bucket. I'm running into issues when setting the keyfile for the second operation as a Hadoop configuration: in theory the keyfile should be overridden, but in practice the application always uses the first keyfile. I tried unsetting and clearing the Hadoop configs, but for whatever reason the connector keeps using the first credentials file. Here is a code snippet of what I'm trying to accomplish:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Multiple GCS Service Accounts") \
    .getOrCreate()

spark.conf.set("spark.hadoop.fs.gs.auth.service.account.json.keyfile", "/path/to/first/keyfile.json")

# Perform Spark operations using the first key file

# Switch to a different key file
spark.conf.set("spark.hadoop.fs.gs.auth.service.account.json.keyfile", "/path/to/second/keyfile.json")

# Perform Spark operations using the second key file

spark.stop()

For the Hadoop AWS and Hadoop Azure connectors, there are multiple ways to set credentials per bucket; I would like to have the same in the GCS connector, for example:

  // Note the $bucket variable: access keys can be set per bucket
  spark.sparkContext.hadoopConfiguration.set(s"fs.s3a.bucket.$bucket.access.key", accessKey)
  spark.sparkContext.hadoopConfiguration.set(s"fs.s3a.bucket.$bucket.secret.key", secretKey)

@medb @singhravidutt do you know if this is even possible with the current implementation?