samelamin/spark-bigquery

setBigQueryDatasetLocation breaks YARN mode

Closed this issue · 5 comments

Thanks for the library, seems to be working great!

I ran into a very odd issue that took me far too long to figure out. I'm running Spark 2.1 on Qubole (on AWS), and after getting everything working (my query would complete, etc.), the job was still failing at the end with a particularly odd error:

org.apache.spark.SparkException: YarnSparkHadoopUtil is not available in non-YARN mode!
    at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:131)
    at org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:96)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:220)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

I did some binary-search commenting-out of code to figure out where the problem was, and narrowed it down to this line:

spark.sqlContext.setBigQueryDatasetLocation(bqConfig.getString("datasetLocation"))

(note that the config value resolves to "US", which should also be the default)

With that line removed, everything stays in YARN mode and works just fine. Given that the default works for my use case, I didn't spend more time tracking down what might be going on, but figured I'd report the issue.

Thanks again!

After a bit more investigation, it seems like whatever

spark.sqlContext.setBigQueryDatasetLocation(bqConfig.getString("datasetLocation"))

triggers also happens when you run

spark.sqlContext.bigQuerySelect(someQuery)

If I run

spark.sqlContext.bigQueryTable(someTable)

the YARN side all continues to function, but if I run bigQuerySelect the job invariably fails, complaining:
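For reference, a minimal sketch of the two call patterns (assuming a `spark` session with samelamin/spark-bigquery on the classpath; the table and query values are made-up placeholders, and this needs a live Spark-on-YARN environment plus BigQuery credentials, so it's illustrative only):

```scala
import com.samelamin.spark.bigquery._

// Hypothetical placeholders; substitute your own project/dataset/table.
val someTable = "my-project:my_dataset.my_table"
val someQuery = "SELECT * FROM [my-project:my_dataset.my_table]"

// Works: reading a table directly leaves the job in YARN mode.
val tableDf = spark.sqlContext.bigQueryTable(someTable)

// Fails at shutdown with "YarnSparkHadoopUtil is not available in
// non-YARN mode!", the same failure triggered by
// spark.sqlContext.setBigQueryDatasetLocation("US"):
val queryDf = spark.sqlContext.bigQuerySelect(someQuery)
```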

org.apache.spark.SparkException: YarnSparkHadoopUtil is not available in non-YARN mode!

Thanks for reporting this @vijaykramesh. Sorry for the late reply, I was away for the weekend.
Sounds like it's an issue specific to Qubole and its setup with YARN, which I have never personally used.

I suggest trying to replicate it locally, because it sounds like an environment-specific issue. Sorry I can't be much help!

@vijaykramesh I have the same issue when running the job in AWS Glue. Did you find out how to fix it?

@vijaykramesh I applied your fix and it still didn't help. Do you have any idea what else could cause it?

I figured it out. I was using an old version of the library where the fix wasn't merged. Everything is OK now. Thank you for the fix and for the library.