samelamin/spark-bigquery

Export FS must derive from GoogleHadoopFileSystemBase

Closed this issue · 1 comment

Hello,

I'm trying to process some data from BigQuery using a local cluster and then write it to HDFS. I keep getting the following error:

```
java.lang.IllegalStateException: Export FS must derive from GoogleHadoopFileSystemBase.
  at com.google.common.base.Preconditions.checkState(Preconditions.java:456)
  at com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration.getTemporaryPathRoot(BigQueryConfiguration.java:363)
  at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getSplits(AbstractBigQueryInputFormat.java:126)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:130)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1343)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
  at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1378)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
  at com.samelamin.spark.bigquery.BigQuerySQLContext.bigQuerySelect(BigQuerySQLContext.scala:96)
```

Sample code:

```scala
import com.samelamin.spark.bigquery._

// `spark` is an existing SparkSession; the implicits above add the BigQuery helpers.
spark.sqlContext.setBigQueryProjectId("XXXX")
spark.sqlContext.setBigQueryDatasetLocation("EU")
spark.sqlContext.setBigQueryGcsBucket("XXXXX")
spark.sqlContext.useStandardSQLDialect(true)

val table = spark
  .sqlContext
  .bigQuerySelect(
    """
      |SELECT a, b, c
      |FROM `XXX.YYYY.ZZZZ`;
    """.stripMargin)

table.show()
```
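The subsequent write-to-HDFS step I'm attempting looks roughly like this (the output path is just a placeholder):

```scala
// Write the BigQuery result out to HDFS; the path is a placeholder.
table.write.parquet("hdfs:///tmp/bigquery_export")
```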

Thank you for your help.

Yeah, that is to be expected: we need the Google Cloud Storage filesystem because we stage the data down to GCS first and then pull it onto the cluster.
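For reference, a minimal sketch of how the GCS filesystem can be registered on a local cluster once the gcs-connector jar is on the classpath. These are the standard connector settings; the project id and key file path below are placeholders:

```scala
// Register the GCS connector so gs:// paths resolve to GoogleHadoopFileSystem,
// which derives from GoogleHadoopFileSystemBase (what the export check expects).
// Assumes the gcs-connector jar is on the driver and executor classpath.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoopConf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoopConf.set("fs.gs.project.id", "XXXX")
hadoopConf.set("google.cloud.auth.service.account.enable", "true")
// Path to a service-account key file; replace with your own.
hadoopConf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile.json")
```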

You can just download the required jars and create an uber jar.
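If it helps, here is a rough build.sbt sketch of that approach using sbt-assembly; the artifact versions are only illustrative and should be matched to your Spark/Hadoop setup:

```scala
// build.sbt (sketch): bundle the BigQuery connector and the GCS filesystem into one jar.
// Versions are illustrative, not prescriptive.
libraryDependencies ++= Seq(
  "com.github.samelamin"        %% "spark-bigquery" % "0.2.6",
  "com.google.cloud.bigdataoss" %  "gcs-connector"  % "hadoop2-1.9.17"
)

// project/plugins.sbt
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
```

Running `sbt assembly` then produces a single jar you can pass to `spark-submit` via `--jars`.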