/spark-bigquery-connector

The connector uses the Spark SQL Data Source API to read data from Google BigQuery.

Primary LanguageScalaApache License 2.0Apache-2.0

Apache Spark SQL connector for Google BigQuery (Beta)

The connector supports reading Google BigQuery tables into Spark's DataFrames, and writing DataFrames back into BigQuery. This is done by using the Spark SQL Data Source API to communicate with BigQuery.

Beta Disclaimer

The BigQuery Storage API and this connector are in Beta and are subject to change.

Changes may include, but are not limited to:

  • Type conversion
  • Partitioning
  • Parameters

Breaking changes will be restricted to major and minor versions.

BigQuery Storage API

The Storage API streams data in parallel directly from BigQuery via gRPC without using Google Cloud Storage as an intermediary.

It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:

Direct Streaming

It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.

Filtering

The new API allows column and predicate filtering to only read the data you are interested in.

Column Filtering

Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.

Predicate Filtering

The Storage API supports arbitrary pushdown of predicate filters. Connector version 0.8.0-beta and above support pushdown of arbitrary filters to Bigquery.

There is a known issue in Spark that does not allow pushdown of filters on nested fields. For example - filters like address.city = "Sunnyvale" will not get pushdown to Bigquery.

Dynamic Sharding

The API rebalances records between readers until they all complete. This means that all Map phases will finish nearly concurrently. See this blog article on how dynamic sharding is similarly used in Google Cloud Dataflow.

See Configuring Partitioning for more details.

Requirements

Enable the BigQuery Storage API

Follow these instructions.

Create a Google Cloud Dataproc cluster (Optional)

If you do not have an Apache Spark environment you can create a Cloud Dataproc cluster with pre-configured auth. The following examples assume you are using Cloud Dataproc, but you can use spark-submit on any cluster.

Any Dataproc cluster using the API needs the 'bigquery' or 'cloud-platform' scopes. Dataproc clusters have the 'bigquery' scope by default, so most clusters in enabled projects should work by default e.g.

MY_CLUSTER=...
gcloud dataproc clusters create "$MY_CLUSTER"

Downloading the Connector

The latest version connector of the connector is publicly available in gs://spark-lib/bigquery/spark-bigquery-latest.jar. A Scala 2.12 compiled version exist in gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar.

Hello World Example

You can run a simple PySpark wordcount against the API without compilation by running

gcloud dataproc jobs submit pyspark --cluster "$MY_CLUSTER" \
  --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar \
  examples/python/shakespeare.py

Example Codelab

https://codelabs.developers.google.com/codelabs/pyspark-bigquery

Compiling against the connector

Unless you wish to use the implicit Scala API spark.read.bigquery("TABLE_ID"), there is no need to compile against the connector.

To include the connector in your project:

Maven

<dependency>
  <groupId>com.google.cloud.spark</groupId>
  <artifactId>spark-bigquery_${scala.version}</artifactId>
  <version>0.11.0-beta</version>
  <classifier>shaded</classifier>
</dependency>

SBT

libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery" % "0.11.0-beta" classifier "shaded"

API

The connector uses the cross language Spark SQL Data Source API:

Reading data from BigQuery

df = spark.read
  .format("bigquery")
  .option("table", "publicdata.samples.shakespeare")
  .load()

or the Scala only implicit API:

import com.google.cloud.spark.bigquery._
val df = spark.read.bigquery("publicdata.samples.shakespeare")

See Shakespeare.scala and shakespeare.py for more information.

Writing data to BigQuery

Writing a DataFrame to BigQuery is done in a similar manner. Notice that the process writes the data first to GCS and then loads it to BigQuery, a GCS bucket must be configured to indicate the temporary data location.

df.write
  .format("bigquery")
  .option("table","dataset.table")
  .option("temporaryGcsBucket","some-bucket")
  .save()

The data is temporarily stored using the Apache parquet format. An alternative format is Apache ORC.

The GCS bucket and the format can also be set globally using Spark"s RuntimeConfig like this:

spark.conf.set("temporaryGcsBucket","some-bucket")
df.write
  .format("bigquery")
  .option("table","dataset.table")
  .save()

Inportant: The connector does not configure the GCS connector, in order to avoid conflict with another GCS connector, if exists. In order to use the write capabilities of the connector, please configure the GCS connector on your cluster as explained here.

Properties

The API Supports a number of options to configure the read

Property Meaning Usage
table The BigQuery table in the format [[project:]dataset.]table. (Required) Read/Write
dataset The dataset containing the table.
(Optional unless omitted in table)
Read/Write
project The Google Cloud Project ID of the table.
(Optional. Defaults to the project of the Service Account being used)
Read/Write
parentProject The Google Cloud Project ID of the table to bill for the export.
(Optional. Defaults to the project of the Service Account being used)
Read/Write
parallelism The number of partitions to split the data into. Actual number may be less if BigQuery deems the data small enough. If there are not enough executors to schedule a reader per partition, some partitions may be empty.
(Optional. Defaults to one partition per 400MB. See Configuring Partitioning.)
Read
viewsEnabled Enables the connector to read from views and not only tables. Please read the relevant section before activating this option.
(Optional. Defaults to false)
Read
viewMaterializationProject The project id where the materialized view is going to be created
(Optional. Defaults to view's project id)
Read
viewMaterializationDataset The dataset where the materialized view is going to be created
(Optional. Defaults to view's dataset)
Read
temporaryGcsBucket The GCS bucket that temporarily holds the data before it is loaded to BigQuery. Required unless set in the Spark configuration (spark.conf.set(...)). Write
intermediateFormat The format of the data before it is loaded to BigQuery, values can be either "parquet" or "orc".
(Optional. Defaults to parquet). On write only.
Write

Data types

With the exception of DATETIME and TIME all BigQuery data types directed map into the corresponding Spark SQL data type. Here are all of the mappings:

BigQuery Standard SQL Data Type Spark SQL

Data Type

Notes
BOOL BooleanType
INT64 LongType
FLOAT64 DoubleType
NUMERIC DecimalType This preserves NUMERIC's full 38 digits of precision and 9 digits of scope.
STRING StringType
BYTES BinaryType
STRUCT StructType
ARRAY ArrayType
TIMESTAMP TimestampType
DATE DateType
DATETIME StringType Spark has no DATETIME type. Casting to TIMESTAMP uses a configured TimeZone, which defaults to the local timezone (UTC in GCE / Dataproc).

We are considering adding an optional TimeZone property to allow automatically converting to TimeStamp, this would be consistent with Spark's handling of CSV/JSON (except they always try to convert when inferring schema, and default to the local timezone)

TIME LongType Spark has no TIME type. The generated longs, which indicate microseconds since midnight can be safely cast to TimestampType, but this causes the date to be inferred as the current day. Thus times are left as longs and user can cast if they like.

When casting to Timestamp TIME have the same TimeZone issues as DATETIME

Filtering

The connector automatically computes column and pushdown filters the DataFrame's SELECT statement e.g.

spark.read.bigquery("publicdata:samples.shakespeare")
  .select("word")
  .where("word = 'Hamlet' or word = 'Claudius'")
  .collect()

filters to the column word and pushed down the predicate filter word = 'hamlet' or word = 'Claudius'.

If you do not wish to make multiple read requests to BigQuery, you can cache the DataFrame before filtering e.g.:

val cachedDF = spark.read.bigquery("publicdata:samples.shakespeare").cache()
val rows = cachedDF.select("word")
  .where("word = 'Hamlet'")
  .collect()
// All of the table was cached and this doesn't require an API call
val otherRows = cachedDF.select("word_count")
  .where("word = 'Romeo'")
  .collect()

You can also manually specify the filter option, which will override automatic pushdown and Spark will do the rest of the filtering in the client.

Partitioned Tables

The pseudo columns _PARTITIONDATE and _PARTITIONTIME are not part of the table schema. Therefore in order to query by the partitions of partitioned tables do not use the where() method shown above. Instead, add a filter option in the following manner:

val df = spark.read.format("bigquery")
  .option("table", TABLE)
  .option("filter", "_PARTITIONDATE > '2019-01-01'")
  ...
  .load()

Configuring Partitioning

By default the connector creates one partition per 400MB in the table being read (before filtering). This should roughly correspond to the maximum number of readers supported by the BigQuery Storage API. This can be configured explicitly with the parallelism property. BigQuery may limit the number of partitions based on server constraints.

Reading From Views

The connector has a preliminary support for reading from BigQuery views. Please note there are a few caveats:

  • BigQuery views are not materialized by default, which means that the connector needs to materialize them before it can read them. This process affects the read performance, even before running any collect() or count() action.
  • The materialization process can also incur additional costs to your BigQuery bill.
  • By default, the materialized views are created in the same project and dataset. Those can be configured by the optional viewMaterializationProject and viewMaterializationDataset options, respectively. These options can also be globally set by calling spark.conf.set(...) before reading the views.
  • Reading from views is disabled by default. In order to enable it, either set the viewsEnabled option when reading the specific view (.option("viewsEnabled", "true")) or set it globally by calling spark.conf.set("viewsEnabled", "true").

Building the Connector

The connector is built using SBT. Following command creates a jar with shaded dependencies:

sbt assembly

FAQ

What is the Pricing for the Storage API?

See the BigQuery pricing documentation.

I have very few partitions

You can manually set the number of partitions with the parallelism property. BigQuery may provide fewer partitions than you ask for. See Configuring Partitioning.

You can also always repartition after reading in Spark.

Calling count() takes a long time

When spark runs count() on a DataFrame, it first loads the data and then counts the records. Unfortunately we cannot push this down to the BigQuery side. The best workaround is to run the count on the smallest field in the table (ideally a BOOLEAN or INTEGER). This approach will load less data then running count() in the usual manner.

How do I authenticate outside GCE / Dataproc?

Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here.

Credentials can also be provided explicitly either as a parameter or from Spark runtime configuration. It can be passed in as a base64-encoded string directly, or a file path that contains the credentials (but not both).

Example:

spark.read.format("bigquery").option("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")

or

spark.conf.set("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")

Alternatively, specify the credentials file name.

spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>")

or

spark.conf.set("credentialsFile", "</path/to/key/file>")