GoogleCloudDataproc/spark-bigquery-connector

Spark BQ connector doesn't work when reading a table that is partitioned?

gomrinal opened this issue · 5 comments

Note: I am testing/running Spark on Dataproc, so it has the spark-bigquery-connector pre-installed!

Problem: It looks like the Spark BQ connector doesn't work when reading a partitioned table.

I get this error while reading a partitioned BQ table:

Py4JJavaError: An error occurred while calling o86.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.UnavailableException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: UNAVAILABLE: The service is currently unavailable.

My code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('read data from bq').getOrCreate()

table = '<project_id>.<dataset_name>.<table_name>'
data = spark.read.format('bigquery').option('table', table).load()

# Error happens when I use actions like `.show()`
data.show()

The same code works for a table that is not partitioned!
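
A partition filter can also be pushed down through the connector's documented `filter` option; the sketch below assumes a hypothetical DATE partition column named `event_date`, which is a placeholder and not from the original report:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('partition-pruned read').getOrCreate()

# `filter` is pushed down to BigQuery, so only matching partitions are scanned.
# `event_date` stands in for the table's actual partition column.
data = spark.read.format('bigquery') \
    .option('table', '<project_id>.<dataset_name>.<table_name>') \
    .option('filter', "event_date = '2023-01-01'") \
    .load()
data.show()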

Can you please share:

  • Which Spark and Scala versions do you use?
  • Can you share the full stack trace?

Python version: 3.10.8

Spark version: 3.3.2
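
(For the Scala version question above: both versions can be read from a running PySpark session. A minimal sketch using the internal `_jvm` gateway, which is not public API:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # Spark version, e.g. 3.3.2
# Scala version of the JVM backing the session, via the py4j gateway:
print(spark.sparkContext._jvm.scala.util.Properties.versionString())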

However, for the same table, the read works if I use a different approach:

from pyspark.sql import SparkSession
from google.cloud import bigquery

spark = SparkSession.builder \
        .appName("BQ read example")\
        .getOrCreate()
QUERY = """
SELECT *
FROM <table>
LIMIT 1000
"""
bq = bigquery.Client()
query_job = bq.query(QUERY)
query_job.result()  # wait for the query to finish; results land in a temporary destination table

# Read the query's destination table back through the connector
df = spark.read.format('bigquery') \
    .option('dataset', query_job.destination.dataset_id) \
    .load(query_job.destination.table_id)

df.show()
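
As an aside, the connector itself can run the query and read back the materialized result, without the separate google-cloud-bigquery client. A minimal sketch, assuming a pre-existing dataset `tmp_dataset` (a placeholder) for materialization:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('BQ query read example').getOrCreate()

# `viewsEnabled` must be true for the `query` option; the result is
# materialized into `materializationDataset` and read from there.
df = spark.read.format('bigquery') \
    .option('viewsEnabled', 'true') \
    .option('materializationDataset', 'tmp_dataset') \
    .option('query', 'SELECT * FROM <table> LIMIT 1000') \
    .load()

df.show()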

@gomrinal Can you please share the full stack trace of the error?