GoogleCloudDataproc/spark-bigquery-connector

Spark structured streaming jobs failing with BigQuery session expired


Hi @davidrabinowitz @kmjung
Our structured streaming jobs run continuously; they load data from a BigQuery table once every 12 hours and cache it. Whenever the cache is refreshed every 12 hours, the job fails with a session expired exception.
Exception details:
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: FAILED_PRECONDITION: there was an error operating on 'projects/xxxxx/locations/us/sessions/CAISDEZfZDFIdTZLcjNiThoCanEaAml3/streams/GgJqcRoCaXcoAg': session expired at 2024-08-21T00:06:18+00:00

Dependency details:

spark-bigquery-with-dependencies_2.12_0.28.0
Spark version: 3.2.1

As suggested by @davidrabinowitz, we are aware that the session has an expiry time of 6 hours.
We are looking for support to keep the session from being closed for Spark jobs.
Please suggest how we can do this.

Thanks in advance

You can cache the data locally first, either by using .cache(), or by explicitly writing the data to GCS or even local HDFS and reading it from there. What is the size of the data?
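
A minimal sketch of both options, assuming a placeholder table reference and GCS path (neither appears in the thread): `.cache()` materializes the data inside Spark so later actions do not hit the time-limited BigQuery read session again, while the Parquet round trip makes a durable snapshot that outlives the session entirely.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bq-cache-example").getOrCreate()

// Read from BigQuery via the connector; "my_project.my_dataset.my_table"
// is a placeholder table reference.
val df = spark.read
  .format("bigquery")
  .load("my_project.my_dataset.my_table")

// Option 1: keep the data in Spark's cache so subsequent actions
// do not reopen the BigQuery read session.
val cached = df.cache()
cached.count()  // force materialization while the read session is still valid

// Option 2: persist a snapshot to GCS (or HDFS) and read it back;
// gs://my-bucket/snapshots/my_table is a placeholder path.
df.write.mode("overwrite").parquet("gs://my-bucket/snapshots/my_table")
val snapshot = spark.read.parquet("gs://my-bucket/snapshots/my_table")
```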

Our data keeps changing every 12 hours in the BigQuery table. We need to fire a new query against the BigQuery table every 12 hours to get the latest data, and during this window our BigQuery session gets expired. How can we recreate the BigQuery session every time we query the BigQuery table?
I understand your suggestion of writing the data to GCS, but this is not what we want.

You can read from the table every 6 hours (or, to be on the safe side, a bit less than every 6 hours); each new read opens a fresh read session, so the old one expiring does not matter. Caching the results to local HDFS is meant to save the cost of querying the same data twice within 12 hours.
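
One way to implement this, sketched below with placeholder names and a placeholder refresh interval: re-read the table on a schedule shorter than the 6-hour session lifetime, cache the new snapshot, and unpersist the old one. In a real structured streaming job the swap would usually happen from a scheduled thread or inside foreachBatch; the loop here only illustrates the re-read-and-swap pattern.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of a periodic refresh: each call to loadSnapshot re-reads the table
// (opening a new BigQuery read session) and caches the result.
// The table name and the ~5.5 hour interval are placeholders.
object BigQueryRefresh {
  def loadSnapshot(spark: SparkSession): DataFrame = {
    val df = spark.read
      .format("bigquery")
      .load("my_project.my_dataset.my_table")
      .cache()
    df.count()  // materialize while the new read session is still valid
    df
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bq-refresh").getOrCreate()
    var current = loadSnapshot(spark)
    while (true) {
      Thread.sleep(5L * 3600 * 1000 + 30L * 60 * 1000)  // wait ~5.5 hours
      val next = loadSnapshot(spark)   // fresh snapshot, fresh read session
      val previous = current
      current = next                   // point the job at the new snapshot
      previous.unpersist()             // free the cached blocks of the old one
    }
  }
}
```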