GoogleCloudDataproc/spark-bigquery-connector

Predicate pushdown doesn't work with DateTime BQ field - Spark 3.5, connector version 0.37

Closed · 1 comment

Hi,

In several processes we use the Spark connector to read data from a BigQuery table, applying a where condition as a filter on a DATETIME column. The column was previously TIMESTAMP, but we had to switch to DATETIME due to some limits (b/333159527).

df_bigquery = spark.read.format("bigquery").option("table", table_to_read).load().where(f"{delta_field} < '{data_value_end}'")

Here, delta_field is a column of type DATETIME.

Before switching to DATETIME, when the column was of type TIMESTAMP, a where condition such as where = "dt_field > '2024-04-19 12:26:01'" worked: inspecting the connector in detail, we could see that the filter was pushed down to BigQuery, so only the records matching the where condition were brought into Dataproc.
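For context, this is roughly how we check whether the filter reaches BigQuery (table and column names below are placeholders): the physical plan printed by explain() lists the filters the connector accepted for pushdown.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder table; dt_field stands in for the delta column.
df = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.my_table")
    .load()
    .where("dt_field > '2024-04-19 12:26:01'")
)

# With the TIMESTAMP column the predicate showed up among the pushed filters
# in the physical plan; after the switch to DATETIME it no longer does.
df.explain(True)
```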

After switching to DATETIME, we had to adapt this where condition and ran several tests:

  • Reading as with TIMESTAMP, for example with where = "dt_field > '2024-04-19 12:26:01'" -> we get an error

  • Reading with where = "dt_field > CAST('2024-04-19 12:26:01' AS TIMESTAMP_NTZ)" -> we get an error

  • Reading with a where condition that uses the function to_timestamp_ntz (see the sketch after this list) -> works, BUT we noticed that the filter is not pushed down to BigQuery. This is especially problematic given the size of many of the tables we have to work with.
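A sketch of the to_timestamp_ntz variant from the last bullet (placeholder names again): it filters correctly on the Spark side, but with connector 0.37 the comparison is evaluated after the full read instead of being pushed to BigQuery.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.my_table")
    .load()
    # Compares the DATETIME column (TIMESTAMP_NTZ in Spark 3.5) against a
    # TIMESTAMP_NTZ literal; results are correct, but no pushdown on 0.37.
    .where(F.col("dt_field") > F.to_timestamp_ntz(F.lit("2024-04-19 12:26:01")))
)
```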

What are we missing to get predicate pushdown working for a DATETIME field?

Fixed: 791397e
Will be available in the next release (0.38).
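For anyone landing here later, a minimal sketch of pinning the connector release that contains the fix from PySpark (artifact coordinates follow the usual naming scheme; confirm the exact 0.38.x version and Scala suffix against the project's releases page):

```python
from pyspark.sql import SparkSession

# Pull the connector release containing the fix (version assumed; check the
# releases page for the exact artifact before relying on this).
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.38.0",
    )
    .getOrCreate()
)
```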