GoogleCloudDataproc/spark-bigquery-connector

Does Spark read from BigQuery multiple times when joining?

Closed this issue · 2 comments

My question is the following:

When reading from BigQuery into a Spark DataFrame and then using that DataFrame in multiple joins, does Spark hit BigQuery multiple times?

Note: Assume a single action at the end.

That depends on the query plan and on whether the DataFrame is cached. The best approach is to run .explain() on the result.
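As a rough illustration of that advice, the sketch below reads a table through the connector, uses the resulting DataFrame on both sides of a join, and prints the physical plan. The function name `join_twice`, the table name, and the column names `id`, `name` and `value` are hypothetical and not taken from the thread. Whether the uncached plan really shows two scans depends on the Spark version and optimizer, so treat the printed plan as the source of truth.

```python
def join_twice(spark, table, cache=False):
    """Read `table` once in code, reuse the DataFrame in a join, print the plan.

    `spark` is an existing SparkSession with the BigQuery connector on its
    classpath. Column names below are hypothetical placeholders.
    """
    df = spark.read.format("bigquery").load(table)

    if cache:
        # Marks the DataFrame for caching; it is materialized by the first
        # action, after which the plan should reference the cached data
        # rather than a fresh scan of the source.
        df = df.cache()

    left = df.select("id", "name")
    right = df.select("id", "value")
    joined = left.join(right, "id")

    # Prints the physical plan; look for scan nodes of the BigQuery relation
    # (or for in-memory scan nodes when caching is enabled).
    joined.explain()
    return joined
```

With a live session this could be called as `join_twice(spark, "my-project.my_dataset.my_table")` and again with `cache=True`, comparing the two printed plans.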


So, looking at the physical plan in the Spark UI, would it be correct to say that the connector hits BigQuery once for every occurrence of the following node in the plan?

Scan com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation@4753eb60
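If one assumes that each such scan node corresponds to a separate read (the thread does not confirm this, and plan-level reuse or caching may change the picture), the nodes can be counted from the plan text. The helper below is a minimal sketch: `count_bigquery_scans` is a hypothetical name, and the sample text is hand-written from the node quoted above rather than real `.explain()` output. It assumes the default plan format, where each node appears once.

```python
import re

# Matches the scan node quoted in the thread; the hex suffix after "@" is
# the identity hash of the relation object and varies between runs.
SCAN_PATTERN = re.compile(
    r"Scan\s+com\.google\.cloud\.spark\.bigquery\.direct\."
    r"DirectBigQueryRelation@[0-9a-f]+"
)


def count_bigquery_scans(plan_text):
    """Return how many BigQuery scan nodes appear in a physical plan string."""
    return len(SCAN_PATTERN.findall(plan_text))


# Hand-written sample for illustration only, not real explain() output.
sample_plan = """
Scan com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation@4753eb60
Scan com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation@4753eb60
"""

print(count_bigquery_scans(sample_plan))
```

The plan text can be pasted from the Spark UI or from the output of `.explain()`.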
