GoogleCloudDataproc/spark-bigquery-connector

Big Query Full Table Read - Some Partitions take longer bqScanTime as compared to other partitions with similar number of records

Closed this issue · 4 comments

I am trying to read a large, big query table with 95 million records and 400+ columns.
I am using bqChannelPoolSize=15, with 6 Executors with 11 cores each.
I could see, from Spark UI, that there are 200 Tasks created for the Read Session with more or less equal data (500K records each on an avg), however actual read for certain partitions take much longer (many times of the execution time needed for other similar partitions). Because of these few threads which are taking more time, my job is going beyond 6 hours and getting Read Session Timed out error. I have checked the GCP Logging Service with the Read Session id, but couldn't find anything unusual there, in fact no entries were present for that time frame for the session id.

How can i debug this further? Is there any table level optimizations that i can do for the full table reads (also this happens for the other queries where we are getting just 4 partitions,3 partitions will get finished under 3 minutes and last one will take 3 hours to read same number of records as that of other partitions)

Technical Details
PySpark: 3.0.2, OnPremise Cluster on K8S
Big Query Spark Connector: spark-3.1-bigquery-0.36.1
bqChannelPoolSize=15
maxParallelism received: 200
Spark Conf: 6 Executors with 11 cores

Can you share session IDs for example sessions? These should be in the Spark driver logs, and are strings of the format "projects/.../locations/.../sessions/...".

@kmjung : Unfortunately i cannot share that info as the issue is in Production Big Query Tables and may need to take extra approvals. Is there a way i can debug the issue any further?

It's going to be tough for me to debug without any additional logs here. If you're open to filing a support case with GCP, that might be the easiest way to share this information securely with relevant experts from the BigQuery and Dataproc teams.

Thank you @kmjung. I will do that.