GoogleCloudDataproc/spark-bigquery-connector

BigQuery Storage API always returning 200 partitions


I'm setting preferredMinParallelism and maxParallelism, but no matter what I do I always end up with 200 partitions, regardless of how large the underlying table is. I've tried tables as large as 4 TiB with the same result.

spark:spark.datasource.bigquery.preferredMinParallelism: "33333"
spark:spark.datasource.bigquery.maxParallelism: "33333"

With the settings above, the message I receive is:

Requested 33333 max partitions, but only received 200 from the BigQuery Storage API for session 

Is there some additional config that I am missing?
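For context, the same two connector options can also be set per read rather than as cluster properties. Below is a rough PySpark sketch (the project, dataset, and table names are placeholders), including a check of how many partitions Spark actually received:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same connector options as the cluster properties above, set per read.
# The table reference is a placeholder.
df = (
    spark.read.format("bigquery")
    .option("preferredMinParallelism", "33333")
    .option("maxParallelism", "33333")
    .option("table", "my_project.my_dataset.my_table")
    .load()
)

# Number of partitions Spark actually got from the BigQuery Storage read session
print(df.rdd.getNumPartitions())
```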

Hi @rcanzanese,
The actual number of partitions may be less than preferredMinParallelism if BigQuery deems the data small enough.
There are also quotas on the number of partitions per read session, which restrict the parallelism. Please file a case with support to increase the quota for your project.
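If the read session is capped and you mainly need more downstream parallelism, one generic Spark workaround (not a connector feature, just a sketch with placeholder names) is to repartition after the load:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The Storage API still decides how many read streams (input partitions) are
# created, but a repartition adds a shuffle that raises the parallelism of the
# stages that follow the load.
df = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.my_table")  # placeholder table
    .load()
)
wide = df.repartition(4000)  # pick a target partition count suited to your cluster
```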