delta-io/delta-sharing

Databricks fails to process Delta Sharing query using notebook

rustyconover opened this issue · 2 comments

There is a problem with querying Delta Share tables containing many files. I'm using Databricks 11.3 LTS.

I have a table composed of 156,000 files that I am trying to query via both the SQL warehouse and the Python notebook interface.

The SQL query succeeds in the Databricks SQL warehouse interface, but the same query never completes in a Python notebook. The Databricks Delta Sharing client sends the same request to read data from the table to the Delta Sharing server multiple times with identical parameters. It seems to be retrying the same request repeatedly without ever processing the result.

History tracking is not enabled on the table in question. The request body is {predicateHints:[]} even though the query references the table's partition columns.

The SQL query does not need to process all 156,000 files. So many files are returned from the Delta Sharing server because the Databricks client did not include predicate hints.

I've observed this behavior often when Databricks queries a table for the first time. Using partition predicates would reduce the number of files returned to fewer than 50.
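
To make the difference concrete, here is a minimal sketch of the two request bodies for the protocol's table query endpoint (`POST {prefix}/shares/{share}/schemas/{schema}/tables/{table}/query`). The endpoint URL, the share, schema, and table names, and the `date` partition column are all hypothetical placeholders, not values from my setup. The helper only builds the request and does not send anything.

```python
import json


def build_query_request(endpoint, share, schema, table,
                        predicate_hints=None, limit_hint=None):
    """Build the URL and JSON body for a Delta Sharing table query request.

    predicate_hints is a list of SQL boolean expressions; the server may use
    them on a best-effort basis to skip files that cannot match.
    """
    url = (f"{endpoint.rstrip('/')}/shares/{share}"
           f"/schemas/{schema}/tables/{table}/query")
    body = {"predicateHints": list(predicate_hints or [])}
    if limit_hint is not None:
        body["limitHint"] = limit_hint
    return url, json.dumps(body)


# Hypothetical endpoint and table names, for illustration only.
ENDPOINT = "https://sharing.example.com/delta-sharing"

# What the notebook client appears to send: no hints, so every file is listed.
url, observed_body = build_query_request(
    ENDPOINT, "my_share", "my_schema", "events")

# What I would expect for a query filtering on a partition column
# (the column name `date` is an assumption).
_, expected_body = build_query_request(
    ENDPOINT, "my_share", "my_schema", "events",
    predicate_hints=["date = '2022-11-01'"])

print(url)
print(observed_body)
print(expected_body)
```

With an empty `predicateHints` list the server has nothing to filter on and returns every file in the table, which matches the 156,000 files I am seeing.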

I can share the SQL queries and demonstrate the behavior of Databricks on demand.

Please advise me on how to proceed so this query can run successfully in a notebook.

> the Python notebook interface.

Are you using load_as_pandas or load_as_spark?

I think we resolved this by upgrading to a newer version of Databricks.