Enhancing query pushdown in Postgres...

Question

MrPowers opened this issue 5 years ago · 1 comment

A lot of queries are pushed down to the database level when Snowflake is used, as described in this blog post: https://www.snowflake.com/blog/snowflake-spark-part-2-pushing-query-processing/

Joins, aggregations, and SQL functions are all pushed down and performed in the Snowflake database before data is sent to Spark.

I know some stuff gets pushed down to Postgres (column pruning), but are joins and aggregations being pushed down? @nvander1 - do you know what gets pushed down to Postgres? Is this something we could improve?

Some analyses could do a lot of stuff at the database level, only send a fraction of the data to the Spark cluster, and then probably perform a lot faster. Spark isn't the best at joins, so pushing those down to the database level would probably help a lot...

Answer 1 · 2019-05-19T19:14:38.000Z

For JDBC, Spark can only really push down filters (like a WHERE clause) and maybe column pruning: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-12126

It sounds like a lot of the work they are doing on DataSource V2 is geared toward writing JDBC sources that can do more sophisticated pushdowns. For now, however, you can manually give it a SQL string.
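As a rough sketch of the "manually give it a SQL string" approach mentioned above: Spark's JDBC reader accepts a parenthesized subquery with an alias as the `dbtable` option, so the join and aggregation run inside Postgres and only the result rows reach Spark. The helper names (`pushdown_options`, `read_pushed_down`), the alias `pushed`, and the `orders` / `customers` tables are hypothetical and only serve the illustration; this is one possible way to do it, not necessarily what the project does.

```python
def pushdown_options(url, sql, alias="pushed"):
    """Build JDBC reader options so that `sql` is executed by Postgres.

    Hypothetical helper: wraps the query as an aliased subquery, which
    Spark's JDBC source accepts in place of a table name.
    """
    query = sql.strip().rstrip(";").strip()  # a trailing ';' would break the subquery
    return {
        "url": url,
        "dbtable": f"({query}) AS {alias}",
        "driver": "org.postgresql.Driver",
    }


def read_pushed_down(spark, url, sql, **credentials):
    """Load the result of `sql` as a DataFrame, given an existing SparkSession."""
    opts = pushdown_options(url, sql)
    opts.update(credentials)  # e.g. user="...", password="..."
    return spark.read.format("jdbc").options(**opts).load()


# Assumed example query: join + aggregation evaluated in the database.
ORDERS_PER_COUNTRY = """
SELECT c.country, COUNT(*) AS n_orders
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.country;
"""
```

With a live session this would be called as `read_pushed_down(spark, "jdbc:postgresql://host:5432/db", ORDERS_PER_COUNTRY, user="...", password="...")`. The trade-off is that the SQL is opaque to Spark's optimizer, so anything that should run in Postgres has to be written into the string by hand.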
