spark-root/laurelin

Support for LIMIT pushdown

ghislainfourny opened this issue · 6 comments

It seems that when trying to read the first N rows, Laurelin materializes the entire table before applying the LIMIT, e.g.:

SELECT * FROM `the-name-i-give-to-my-df` LIMIT 10

Are there plans to push the LIMIT down so that less data is read from disk? I would expect this query to be very fast even with a very large file.
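
For concreteness, this is roughly the pattern I'm using (a sketch in Java; the "root" format name and the "tree" option are how I recall loading files with Laurelin, and the tree/file names are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LimitExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("laurelin-limit-example")
            .getOrCreate();

        // Assumed Laurelin options; "Events" and the path are placeholders.
        Dataset<Row> df = spark.read()
            .format("root")
            .option("tree", "Events")
            .load("large-file.root");

        df.createOrReplaceTempView("events");

        // Ideally this would read only enough baskets to produce 10 rows
        // instead of materializing the whole TTree first.
        spark.sql("SELECT * FROM events LIMIT 10").show();
    }
}
```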

Thanks!

Thanks, Andrew!

I think that if we implement org.apache.spark.sql.connector.read.SupportsReportStatistics, which reports the number of rows and the number of bytes, Spark will know the row count of a scan up front, which should let it limit the partitions it reads.
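
A minimal sketch of what that could look like, assuming the Spark 3.x DSv2 API (the RootScan class and its fields are illustrative names, not actual Laurelin code):

```java
import java.util.OptionalLong;

import org.apache.spark.sql.connector.read.Statistics;
import org.apache.spark.sql.connector.read.SupportsReportStatistics;
import org.apache.spark.sql.types.StructType;

// SupportsReportStatistics extends Scan, so this is the Scan itself.
// toBatch() and the actual partition planning are omitted in this sketch.
public class RootScan implements SupportsReportStatistics {
    private final StructType schema;
    private final long totalRows;   // e.g. taken from the TTree metadata
    private final long totalBytes;  // e.g. the sum of basket sizes on disk

    public RootScan(StructType schema, long totalRows, long totalBytes) {
        this.schema = schema;
        this.totalRows = totalRows;
        this.totalBytes = totalBytes;
    }

    @Override
    public StructType readSchema() {
        return schema;
    }

    @Override
    public Statistics estimateStatistics() {
        // Spark's optimizer can use these estimates when planning a LIMIT.
        return new Statistics() {
            @Override
            public OptionalLong sizeInBytes() {
                return OptionalLong.of(totalBytes);
            }

            @Override
            public OptionalLong numRows() {
                return OptionalLong.of(totalRows);
            }
        };
    }
}
```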

I asked the mailing list yesterday -- http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-LIMIT-in-DSv2-td37166.html . I'll report back if there's any reply. Longer term, I'm planning on implementing the various DataSourceV2 mixins, so that functionality should land soon.
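
As an aside, the mixin that addresses LIMIT directly is SupportsPushDownLimit on the ScanBuilder, though as far as I know that interface only exists in newer Spark releases (3.3+). A rough sketch, with RootScanBuilder as an illustrative name rather than actual Laurelin code:

```java
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.SupportsPushDownLimit;

// SupportsPushDownLimit extends ScanBuilder, so this is the ScanBuilder itself.
public class RootScanBuilder implements SupportsPushDownLimit {
    private int limit = -1;  // -1 means "no limit pushed"

    @Override
    public boolean pushLimit(int limit) {
        // Remember the limit so the Scan can stop once enough rows are covered.
        this.limit = limit;
        return true;
    }

    @Override
    public Scan build() {
        // A real implementation would construct the Scan with the recorded
        // limit so partition planning can skip the remaining baskets.
        throw new UnsupportedOperationException("sketch only");
    }
}
```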

Thanks a lot, Andrew. I am looking forward to their answer. It would be wonderful if Spark supported this for all possible inputs (except, that is, when the schema has to be inferred).