Support for LIMIT pushdown
ghislainfourny opened this issue · 6 comments
It seems that when trying to read the first N rows, Laurelin materializes the entire table before applying the LIMIT, e.g.:
```sql
SELECT * FROM the-name-i-give-to-my-df LIMIT 10
```
Are there plans to push down the LIMIT to read less from disk? I would expect this query to be very fast even with a very large file.
Thanks!
Thanks, Andrew!
I think that if we implement org.apache.spark.sql.connector.read.SupportsReportStatistics, which provides the number of rows and bytes, Spark will know the number of rows in a scan up front, which will let it limit the partitions it reads.
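As a rough sketch (not Laurelin's actual code; class and field names below are hypothetical), a DSv2 Scan advertising statistics via Spark's org.apache.spark.sql.connector.read interfaces could look roughly like this:

```java
import java.util.OptionalLong;

import org.apache.spark.sql.connector.read.Batch;
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.Statistics;
import org.apache.spark.sql.connector.read.SupportsReportStatistics;
import org.apache.spark.sql.types.StructType;

// Hypothetical sketch of a Scan that reports statistics up front.
public class RootFileScan implements Scan, SupportsReportStatistics {
    private final StructType schema;
    private final long totalRows;   // e.g. the TTree entry count (assumption)
    private final long totalBytes;  // e.g. summed basket sizes (assumption)

    public RootFileScan(StructType schema, long totalRows, long totalBytes) {
        this.schema = schema;
        this.totalRows = totalRows;
        this.totalBytes = totalBytes;
    }

    @Override
    public StructType readSchema() {
        return schema;
    }

    @Override
    public Statistics estimateStatistics() {
        // Spark's optimizer can use these numbers when planning a LIMIT
        // instead of materializing every partition.
        return new Statistics() {
            @Override
            public OptionalLong sizeInBytes() {
                return OptionalLong.of(totalBytes);
            }

            @Override
            public OptionalLong numRows() {
                return OptionalLong.of(totalRows);
            }
        };
    }

    @Override
    public Batch toBatch() {
        // Partition planning omitted in this sketch.
        throw new UnsupportedOperationException("sketch only");
    }
}
```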
I asked the mailing list yesterday -- http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-LIMIT-in-DSv2-td37166.html . I'll report back if there's any reply, but longer term I'm planning on implementing the various DataSourceV2 mixins, so that functionality should come soon.
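For what it's worth, later Spark releases (3.3+) added a SupportsPushDownLimit mixin on ScanBuilder, which is the direct hook for this. A purely illustrative builder (names are not Laurelin's) could advertise it roughly like so:

```java
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.SupportsPushDownLimit;

// Hypothetical sketch: SupportsPushDownLimit extends ScanBuilder (Spark 3.3+).
public class RootScanBuilder implements SupportsPushDownLimit {
    private int limit = -1;  // -1 means "no limit pushed"

    @Override
    public boolean pushLimit(int limit) {
        // Remember the limit so build() can plan only as many baskets /
        // row ranges as needed instead of materializing the whole table.
        this.limit = limit;
        return true;
    }

    @Override
    public boolean isPartiallyPushed() {
        // Spark still applies the final LIMIT; the source only bounds
        // how many rows each partition reads.
        return true;
    }

    @Override
    public Scan build() {
        // Pass the limit down to the Scan / partition planner (omitted here).
        throw new UnsupportedOperationException("sketch only");
    }
}
```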
Thanks a lot, Andrew. I am looking forward to their answer. It would be wonderful if Spark supported this (for all possible inputs -- except, that is, when the schema has to be inferred).