optionally memoize filter -> rowgroup mappings?
cldellow opened this issue · 1 comments
In the census dataset, a query for city == 'Dawson Creek'
can't effectively prune row groups, since the data isn't sorted by city. A row group might have min=Abbotsford, max=Vancouver with those being the only two cities present, but we still have to decode the whole row group. :(
e.g., if we split the data into ~256 row groups of 50k rows each, we prune only 12 row groups. We examine the remaining 244, but only 1 of them is responsive. (If we used the low-level page API, we could do something similar directly using the parquet dictionary, but apache/arrow#1426 (comment) makes me think that's going to be beyond my skills.)
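A quick back-of-envelope sketch of why min/max statistics fail here (hypothetical numbers: each row group samples cities roughly uniformly, so its [min, max] range spans nearly the whole city domain and almost no group can be skipped):

```python
# Sketch: min/max row-group stats can't prune when data isn't sorted.
# Hypothetical setup: 256 row groups, each sampling cities uniformly.
import random

cities = sorted(f"city{i:03d}" for i in range(500))
random.seed(0)

def make_rowgroups(n=256):
    """Each row group samples cities uniformly, so min/max span ~everything."""
    groups = []
    for _ in range(n):
        sample = random.choices(cities, k=100)  # stand-in for 50k rows
        groups.append((min(sample), max(sample)))
    return groups

def prunable(groups, needle):
    """A group can be skipped only if the needle falls outside its [min, max]."""
    return sum(1 for lo, hi in groups if not (lo <= needle <= hi))

groups = make_rowgroups()
print(prunable(groups, "city250"))  # almost nothing is pruned
```

With sorted data the ranges would be disjoint and all but one group would be prunable; unsorted, nearly every group's range contains the needle.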
So it might be nice to detect this and memoize the filter → row-group mapping for future queries. Optionally, we could persist it to disk, though then we'd need to invalidate the mapping whenever the parquet file changes.
Goal:
select count(*) from stats where city = 'Dawson Creek'
should be near-instantaneous on the second query.
select count(*) from stats where city = 'Dawson Creek' and year = 2016
should be detected as a subset of the first query and also be instantaneous.
Maybe each clause should get its own mapping so we can flexibly AND them together at query time?
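A minimal sketch of that per-clause idea (hypothetical design, not the current implementation): record a bitmask of responsive row groups per clause after a scan, then AND the masks of whatever clauses appear in a new query.

```python
# Sketch: memoize, per filter clause, which row groups contained matches,
# then AND the sets at query time. Clause keys here are hypothetical.
class ClauseMemo:
    def __init__(self, num_rowgroups):
        self.num_rowgroups = num_rowgroups
        self.memo = {}  # ("city", "=", "Dawson Creek") -> bitmask

    def record(self, clause, responsive_groups):
        """Store a bitmask of responsive row groups after a scan."""
        mask = 0
        for g in responsive_groups:
            mask |= 1 << g
        self.memo[clause] = mask

    def candidates(self, clauses):
        """AND together memoized masks; unknown clauses don't constrain."""
        mask = (1 << self.num_rowgroups) - 1  # start with all groups
        for c in clauses:
            if c in self.memo:
                mask &= self.memo[c]
        return [g for g in range(self.num_rowgroups) if mask >> g & 1]

memo = ClauseMemo(256)
memo.record(("city", "=", "Dawson Creek"), [17])
memo.record(("year", "=", 2016), [17, 42, 99])
print(memo.candidates([("city", "=", "Dawson Creek"), ("year", "=", 2016)]))
# -> [17]
```

This also gives the subset behavior for free: the second Goal query reuses the city clause's mask even though the combined filter was never seen before.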
Implementation note: sqlite may bail out before scanning the entire dataset (e.g., if a LIMIT clause is present). We should either store the mapping only when we detect that a full scan was done, or store a marker indicating which row group to resume processing at.
We could create a shadow table in sqlite to store this metadata; then people can opt into/out of persistence by choosing to use an in-memory or on-disk db.
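Tying the above together, here's one possible shadow-table shape (table and column names are hypothetical): key each mapping by a fingerprint of the parquet file for invalidation, and carry a completeness flag plus resume marker for partial scans.

```python
# Sketch (hypothetical schema): persist clause -> row-group mappings in a
# shadow table. The (mtime, size) fingerprint invalidates stale mappings
# when the parquet file changes; `complete` records whether the scan
# covered every row group (cf. LIMIT bailing out early).
import sqlite3

db = sqlite3.connect(":memory:")  # an on-disk db makes the memo persistent
db.execute("""
    CREATE TABLE IF NOT EXISTS _stats_rowgroup_memo (
        clause      TEXT NOT NULL,     -- e.g. "city = 'Dawson Creek'"
        file_mtime  INTEGER NOT NULL,  -- fingerprint: invalidate on change
        file_size   INTEGER NOT NULL,
        bitmap      BLOB NOT NULL,     -- 1 bit per row group
        complete    INTEGER NOT NULL,  -- 1 = full scan; 0 = partial
        resume_at   INTEGER NOT NULL DEFAULT 0,  -- next row group to scan
        PRIMARY KEY (clause, file_mtime, file_size)
    )
""")

bitmap = bytes([0] * 31 + [0b00000010])  # one responsive row group (illustrative)
db.execute(
    "INSERT INTO _stats_rowgroup_memo VALUES (?, ?, ?, ?, ?, ?)",
    ("city = 'Dawson Creek'", 1700000000, 123456, bitmap, 1, 0),
)
row = db.execute(
    "SELECT complete FROM _stats_rowgroup_memo WHERE clause = ?",
    ("city = 'Dawson Creek'",),
).fetchone()
print(row[0])  # -> 1
```

Using the vtable's own database for the shadow table means the in-memory vs. on-disk persistence choice falls out naturally, as suggested above.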