Cache: Query result cache
Closed this issue · 6 comments
BohuTANG commented
Summary
Background
For a read, the main flow is:
- Get the source plan: the file names (partition files).
- Read the files by name from object storage (such as AWS S3).
With the query result cache, we can do:
- Step 1. Parse the query and calculate the fingerprint: query_id
- Step 2. Get the source plan (read_plan) and calculate the fingerprint: source_plan_id
- Step 3. Check the cache
  - 3.1 If the cache entry /query_id/source_plan_id/result exists, get and return the result.
  - 3.2 If the cache entry does not exist, put the result into the cache.
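The steps above can be sketched roughly as follows. This is only an illustration of the lookup flow, not the actual implementation: `ResultCache`, `get_or_compute`, and the in-memory dict standing in for object storage are all hypothetical names, and the two IDs are assumed to be computed already.

```python
from typing import Callable, Dict, Tuple


class ResultCache:
    """Hypothetical in-memory stand-in for the object-storage-backed cache."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def key(query_id: str, source_plan_id: str) -> str:
        # Key layout taken from the proposal: /query_id/source_plan_id/result
        return f"/{query_id}/{source_plan_id}/result"

    def get_or_compute(
        self,
        query_id: str,
        source_plan_id: str,
        compute: Callable[[], bytes],
    ) -> Tuple[bytes, bool]:
        """Return (result, cache_hit)."""
        k = self.key(query_id, source_plan_id)
        if k in self._store:
            # Step 3.1: the entry exists, return it without running the query.
            return self._store[k], True
        # Step 3.2: the entry does not exist, run the query and store the
        # complete result.
        result = compute()
        self._store[k] = result
        return result, False
```

Because the key includes source_plan_id, a change in the underlying partition files leads to a different key, so a stale result is never returned for new data.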
Where the cache is stored
The cache is stored in S3 under the path /<bucket>/<tenant>/result/cache/, and the user can download it.
How to calculate the fingerprint
- query_id: does it need to be based on the AST? For example, select * from t1 where a>1 should have the same fingerprint as select * from t1 where a>1 and 1=1.
- source_plan_id: based on the partition file names and the file offsets.
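A minimal sketch of the two fingerprints, under loud assumptions: SHA-256 is an arbitrary choice of hash, `query_id` here uses a toy text normalization (collapse whitespace, drop a trailing-style `and 1=1` conjunct) rather than a real AST, and `source_plan_id` assumes the read plan can be reduced to (file name, offset) pairs whose order does not matter. None of these function names come from the codebase.

```python
import hashlib
import re
from typing import Iterable, Tuple


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def query_id(sql: str) -> str:
    """Toy fingerprint of a query.

    Not a SQL parser: it only collapses whitespace and removes
    "and 1=1" conjuncts. A real version would normalize the AST.
    """
    normalized = " ".join(sql.split())
    normalized = re.sub(
        r"\s+and\s+1\s*=\s*1\b", "", normalized, flags=re.IGNORECASE
    )
    return _sha256(normalized)


def source_plan_id(parts: Iterable[Tuple[str, int]]) -> str:
    """Fingerprint of a read plan given as (file name, offset) pairs.

    Pairs are sorted first, so the listing order does not change the ID
    (an assumption of this sketch).
    """
    lines = [f"{name}:{offset}" for name, offset in sorted(parts)]
    return _sha256("\n".join(lines))
```

With this sketch, `query_id("select * from t1 where a>1")` equals `query_id("select * from t1 where a>1 and 1=1")`, which is the behavior the example in the proposal asks for. Text-level rewriting like this is fragile, which is one argument for basing the fingerprint on the AST.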
youngsofun commented
/assignme
flaneur2020 commented
cc @Chasen-Zhang this may help the issues we talked yesterday
BohuTANG commented
cc @drmingdrmer
drmingdrmer commented
The requirements for the query result cache and for the data block cache are different:
- The result cache must be complete, i.e., caching only part of the result is not allowed. The entire result data is added and removed as a whole.
- The data block cache prefers partial caching: unused data should not be cached.
AFAIK, these two goals may conflict with each other:
- The result cache may be evicted due to too many block caches.
- The result cache would be better served by a least-recently-added eviction policy (a result is kept for a certain time no matter how often it is read), while the block cache would be better served by a least-recently-used policy.
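To make the difference between the two policies concrete, here is a small sketch. Both classes are hypothetical illustrations, not code from the project: `LruCache` evicts by last access, while `AddedTimeCache` keeps an entry for a fixed time after insertion and never extends that time on reads.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Dict, Optional, Tuple


class LruCache:
    """Least-recently-used: reads keep an entry alive (block cache style)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


class AddedTimeCache:
    """Entries expire a fixed time after being added, however often read."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        added_at, value = entry
        if self._clock() - added_at >= self.ttl:
            del self._items[key]  # the whole result is removed at once
            return None
        return value  # reading does not refresh added_at

    def put(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock(), value)
```

If both kinds of entry shared one LRU-managed space, a burst of block reads could push out a complete result, which is the conflict described above.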
Xuanwo commented
This issue has been moved to v0.9
BohuTANG commented
The result cache is finished but not used yet, let's close.