Cache: Query result cache

Question

Closed this issue 8 months ago · 6 comments

Summary

Background

For a read, the main flow is:

With the query result cache, we can do:

Step 1. Parse the query and calculate its fingerprint: query_id
Step 2. Get the source plan (read_plan) and calculate its fingerprint: source_plan_id
Step 3. Check the cache
- 3.1 If the cache entry /query_id/source_plan_id/result exists, get and return the result.
- 3.2 If the cache entry does not exist, execute the query and put the result into the cache.
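
The steps above can be sketched roughly as follows. This is only an illustration, not the actual implementation: the in-memory dict stands in for the real result store, and fingerprint, cache_key, run_with_result_cache, and the execute callback are hypothetical names. Hashing the raw query text and a string form of the read plan is an assumption made to keep the example short.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Row = Tuple

# Hypothetical in-memory stand-in for the result store, keyed by cache path.
_store: Dict[str, List[Row]] = {}


def fingerprint(text: str) -> str:
    """Short stable digest of a string (assumption: SHA-256, truncated)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def cache_key(query_id: str, source_plan_id: str) -> str:
    """Build the cache path in the /query_id/source_plan_id/result layout."""
    return f"/{query_id}/{source_plan_id}/result"


def run_with_result_cache(
    sql: str,
    read_plan: str,
    execute: Callable[[str], List[Row]],
) -> Tuple[List[Row], bool]:
    """Return (rows, cache_hit) for a query."""
    query_id = fingerprint(sql)              # Step 1
    source_plan_id = fingerprint(read_plan)  # Step 2
    key = cache_key(query_id, source_plan_id)

    if key in _store:                        # Step 3.1: hit, return cached rows
        return _store[key], True

    rows = execute(sql)                      # Step 3.2: miss, run and store
    _store[key] = rows
    return rows, False
```

Because source_plan_id is part of the key, a change in the underlying data yields a different path, so a stale result is never returned for the new data; it simply stops being looked up.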

Results are stored in S3 under the path /<bucket>/<tenant>/result/cache/, and the user can download them.

Does query_id need to be based on the AST? That way, select * from t1 where a>1 would get the same fingerprint as select * from t1 where a>1 and 1=1.
source_plan_id is based on the partition file names and the file offsets.
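
One possible way to derive source_plan_id from partition file names and offsets is sketched below. The function name, the (file name, offset) tuple shape, the use of SHA-256, and the decision to sort the parts so that ordering does not affect the fingerprint are all assumptions for illustration, not the project's actual design.

```python
import hashlib
from typing import Iterable, Tuple


def source_plan_id(parts: Iterable[Tuple[str, int]]) -> str:
    """Fingerprint a read plan given as (partition file name, offset) pairs."""
    h = hashlib.sha256()
    # Sort so the same set of parts hashes identically in any order.
    for name, offset in sorted(parts):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator so name/offset boundaries stay unambiguous
        h.update(str(offset).encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()
```

Any added, removed, or rewritten partition file changes the fingerprint, which is what lets the cache treat the old result as belonging to a different source plan.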


Answer 1 · 2022-04-15T08:04:44.000Z

cc @Chasen-Zhang this may help with the issues we talked about yesterday

Answer 2 · 2022-07-20T23:40:46.000Z

Answer 3 · 2022-07-25T05:35:42.000Z

The requirements for the query result cache and for the data block cache are different:

The result cache must be complete, i.e., caching only part of the result is not allowed. The entire result data is added and removed as a whole.
The data block cache prefers partial caching: unused data should not be cached.

AFAIK, these two goals may conflict with each other:

The result cache may be evicted due to too many block caches.
The result cache would be better served by a least-recently-added eviction policy (a result is kept for a certain time no matter how often it is read), while the block cache would be better served by a least-recently-used policy.
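
To make the difference between the two policies concrete, here is a minimal sketch. Both classes are hypothetical and exist only for illustration. The added-order cache takes the current time as an explicit argument (an assumption that keeps the example deterministic) and expires entries a fixed time after insertion, regardless of reads; the LRU cache is bounded by entry count and refreshes an entry on every read.

```python
from collections import OrderedDict


class LruCache:
    """Least-recently-used: a read moves the entry to the 'fresh' end."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # reading refreshes the entry
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


class AddedOrderCache:
    """Least-recently-added: entries expire ttl seconds after insertion."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.items = OrderedDict()  # key -> (added_at, value), oldest first

    def put(self, key, value, now: float):
        self.items.pop(key, None)  # re-adding restarts the entry's lifetime
        self.items[key] = (now, value)

    def get(self, key, now: float):
        self._expire(now)
        entry = self.items.get(key)  # reading does NOT extend the lifetime
        return None if entry is None else entry[1]

    def _expire(self, now: float):
        while self.items:
            oldest_key = next(iter(self.items))
            added_at, _ = self.items[oldest_key]
            if now - added_at < self.ttl:
                break
            self.items.popitem(last=False)
```

With the LRU cache, a frequently read block survives newer insertions; with the added-order cache, a result disappears once its time is up even if it was read a moment ago. Sharing one store between the two workloads would force one of these behaviours on both.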

Answer 4 · 2022-08-15T10:54:49.000Z

This issue has been moved to v0.9.

Answer 5 · 2024-03-21T12:56:09.000Z

The result cache is finished but not used yet; let's close this.