databendlabs/databend

Cache: Query result cache

Closed this issue · 6 comments

Summary

Background

For a read, the main flow is:

  1. Get the source plan: the names of the partition files to read.
  2. Read those files by name from object storage (such as AWS S3).

With the query result cache, we can do:

  • Step 1. Parse the query and calculate its fingerprint: query_id
  • Step 2. Get the source plan (read_plan) and calculate its fingerprint: source_plan_id
  • Step 3. Check the cache
    • 3.1 If the cache entry /query_id/source_plan_id/result exists, get it and return the result.
    • 3.2 If it does not exist, execute the query and put the result into the cache.
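A minimal sketch of the Step 3 check, assuming both fingerprints are already computed as 64-bit values. The `ResultCache` type, its `get_or_execute` method, and the in-memory `HashMap` are hypothetical stand-ins for illustration only; the proposal stores results in object storage, and none of these names come from Databend.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the result cache.
/// The proposal stores results in object storage; a map keeps the sketch small.
#[derive(Default)]
pub struct ResultCache {
    entries: HashMap<String, Vec<u8>>,
}

impl ResultCache {
    /// Builds a key in the `/query_id/source_plan_id/result` layout.
    /// Rendering the ids as 16-digit hex is an assumption of this sketch.
    pub fn key(query_id: u64, source_plan_id: u64) -> String {
        format!("/{query_id:016x}/{source_plan_id:016x}/result")
    }

    /// Step 3.1: return the cached result on a hit.
    /// Step 3.2: otherwise run `execute`, store the complete result, and return it.
    pub fn get_or_execute<F>(&mut self, query_id: u64, source_plan_id: u64, execute: F) -> Vec<u8>
    where
        F: FnOnce() -> Vec<u8>,
    {
        let key = Self::key(query_id, source_plan_id);
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        let result = execute();
        self.entries.insert(key, result.clone());
        result
    }
}
```

Because source_plan_id is part of the key, a change in the set of partition files produces a different key, so a stale result is never looked up under the new key.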

Where the cache is stored

The cache is stored in S3 under the path /<bucket>/<tenant>/result/cache/, and the user can download it.

How to calculate the fingerprint

  • Should query_id be based on the AST? For example, select * from t1 where a>1 should have the same fingerprint as select * from t1 where a>1 and 1=1
  • source_plan_id is based on the partition file names and the file offsets
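One way the two fingerprints could be computed is sketched below. Everything here is an assumption for illustration: the tiny `Expr` enum is a hypothetical predicate AST (not Databend's), `normalize` only folds integer tautologies such as 1=1, and `DefaultHasher` stands in for whatever stable hash a real implementation would pick, since its output is not guaranteed to stay the same across Rust releases.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, heavily simplified predicate AST (not Databend's real AST).
#[derive(Clone, Debug, Hash, PartialEq)]
pub enum Expr {
    Column(String),
    Int(i64),
    Gt(Box<Expr>, Box<Expr>),
    Equals(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    True,
}

/// Folds trivially true comparisons such as `1 = 1` and drops them from conjunctions.
pub fn normalize(expr: Expr) -> Expr {
    match expr {
        Expr::Equals(l, r) => {
            let (l, r) = (normalize(*l), normalize(*r));
            if matches!((&l, &r), (Expr::Int(a), Expr::Int(b)) if a == b) {
                Expr::True
            } else {
                Expr::Equals(Box::new(l), Box::new(r))
            }
        }
        Expr::Gt(l, r) => Expr::Gt(Box::new(normalize(*l)), Box::new(normalize(*r))),
        Expr::And(l, r) => match (normalize(*l), normalize(*r)) {
            (Expr::True, other) | (other, Expr::True) => other,
            (l, r) => Expr::And(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}

/// Hashes any hashable value to a 64-bit fingerprint.
fn fingerprint<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// query_id: fingerprint of the select list, table, and normalized filter.
pub fn query_id(select_list: &str, table: &str, filter: Expr) -> u64 {
    fingerprint(&(select_list, table, normalize(filter)))
}

/// One partition to read: the file name plus the offset within the file.
#[derive(Hash)]
pub struct Partition {
    pub file_name: String,
    pub offset: u64,
}

/// source_plan_id: fingerprint of the partitions in plan order.
pub fn source_plan_id(partitions: &[Partition]) -> u64 {
    fingerprint(&partitions)
}
```

With this normalization, the filters a>1 and a>1 and 1=1 hash to the same query_id. The sketch hashes partitions in plan order; sorting them first would make the fingerprint independent of that order, if that is desired.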

/assignme

cc @Chasen-Zhang this may help with the issues we talked about yesterday

The requirements for the query result cache and the data block cache are different:

  • The result cache must be complete, i.e., caching only part of the result is not allowed. The entire result data is added and removed as a whole.
  • The data block cache prefers partial caching: data that is not used should not be cached.

AFAIK, these two goals may conflict with each other:

  • Result cache entries may be evicted because too many blocks are cached.
  • The result cache would be better served by a least-recently-added eviction policy (a result is kept for a certain time no matter how often it is read), while the block cache would be better served by a least-recently-used policy.
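To make the difference between the two policies concrete, here is a rough sketch of both, assuming a logical clock passed in by the caller and a capacity counted in entries. The `TtlCache` and `LruCache` types are hypothetical and only illustrate the policies named above; they are not Databend's implementation.

```rust
use std::collections::{HashMap, VecDeque};

/// Result cache sketch: an entry expires `ttl` ticks after insertion.
/// Reads never extend its lifetime.
pub struct TtlCache {
    ttl: u64,
    entries: HashMap<String, (u64, Vec<u8>)>, // key -> (inserted_at, whole result)
}

impl TtlCache {
    pub fn new(ttl: u64) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Stores the complete result together with its insertion time.
    pub fn put(&mut self, key: &str, result: Vec<u8>, now: u64) {
        self.entries.insert(key.to_string(), (now, result));
    }

    /// Returns the result unless it is older than `ttl`; expired entries are dropped whole.
    pub fn get(&mut self, key: &str, now: u64) -> Option<&Vec<u8>> {
        let expired = matches!(
            self.entries.get(key),
            Some((at, _)) if now.saturating_sub(*at) >= self.ttl
        );
        if expired {
            self.entries.remove(key);
        }
        self.entries.get(key).map(|(_, result)| result)
    }
}

/// Block cache sketch: least-recently-used eviction bounded by entry count.
pub struct LruCache {
    capacity: usize,
    order: VecDeque<String>, // front = least recently used
    blocks: HashMap<String, Vec<u8>>,
}

impl LruCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, order: VecDeque::new(), blocks: HashMap::new() }
    }

    /// Moves `key` to the most-recently-used end of the queue.
    fn touch(&mut self, key: &str) {
        self.order.retain(|k| k.as_str() != key);
        self.order.push_back(key.to_string());
    }

    /// Inserts a block and evicts from the least-recently-used end while over capacity.
    pub fn put(&mut self, key: &str, block: Vec<u8>) {
        self.blocks.insert(key.to_string(), block);
        self.touch(key);
        while self.blocks.len() > self.capacity {
            match self.order.pop_front() {
                Some(oldest) => {
                    self.blocks.remove(&oldest);
                }
                None => break,
            }
        }
    }

    /// A read counts as a use and protects the block from the next eviction.
    pub fn get(&mut self, key: &str) -> Option<&Vec<u8>> {
        if self.blocks.contains_key(key) {
            self.touch(key);
        }
        self.blocks.get(key)
    }
}
```

Keeping the two caches as separate structures with separate budgets also avoids the first conflict: block traffic cannot push a result out of the result cache.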

This issue has been moved to v0.9

The result cache is finished but not used yet, so let's close this.