minio/sidekick

[faq] Question about the sidekick cache feature: is it a distributed client-side cache?

gwnet opened this issue · 15 comments

gwnet commented

One question to clarify how the sidekick MinIO cache works.
For example, there are two clients:
ClientA sets up a sidekick MinIO cache in front of the remote MinIO server.
ClientB sets up another sidekick MinIO cache in front of the same remote MinIO server.
How are the caches replicated between ClientA and ClientB?

Can you clarify?

harshavardhana commented

Caches are not replicated. The cache is a centralized cache shared between clients.

gwnet commented

@harshavardhana can you give a little more detail? The project introduction shows the client app and sidekick deployed on the client machine, which saves the network overhead between the client app and the MinIO cache inside sidekick. That is my understanding from the diagram. But if it is a centralized shared cache, then it is not a client-side cache; it is a server-side cache, and all clients need to access sidekick over the network.

gwnet commented

@harshavardhana the project main page says that a sidekick is deployed on each client. For example, if ClientA modifies objectA, how does ClientB get notified that it needs to invalidate its cache?

harshavardhana commented

The cache is centralized, @gwnet; invalidation happens automatically.

Clients are not caching things independently.

It is never described as a client-side cache; it is a shared cache.
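To make the "shared, not per-client" point concrete, here is a minimal Python model. It is only an illustration under stated assumptions: `SharedCache` and `Proxy` are hypothetical names, plain dicts stand in for the cache backend and the data cluster, and dropping the cached entry on write is one possible invalidation policy chosen for the sketch, not a description of sidekick's actual mechanism.

```python
class SharedCache:
    """Hypothetical stand-in for the single S3-compatible cache backend."""

    def __init__(self):
        self.objects = {}


class Proxy:
    """Hypothetical stand-in for one sidekick instance; it keeps no cached data itself."""

    def __init__(self, cache, origin):
        self.cache = cache    # the one cache shared by every proxy
        self.origin = origin  # dict standing in for the remote data cluster

    def get(self, key):
        if key in self.cache.objects:
            return self.cache.objects[key]
        value = self.origin[key]
        self.cache.objects[key] = value  # fill the shared cache on a miss
        return value

    def put(self, key, value):
        self.origin[key] = value
        # Assumption for this sketch: a write drops the shared entry.
        self.cache.objects.pop(key, None)


origin = {"objectA": b"v1"}
cache = SharedCache()
proxy_a = Proxy(cache, origin)  # sidecar next to ClientA
proxy_b = Proxy(cache, origin)  # sidecar next to ClientB

first_read = proxy_b.get("objectA")   # ClientB reads; the shared cache is filled
proxy_a.put("objectA", b"v2")         # ClientA modifies objectA
second_read = proxy_b.get("objectA")  # ClientB reads again
```

Because both proxies point at the same cache, there is only one cached copy to invalidate, so ClientB needs no notification from ClientA.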

gwnet commented

@harshavardhana thank you so much. I think I get it. The cache MinIO server is deployed alongside the client app cluster, for example a Spark cluster, and it runs in distributed mode across that cluster. So when a Spark worker reads from the cache MinIO server via sidekick, the MinIO server fetches the contents from the other nodes of the cache cluster and then replies to Spark, which means this cannot eliminate all of the network overhead. The cache MinIO server is distributed, and the remote MinIO server is distributed too. Is this correct?
So sidekick needs to be deployed on each node of the Spark cluster, and sidekick always sends requests to the MinIO cache server on its localhost, right?

harshavardhana commented

The cache server is different from the one you are using for your actual data, @gwnet. The cache server is a higher-performance server backed by something like an Optane SSD, which can handle high-speed reads and writes.

You shouldn't repurpose your existing distributed MinIO cluster to cache its own content again using sidekick. That is not an ideal architectural choice and wouldn't give you the performance gain you would otherwise get from caching.

If you do not have fast Optane-like SSDs, caching is not worth it for you; the MinIO distributed cluster will deliver the performance you need for the hardware you have, and sidekick will efficiently load balance the incoming requests.
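For readers unfamiliar with what "load balance the incoming requests" involves, the following Python sketch shows one common policy: round-robin over backends, skipping any that are marked unhealthy. This is a generic illustration, not sidekick's implementation; `RoundRobinBalancer`, its methods, and the `http://minioN:9000` URLs are hypothetical.

```python
from itertools import count


class RoundRobinBalancer:
    """Hypothetical round-robin picker that skips backends marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._counter = count()

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, in rotation order.
        for _ in range(len(self.backends)):
            candidate = self.backends[next(self._counter) % len(self.backends)]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


# Placeholder endpoints for a 4-node data cluster.
nodes = [f"http://minio{i}:9000" for i in range(1, 5)]
balancer = RoundRobinBalancer(nodes)

picks = [balancer.pick() for _ in range(4)]        # one request per node, in order
balancer.mark_down("http://minio2:9000")           # e.g. a failed health check
picks_after = [balancer.pick() for _ in range(3)]  # minio2 is skipped
```

In a real deployment the health state would come from periodic health checks rather than manual `mark_down` calls.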

gwnet commented

@harshavardhana
I am confused by this comment. What is the purpose of sidekick in front of the distributed cache MinIO server? From your main page, it looks like the cache is inside sidekick. Could you please help me clarify?

You shouldn't repurpose your existing distributed MinIO cluster to cache its own content again using sidekick. That is not an ideal architectural choice and wouldn't give you the performance gain you would otherwise get from caching.

I do have Optane and many QLC drives, and I plan to use Optane and QLC for the cache MinIO server. I also want the remote MinIO server on HDD as tier 2 storage. What would the deployment and I/O path look like? Can you give me a detailed I/O lifecycle between the client, the HTTP cache layer, the MinIO cache, sidekick, and the remote MinIO?

harshavardhana commented

The cache is not inside sidekick; sidekick uses an S3 backend as a shared cache. This S3 backend, preferably MinIO, runs on an Optane SSD. sidekick is just a smart load balancer in front of your actual large-scale data cluster, meant to be used as a sidecar alongside the application. The Spark examples in the README illustrate this.

For detailed architecture guidance we recommend a commercial engagement. Reach out to us through our website for more hands-on guidance: https://min.io/pricing

gwnet commented

Let us take one example. I have 4 nodes as a Spark cluster called clusterSpark, and another 4 nodes as the remote MinIO server with HDDs, called clusterMinio.

  1. Install the MinIO server in distributed mode on clusterMinio; this is where the real data lives.
  2. Install the MinIO server in distributed mode on clusterSpark, using Optane + QLC; this is the cache MinIO server.
  3. Install sidekick on each node of clusterSpark and point sidekick's cache at the MinIO server set up in step 2. To save network traffic, each node's sidekick can be configured with the local IP address of the cache MinIO server.
    Is this correct?
    Then, when I/O from Spark comes in:
  4. I/O goes into sidekick first.
  5. sidekick tries to get the object from the cache MinIO server deployed inside the Spark cluster.
  6. On a miss, sidekick passes the I/O to its internal load balancer, which forwards it to the remote MinIO server. So load balancing happens after the cache MinIO server; the cache MinIO server sits in front of the load balancer in the I/O stack.
    Is this a correct understanding?

gwnet commented

@harshavardhana expert, could you please comment on my understanding above?

harshavardhana commented

👍
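The read path in steps 4 to 6 of the example can be sketched in Python as follows. This is a rough model under assumptions, not sidekick code: `Sidecar` is a hypothetical name, dicts stand in for the cache MinIO server and for clusterMinio, the node names are placeholders, and filling the cache after a miss is assumed here because the thread does not spell it out.

```python
from itertools import cycle


class Sidecar:
    """Hypothetical model of one sidekick instance on a Spark node."""

    def __init__(self, cache, remote_nodes):
        self.cache = cache                        # stands in for the cache MinIO server
        self._next_remote = cycle(remote_nodes)   # simple rotation over clusterMinio nodes
        self.log = []                             # records which path each request took

    def get(self, key):
        # Steps 4-5: the request reaches the sidecar, which checks the cache first.
        if key in self.cache:
            self.log.append(("cache-hit", key))
            return self.cache[key]
        # Step 6: on a miss, the load balancer picks a remote node.
        node = next(self._next_remote)
        self.log.append(("remote-read", node["name"], key))
        value = node["objects"][key]
        self.cache[key] = value  # assumption: the miss populates the cache
        return value


# clusterMinio: four nodes of one distributed cluster, so all expose the same objects.
objects = {"part-0001": b"spark input data"}
remote_nodes = [{"name": f"minio{i}", "objects": objects} for i in range(1, 5)]

sidecar = Sidecar(cache={}, remote_nodes=remote_nodes)
first = sidecar.get("part-0001")   # miss: served by a clusterMinio node
second = sidecar.get("part-0001")  # hit: served by the cache
```

After the two reads, `sidecar.log` shows one remote read followed by one cache hit, which is the ordering described in step 6: the cache sits in front of the load balancer.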

gwnet commented

:) @harshavardhana thank you so much man!~

gwnet commented

@harshavardhana hello, the sidekick cache is a read cache only, right? All writes pass through to the backend directly?
If so, how should this be handled when the app is Spark or machine learning that needs low-latency writes?

harshavardhana commented

Spark is more read-heavy than write-heavy, @gwnet.