Remote Cache Warning GRPC
DEADLINE_EXCEEDED: deadline exceeded after 59.999913100s. [closed=[], open=[[buffered_nanos=33620, ....]]
Hi! We've been using this remote cache backed by S3 (blob-based S3 storage) and have recently been seeing gRPC timeouts. We're running 3 instances of the service with 4 vCPUs each, on version 2.3.9.
Our CPU and memory usage peak at only around 80% and 30% respectively. I don't have a reliable repro for this, but I was wondering if you had any insight into what could be going wrong here?
Do you see any errors in bazel-remote's logs when the client shows these timeouts?
Do you have iowait monitoring on the bazel-remote machine and on the S3 storage (if it's something you're running locally)? If you see iowait spikes at those times, then maybe your storage bandwidth is saturated.
@mostynb, yeah, we see a lot of GRPC BYTESTREAM READ FAILED
Our request volume increased when we added these coverage flags:
coverage --experimental_fetch_all_coverage_outputs
coverage --experimental_split_coverage_postprocessing
Do you have iowait monitoring on the bazel-remote machine and on the s3 storage
I don't believe so, but I can take a look.
In general, is it better to scale vertically or horizontally? Currently we have 3 replicas on ECS.
yeah, we see a lot of GRPC BYTESTREAM READ FAILED
Are there any more details provided in the logs besides bytestream read failed and the resource/blob name? If so, could you share a few of them here?
In general, is it better to scale vertically or horizontally? Currently we have 3 replicas on ECS
The REAPIv2 cache service has strong coherence requirements, and bazel doesn't behave nicely when those assumptions fail. E.g. bazel builds can fail if they make a request to one cache server, then make a request to another server (with a different set of blobs) during the same build. This makes horizontal scaling risky, unless you arrange things in such a way that a given client only talks to a single cache server during a single build (see the sketch just below).
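As a minimal illustration (the hostname and port here are made up), one way to do that is to pin each client or CI worker to a single bazel-remote instance in its .bazelrc, rather than pointing it at a load balancer that spreads requests across replicas:

build --remote_cache=grpcs://bazel-remote-1.internal:9092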
Evicting items from bazel-remote's proxy backends can also break these assumptions. To avoid this we would need to figure out a way to implement some sort of LRU-like eviction for S3 (but I don't have an AWS account to do this work myself).
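For reference, here's a rough sketch of what such eviction could look like, written against the AWS SDK for Go (this is not bazel-remote code; the bucket name and object-count limit are made up). Note that S3 only exposes LastModified, not last-access time, so this approximates LRU by deleting the oldest-uploaded objects first:

// Sketch: age-based eviction for an S3 cache bucket (hypothetical name
// "bazel-remote-cache"). S3 doesn't record last-access time, so this
// deletes the oldest objects (by LastModified) until the bucket is back
// under a target object count.
package main

import (
	"fmt"
	"log"
	"sort"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

const (
	bucket     = "bazel-remote-cache" // hypothetical bucket name
	maxObjects = 1_000_000            // hypothetical size target
)

func main() {
	svc := s3.New(session.Must(session.NewSession()))

	// Collect every object's key and LastModified timestamp.
	var objs []*s3.Object
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{Bucket: aws.String(bucket)},
		func(page *s3.ListObjectsV2Output, lastPage bool) bool {
			objs = append(objs, page.Contents...)
			return true
		})
	if err != nil {
		log.Fatal(err)
	}
	if len(objs) <= maxObjects {
		fmt.Println("nothing to evict")
		return
	}

	// Oldest first, then delete until we're back under the limit.
	sort.Slice(objs, func(i, j int) bool {
		return objs[i].LastModified.Before(*objs[j].LastModified)
	})
	toDelete := objs[:len(objs)-maxObjects]

	// DeleteObjects accepts at most 1000 keys per request.
	for start := 0; start < len(toDelete); start += 1000 {
		end := start + 1000
		if end > len(toDelete) {
			end = len(toDelete)
		}
		var ids []*s3.ObjectIdentifier
		for _, o := range toDelete[start:end] {
			ids = append(ids, &s3.ObjectIdentifier{Key: o.Key})
		}
		if _, err := svc.DeleteObjects(&s3.DeleteObjectsInput{
			Bucket: aws.String(bucket),
			Delete: &s3.Delete{Objects: ids},
		}); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Printf("evicted %d objects\n", len(toDelete))
}

A real implementation would also have to be careful not to delete blobs that an in-flight build still expects to find, which is exactly the coherence problem described above.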