buchgr/bazel-remote

Remote Cache Warning GRPC


DEADLINE_EXCEEDED: deadline exceeded after 59.999913100s. [closed=[], open=[[buffered_nanos=33620, ....]]

Hi 👋 We've been using this remote cache backed by S3 (blob-based S3 storage) and have recently been seeing gRPC timeouts. We're running 3 instances of the service with 4 vCPUs each, on version 2.3.9.

Our CPU and memory usage peak at only 80% and 30% respectively. I don't have a reliable repro for this, but was wondering if you had any insight into what could be going wrong here?
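For reference, the ~60s deadline in the error matches Bazel's default --remote_timeout of 60 seconds. Raising the client-side deadline wouldn't fix slow reads, but could confirm whether they eventually succeed, e.g. in .bazelrc (the value here is arbitrary):

build --remote_timeout=120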

Do you see any errors in bazel-remote's logs when the client shows these timeouts?

Do you have iowait monitoring on the bazel-remote machine and on the s3 storage (if it's something you're running locally)? If you see iowait spikes then your storage bandwidth may be saturated.
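A quick way to check on a Linux host, assuming the sysstat tools are installed, is to sample extended device stats and watch the %iowait and %util columns while the timeouts occur:

iostat -x 1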

@mostynb, yeah, we see a lot of GRPC BYTESTREAM READ FAILED

Our request volume increased when we added:

coverage --experimental_fetch_all_coverage_outputs
coverage --experimental_split_coverage_postprocessing

Do you have iowait monitoring on the bazel-remote machine and on the s3 storage

I don't believe so, but I can take a look.

In general, is it better to scale vertically or horizontally? Currently we have 3 replicas on ECS

yeah, we see a lot of GRPC BYTESTREAM READ FAILED

Are there any more details provided in the logs besides bytestream read failed and the resource/blob name? If so, could you share a few of them here?

In general, is it better to scale vertically or horizontally? Currently we have 3 replicas on ECS

The REAPIv2 cache service has strong coherence requirements, and bazel doesn't behave nicely when those assumptions fail, e.g. bazel builds can fail if they make a request to one cache server and then, during the same build, make a request to another server holding a different set of blobs. This makes horizontal scaling risky, unless you arrange things in such a way that a given client only talks to a single cache server during a single build.
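One way to get that client-to-server affinity (a sketch only, not something bazel-remote provides: HAProxy in front of the replicas with source-IP hashing, so each client keeps hitting the same backend; the addresses are placeholders):

frontend fe_grpc
    mode http
    bind *:9092 proto h2
    default_backend be_bazel_remote

backend be_bazel_remote
    mode http
    balance source
    server cache1 10.0.0.10:9092 proto h2 check
    server cache2 10.0.0.11:9092 proto h2 check
    server cache3 10.0.0.12:9092 proto h2 check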

Evicting items from bazel-remote's proxy backends can also break these assumptions. To avoid this we would need to figure out a way to implement some sort of LRU-like eviction for S3 (but I don't have an AWS account to do this work myself).
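In the meantime, an age-based stopgap (not true LRU) is an S3 lifecycle rule that expires blobs some number of days after creation, which at least bounds bucket growth. A minimal example policy, with an arbitrary 30-day window:

{
  "Rules": [
    {
      "ID": "expire-cache-blobs",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 30}
    }
  ]
}

Note that expiry can remove blobs a still-running build expects to find, which is exactly the coherence hazard described above.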