buildbarn/bb-storage

Support highly available deployments

joeljeske opened this issue · 11 comments

It would be very valuable to support either an active-active or even active-passive deployments to provide a HA deployment configuration option. From my reading of the other blobstore options, I do not see a configuration set that would provide HA characteristics, especially for a local backend.

// Store blobs in two backends. Blobs present in exactly one backend
// are automatically replicated to the other backend.
//
// This backend does not guarantee high availability, as it does not
// function in case one backend is unavailable. Crashed backends
// need to be replaced with functional empty instances. These will
// be refilled automatically.
MirroredBlobAccessConfiguration mirrored = 14;

It appears that the mirrored blobstore configuration does not provide highly available deployments, although its not clear to me why. This would seem like the most likely candidate for HA and would provide active-active configuration.

My goal is to be able to upgrade storage nodes with zero downtime. Without backing from an external store (e.g. S3), I am not seeing an option that would enable zero downtime storage deployments.

Do you have alternative suggestions?

It appears that the mirrored blobstore configuration does not provide highly available deployments, although its not clear to me why.

It's because clients such as Bazel are fairly picky about how accurate calls like FindMissingBlobs() are. If you're using features like --remote_download_minimal, then you cannot lose any blobs during the lifetime of the build. If a client calls GetActionResult(), then the objects referenced by the ActionResult must remain present during the remainder of the build.

If you were to have a best effort MirroredBlobAccess, then there is no way you can ensure this constraint is met. With small amounts of network flakiness you may run into situations where a blob only gets written to storage backend A, but a subsequent FindMissingBlobs() is sent to backend B. By enforcing that Write() and FindMissingBlobs() always go to both halves, we ensure that replies remain consistent.

Please refer to this section of ADR#2, which covers this topic in detail:
https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md#storage-requirements-of-bazel-and-buildbarn

My goal is to be able to upgrade storage nodes with zero downtime. Without backing from an external store (e.g. S3), I am not seeing an option that would enable zero downtime storage deployments.

Zero downtime? Sure, that's not possible with the current set of tools. Brief downtime? That is most certainly possible with MirroredBlobAccess. And brief downtime is good enough, because you can just set --remote_retries on the client side, so that it stalls for a bit while the upgrade takes place. You want to pass in that option regardless, as network flakiness may also occur in the path leading up to the build cluster.

I've been operating a setup for the last two years where I alternate rollouts between the A and B halves. Week n, I upgrade the A nodes. Week n+1, I upgrade the B nodes. MirroredBlobAccess has been carefully designed, so that it's fine with one half losing its data. This means that storage nodes don't even need to be persistent. Every time I do an upgrade, I just replace half of the nodes with empty ones and let replication take care of the rest. The last year or so I even stopped doing maintenance announcements, because I know we can do it without people noticing it.

This approach deals well with unexpected note failures. I just rely on Kubernetes to quickly spin up a replacement pod on some spare node in case node/pod health checks fail. I think that over the last couple of years availability has been around 99.97% (including maintenance), which I think is "HA enough" for a build cluster.

One concern people also have is the cost of LocalBlobAccess on EC2, compared to using an S3 bucket. Sure, it's more expensive per byte, but that's only a fraction of the costs of a Buildbarn cluster. An is4gen.8xlarge storage node would be capable of holding 30 TB of data. It is cheaper than a single r5.24xlarge worker node, and you would only need a few of them relative to the number of worker nodes. Furthermore, because LocalBlobAccess has (in my experience) a far lower latency than S3, you end up making better use of the worker nodes' time. The cost is therefore negligible.

Yes I had read that ADR, thank you for writing it.

When churning "A" half of the mirror with empty instances, don't you also run into situations where "A" had received a blob but perhaps "B" did not due to flake? In this setup, we do rely on each half being correct and complete.

If we are implicitly relying on each node having a complete set, we might as well allow clients to continue to receive blobs from the storage when on side of mirror is down. We can still rely on replication to provide fill in missing blobs in "A" from the time that it was offline.

Of course, you should not churn both sides of storage without considerable time between to allow for natural replication, like a week as you mentioned.

When churning "A" half of the mirror with empty instances, don't you also run into situations where "A" had received a blob but perhaps "B" did not due to flake? In this setup, we do rely on each half being correct and complete.

Every time FindMissingBlobs() is invoked, it is forwarded to both backends. MirroredBlobAccess compares the responses to both backends, and does the following:

All of this is done in a blocking manner. This means that by the time the caller gets the results, it has the guarantee that data is stored on both replicas.

(Note that GetActionResult() against the AC does a FindMissingBlobs() against the CAS, using CompletenessCheckingBlobAccess. Thereby guaranteeing that objects referenced from AC entries are also present on both replicas.)

Oh I see, I was under the impression that replication could occur asynchronously, perhaps by the bb_replicator service.

Do you have anecdotal evidence to support a suitable duration for which storage may be unavailable? I'm wondering if Bazel might use up all of its retries while we wait for a single node to come online. Are we relying on the bb-storage network request timing out to give the new node enough time to come online? I do not see any interval or back-off time limit for Bazel when settings --remote_retries.

I was considering using persistent local bb-storage configuration, as opposed to transient storage. I was supposing that it would put less strain on the storage nodes and it would allow a node restart to not require JIT re-population before servicing build jobs. Do you have a suggestion?

Oh I see, I was under the impression that replication could occur asynchronously, perhaps by the bb_replicator service.

It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.

Do you have anecdotal evidence to support a suitable duration for which storage may be unavailable? I'm wondering if Bazel might use up all of its retries while we wait for a single node to come online. Are we relying on the bb-storage network request timing out to give the new node enough time to come online?

This may sound oddly specific, but I have observed that if an EKS node running bb_storage dies, it takes about eight minutes for EKS to detect it, mark the node as down, and evict the pods. So my advice would be to let Bazel retry for at least 15-30 minutes.

I do not see any interval or back-off time limit for Bazel when settings --remote_retries.

Yeah, Bazel is a bit limited in that regard. Bazel's exponential backoff uses an interval of at most five seconds, meaning that if you set --remote_retries=360, then you have the guarantee that it retries for at least 30 minutes. It would have been easier if Bazel had some kind of --remote_retry_duration flag, but it does not. Be sure to file an issue upstream!

I was considering using persistent local bb-storage configuration, as opposed to transient storage. I was supposing that it would put less strain on the storage nodes and it would allow a node restart to not require JIT re-population before servicing build jobs. Do you have a suggestion?

The amount of strain tends to be pretty small. The hot part of the CAS tends to be relatively small, meaning that after a restart you'll only see high loads on the storage nodes for five or so minutes. Making them ephemeral has the advantage that you can just schedule them as Kubernetes 'deployments' instead of 'daemonsets', meaning that there's a higher probability Kubernetes is able to start them for you.

It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.

Interesting, I didn't realize this. If the load on bb-storage is minimal after coming online, do you have a recommendation on when to use a bb_replicator in terms of number of workers?

This may sound oddly specific, but I have observed that if an EKS node running bb_storage dies, it takes about eight minutes for EKS to detect it, mark the node as down, and evict the pods. So my advice would be to let Bazel retry for at least 15-30 minutes.

That is odd. 8 mins sounds very long for k8s to detect an unhealthy node. Do think the health checks are not tuned properly? Or perhaps are you seeing bb_storage nodes misbehave a while before their health checks start failing? 30 minutes sounds like a long time to attempt retries, I was hoping to have this be as minimal as possible perhaps < 2-3mins. An empty storage node in k8s should be very faster startup, especially if the k8s cluster is sized appropriately and does not need a new EC2 to come online. I'll play around with this and see what it is in my case.

--remote_retries=360

Wow! Have you observed any worst-case scenarios, like if a network call hangs instead of fails quickly (e.g hangs for 1 min), would the build wait for 6 hours before failing? I am using --keep_going which would theoretically make me worried that Bazel would try to traverse each available edge in the DAG and wait for a long time until exhausting its retries and finally fail.

Thanks for the information, Since buildbarn has many various configuration options, its helpful to understand how you have productionalized it, and which is the best path to start down.

Related to the concern of making storage HA, I am wondering if you have attempted to make the scheduler more available or durable. I do realize that bazel --remote_retries might be your answer here as well, but it would still be nice if all runners would not lose their work-in-progress if a scheduler dies. This is especially painful for long running actions.

It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.

Interesting, I didn't realize this. If the load on bb-storage is minimal after coming online, do you have a recommendation on when to use a bb_replicator in terms of number of workers?

I think that for any reasonable production setup it's a requirement to set this up. Bazel tends to issue many FindMissingBlobs() calls that have overlapping sets of objects, so even a single Bazel client with a high --jobs could trigger an undesirable amount of replication traffic if not deduplicated.

This may sound oddly specific, but I have observed that if an EKS node running bb_storage dies, it takes about eight minutes for EKS to detect it, mark the node as down, and evict the pods. So my advice would be to let Bazel retry for at least 15-30 minutes.

That is odd. 8 mins sounds very long for k8s to detect an unhealthy node. Do think the health checks are not tuned properly?

Yes, I am fairly certain of that. Unfortunately, EKS doesn't allow you to tune parameters on the API server. So there is no way to tighten node check intervals/timeouts. We have to rely on EC2 terminating the instance, causing a lifecycle hook event to be generated. Only then will EKS start to evict pods.

--remote_retries=360

Wow! Have you observed any worst-case scenarios, like if a network call hangs instead of fails quickly (e.g hangs for 1 min), would the build wait for 6 hours before failing?

Yeah, that sometimes happens. Most of the times we see calls fail quickly, fortunately. Buildbarn does retrying sparingly, because it simply wants the client to do the retries (so that clients don't observe mysterious hangs). As I said, I think Bazel should just have a --remote_retry_duration or something.

Related to the concern of making storage HA, I am wondering if you have attempted to make the scheduler more available or durable. I do realize that bazel --remote_retries might be your answer here as well, but it would still be nice if all runners would not lose their work-in-progress if a scheduler dies. This is especially painful for long running actions.

Yes, exactly. It can be painful, but at least for my setup, scheduler restarts tend to be rare (once a week, when I do upgrades). I'd rather have some delays now and then than making the scheduler's design more complex (and potentially slower as a result of it). Letting the scheduler be non-persistent has the advantage that it's hard to get it stuck due to corrupted state. If there's a bug that causes it to crash now and then, that's fine. We can just fix it as part of the next release.

@EdSchouten thanks for all the clarifications, I do have a few more questions I'd like to ask. Perhaps I should close this ticket and move to the buildbarn slack channel instead...

One point of clarification;

It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.

Is this stating that FindMissingBlobs will block until the bb_replicator has finished replicating all the items? Does this negatively affect performance when a new empty bb_storage CAS comes online?

Also, I'm considering putting ISCC and AC in Redis, mostly for reliability purposes. I've seen that you stated that it can take up to a month for the ISCC to start stabilizing, and I'd be very concerned about inadvertently losing this data. A managed redis service with backups seems like it could be a good option. Are there any drawbacks that would be concerning to you? Do you have regular backups of the ISCC store?

Also, I did end up creating bazelbuild/bazel#16058 to add --remote_retry_max_delay. We'll see if we can get it reviewed and landed.. :)

It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.

Is this stating that FindMissingBlobs will block until the bb_replicator has finished replicating all the items? Does this negatively affect performance when a new empty bb_storage CAS comes online?

Yes, it will. But considering that the 'hot set' of the build tends to be relatively small compared to the full size of the CAS, performance should come back relatively quickly. I've had great experiences using the following replicator configuration on the bb_replicator side:

            replicator: { deduplicating: { concurrencyLimiting: {
              base: { 'local': {} },
              maximumConcurrency: 100,
            } } },

Also, I'm considering putting ISCC and AC in Redis, mostly for reliability purposes. I've seen that you stated that it can take up to a month for the ISCC to start stabilizing, and I'd be very concerned about inadvertently losing this data. A managed redis service with backups seems like it could be a good option. Are there any drawbacks that would be concerning to you?

No drawbacks that I can think of. That should likely work all right.

Do you have regular backups of the ISCC store?

I don't. I mean, it's important enough that I tend to mirror it, but I don't take any offline backups.

I'm going ahead and close this, considering that it's unlikely that any concrete work items are coming out of it. Feel free to open new issues if needed.