nats-io/nats-server

Support pausing and resuming consumers

ripienaar opened this issue · 15 comments

Proposed change

Introduce an API on $JS.API.CONSUMER.PAUSE.*.* that takes as request:

type JSApiConsumerPauseRequest struct {
    PauseUntil *time.Time `json:"pause_until,omitempty"`
}

The consumer will put itself in a paused state but continue to handle acks for in-flight messages. No further messages will be delivered after this point; apart from deliveries being inhibited, the consumer functions as usual.

If a future time is given, a timer will auto-resume the consumer at that time. If no time or a time in the past is given, a paused consumer will resume.

Consumer info includes 2 new fields:

  Paused bool `json:"paused,omitempty"`
  PauseRemaining time.Duration `json:"pause_remaining,omitempty"`

The paused state and time would need to be persisted to the raft layer such that server restarts would not unpause paused consumers. This is done using the consumer configuration, which has a new value:

PauseUntil time.Time `json:"pause_until,omitempty"`

When given at create time this creates a paused consumer. It is not updatable at runtime using a configuration update; the only way to change it post-create is with the PAUSE API, which updates this setting.

Advisories for pause and unpause to be added on io.nats.jetstream.advisory.v1.consumer_pause with pertinent info.

Use case

It is difficult to schedule maintenance on central resources on a large distributed system where 100s or 1000s of clients are accessing data in a stream.

We would like to be able to pause a Consumer such that it appears healthy but just doesn't deliver any messages.

During the pause, maintenance can happen and resources accessed by clients will not be under constant pressure. Later the consumer can be unpaused and work will continue.

This would happen without impacting running clients - other than they would see pending messages in stream info but not get any deliveries.

This would apply to push and pull consumers.

Contribution

No response

Should delay just be a parseable string? "1s", "2h"? If we can't parse we return an error.

Do we want to have maximum and minimums or start simple and add in limits as needed?

We don’t have other cases of such strings in the API, and it’s also a bit Go-centric, so Duration seems best. Let UIs handle it as they wish, be it strings like that in the CLI or some kind of picker on the web.

Let’s start simple.

OK, but if we use time.Duration then it's nanos, not millis. But I hear you on consistency.

Indeed - nanos. Will fix.

@neilalexander and @Jarema could you work with @ripienaar and this writeup and schedule this work?

@derekcollison this has been scheduled to start on the 5th of February, with a plan to finish before the 16th of February. @neilalexander will be working on it.

@ripienaar @neilalexander Can I ask for an update of the final design after recent discussions?

From my perspective the pause/resume APIs are still the right direction. How we actually implement that in a way that's not massive plumbing in the server is for @neilalexander to comment on.

I vote it should just be part of the consumer config, with no new API endpoints.

At this point I'd say let's just not add this feature. We can go back and find requirements.

As it stands, the few requirements we do have will not be met without these extra APIs, so let's just close the issue and move on.

I thought it would be easier, but not impossible. You are saying they would require securing just that functionality vs. a general update, yes? And without general callouts we only have new APIs to secure individually, is that correct?

Yes, I think there is a need to cater for 2 distinct users - operational needs and configuration needs. Often configuration may not be changed without approvals by change advisory boards etc.

Doing maintenance should not require a configuration change.

Those doing maintenance should not need to be authorized to do a configuration change.
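To make the point concrete: because pause has its own subject, a publish allow-list can grant it without granting consumer creation or update. The sketch below is a minimal, self-written matcher for NATS-style subject wildcards (`*` for one token, `>` for one or more trailing tokens), not code from the server. The operator allow-list and the stream and consumer names are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether subject matches a NATS-style pattern where
// "*" matches exactly one token and a trailing ">" matches one or more tokens.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// ">" is only valid as the last token and needs at least one token.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(p) == len(s)
}

// allowed checks a subject against a publish allow-list.
func allowed(allow []string, subject string) bool {
	for _, pattern := range allow {
		if subjectMatches(pattern, subject) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical allow-list for a maintenance operator: may pause or
	// resume any consumer, but may not create or update one.
	operator := []string{"$JS.API.CONSUMER.PAUSE.*.*"}

	fmt.Println(allowed(operator, "$JS.API.CONSUMER.PAUSE.ORDERS.worker"))  // pause is permitted
	fmt.Println(allowed(operator, "$JS.API.CONSUMER.CREATE.ORDERS.worker")) // create/update is not
}
```

If pause were only a configuration field, the second subject would have to be allowed as well, which is the authorization concern raised in this thread.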

Capturing a discussion that keeps coming up around this one:

Question: Should the paused until configuration be updatable as configuration?
Answer: We have the pattern where updates to consumer configs are idempotent, and as a result applications often set their configuration at startup. We added the action to help distinguish a bit; it's problematic though, as that is not something one can do authz against today.

Given this pattern, the question is who owns this property. If an administrator sets the pause state to x and the app starting up sets it to start-paused or unpaused, how is the system to distinguish between a normal app making the API call to create a paused/unpaused consumer and an admin asking the consumer to be paused?

I don't think the API has the context of who is calling it and for what reason, and it would be undesirable to allow an unexpected config update by a starting worker to unpause a consumer.

It's essential that the responsibilities of creation and administration be separate here. A consumer could be created paused, but an administrator must be able to unpause it and know that if that creation is run again it will not be paused again. Likewise, if an administrator overrides the pause from 1 hour to 10 minutes, a service startup must not set it back to 1 hour.

I can't think of a way to capture this distinction (except maybe (ab)using the action property? But see the authz comments and those about roles and responsibilities). Happy to hear if there's a design solution that both allows this property to be updated as config and retains the ownership of who has responsibility for its management.

Related server PR #5066

Server PR has been merged 🎉
Closing the issue.
This feature will be part of release 2.11