The default backoff policy does not have mandatory stop, which could cause unexpected send timeout
BewareMyPower opened this issue · 5 comments
Expected behavior
For a producer, when the retry time exceeds the left time when a message would fail with timeout error, it should be reduced to be less than the left time.
See ProducerImpl
's constructor in Java client:
this.connectionHandler = new ConnectionHandler(this,
new BackoffBuilder()
.setInitialTime(client.getConfiguration().getInitialBackoffIntervalNanos(), TimeUnit.NANOSECONDS)
.setMax(client.getConfiguration().getMaxBackoffIntervalNanos(), TimeUnit.NANOSECONDS)
.setMandatoryStop(Math.max(100, conf.getSendTimeoutMs() - 100), TimeUnit.MILLISECONDS)
.create(),
There is a mandatoryStop
field.
Actual behavior
The default backoff policy simply increases the next delay until the max retry time (60s).
Steps to reproduce
func TestBackoff_NextMinValue(t *testing.T) {
backoff := &DefaultBackoff{}
for i := 0; i < 10; i++ {
delay := backoff.Next()
fmt.Println(delay)
}
}
Modify the test above in backoff_test.go
and run go test -run TestBackoff_NextMinValue
in the same directory.
111.555037ms
213.388077ms
466.046518ms
940.55396ms
1.614542567s
3.383705308s
6.806575903s
15.159677039s
25.637999048s
1m0.834600978s
System configuration
Pulsar version: x.y
Go and Java design are different. In the Go, we have a gorotinue to check the timeout.
Maybe your application needs to wait for 60 seconds, is it right? I'm improving this: #1256
What problem have you encountered?
There is a case that the server side took 25 seconds to recover while the client side failed with timeout.
we have a gorotinue to check the timeout.
I didn't mean the Golang client does not check the timeout, let's take a look at the following case (reusing the delay I ran before)
- 0.000: reconnect after 0.111s
- 0.111: reconnect after 0.213s
- 0.324: reconnect after 0.466s
- 0.790: reconnect after 0.940s
- 1.730: reconnect after 1.614s
- 3.344: reconnect after 3.383s
- 6.727: reconnect after 6.806s
- 13.533: reconnect after 15.159s
- 28.692: reconnect after 25.637s
- 29.500: assume the server was recovered at this moment
- 30.000: the message in the pending queue failed with timeout
- 54.329: the reconnection succeeds. However, the message failed 20 seconds ago
Let's see how Java client solves the issue.
.setMandatoryStop(Math.max(100, conf.getSendTimeoutMs() - 100), TimeUnit.MILLISECONDS)
Backoff backoff = new Backoff(100, TimeUnit.MILLISECONDS, 60, TimeUnit.SECONDS, 30000 - 100, TimeUnit.MILLISECONDS);
for (int i = 0; i < 10; i++) {
System.out.println(backoff.next());
}
100
195
398
799
1525
3136
6272
12157
24279
29355
Though it's a little different than I've thought. But generally, having a delay greater than the send timeout does not make sense.
It looks like you want to reduce the maximum retry time and retry interval.
The problem I see with this change is that it allows misconfigured clients that retry too quickly, to overwhelm brokers with a flood of requests.
Please see #454
We can consider https://aws.amazon.com/cn/blogs/architecture/exponential-backoff-and-jitter/ design.
Yes I want to limit the max retry time according to the send timeout config.
The exponential backoff mechanism is good that we can support it in future. But currently we'd better adopt the same behavior with the Java client.
Ok, I see. We need to reduce the retry interval, not limit the max retry time, this is the root cause.
https://github.com/cenkalti/backoff/blob/v4/exponential.go#L39-L50 provides an example, could you checkout?
BTW, the setMandatoryStop
seems equal to the max retry, is it right?