[req]: spurious retransmissions
grrtrr opened this issue · 8 comments
After updating go-mangos from e3181ec, we are experiencing spurious retransmissions in the req code:
- using message-level (i.e. our own) per-message UUIDs (v1, using 100nsec counter),
- we see that each message is sent twice, for no apparent reason.
Example log output (changed/redacted proprietary things):
Dec 22 14:49:10 Warning: [req.go:181] OUT => f1768ed3-e761-11e7-bc48-000c29cbc297 CommandDone
Dec 22 14:49:10 Warning: [req.go:115] RESEND <= f1768ed3-e761-11e7-bc48-000c29cbc297 CommandDone
Dec 22 14:49:10 Warning: [req.go:181] OUT => f1768ed3-e761-11e7-bc48-000c29cbc297 CommandDone
...
Dec 22 14:49:01 Warning: [req.go:115] RESEND <= eb99b1c8-e761-11e7-bc48-000c29cbc297 Update
Dec 22 14:49:01 Warning: [req.go:181] OUT => eb99b1c8-e761-11e7-bc48-000c29cbc297 Update
Dec 22 14:49:01 Warning: [req.go:181] OUT => eb99b1c8-e761-11e7-bc48-000c29cbc297 Update
The second example looks as if resender() queued the message again before sender() had a chance to put it on the wire.
It seems to me it has to do with the changes in dc23bea#diff-c85d1cbd91ad6fd70f18d6c17457f908, so I instrumented the code as follows:
- OUT appears on the line immediately before pe.ep.SendMsg(m),
- RESEND appears in resender(), immediately before r.resend <- m requeues the message.
I have another instrumentation point right after pe.ep.SendMsg(m) which prints the SendMsg() error (e.g. "tls: use of closed connection"). When that scenario happens, there are
- 2 x OUT messages per UUID (as in the non-error case),
- 2 x RESEND messages (all other cases log only the resender() requeuing).
This may have to do with the fix for #286; I am trying to get at the root cause. I hope this is sufficient to locate the changed behaviour. For us the spurious retransmissions cause problems, since each message is now sent at least twice.
Update
I narrowed it down to AddEndpoint:
- on the first send, no remote endpoint is present yet, so r.pend == nil,
- hence AddEndpoint puts true into the wakeq,
- which wakes up the resender() (toy sketch below).
As far as I understand #286, the problem it describes concerns only the receiver, i.e. not the sender code above.
I also noted that r.pend is set on each iteration of the sender() loop, which did not seem necessary.
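To make the chain concrete, here is a toy model (not the mangos code, just its shape): a request is pending, r.pend == nil, AddEndpoint pokes the buffered wakeq, and the resender requeues the message before the original copy has gone out, so the same request is transmitted twice.

package main

import "fmt"

func main() {
    wakeq := make(chan bool, 1)   // buffered wake channel
    sendq := make(chan string, 2) // stand-in for the outgoing queue

    reqmsg := "eb99b1c8" // a request is pending (r.reqmsg != nil)
    var pend *struct{}   // r.pend == nil: nothing has been sent on any pipe yet

    // AddEndpoint: the pending request has nowhere to go yet, so wake the
    // resender (this is the block removed by the diff below).
    if pend == nil && reqmsg != "" {
        select {
        case wakeq <- true:
        default:
        }
    }

    // resender: woken up, requeues the request.
    select {
    case <-wakeq:
        fmt.Println("RESEND <=", reqmsg)
        sendq <- reqmsg
    default:
    }

    // sender: puts the original request on the wire, then the requeued copy.
    sendq <- reqmsg
    for len(sendq) > 0 {
        fmt.Println("OUT =>", <-sendq)
    }
}

Running it prints the RESEND/OUT/OUT sequence from the second log excerpt above.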
This is my current AddEndpoint (diff below); with this modification I don't see the spurious retransmissions anymore:
func (r *req) AddEndpoint(ep mangos.Endpoint) {
    r.init.Do(func() {
        r.w.Add()
        go r.resender()
    })
    pe := &reqEp{cq: make(chan struct{}), ep: ep}
    r.Lock()
    r.eps[ep.GetID()] = pe
    r.pend = pe
    r.Unlock()
    go r.receiver(ep)
    r.w.Add()
    go r.sender(pe)
}
diff:
--- a/protocol/req/req.go
+++ b/protocol/req/req.go
@@ -159,14 +175,12 @@ func (r *req) sender(pe *reqEp) {
             return
         }
-        r.Lock()
-        r.pend = pe
-        r.Unlock()
@@ -197,16 +211,12 @@ func (r *req) AddEndpoint(ep mangos.Endpoint) {
     })
     pe := &reqEp{cq: make(chan struct{}), ep: ep}
+
     r.Lock()
     r.eps[ep.GetID()] = pe
-    // If we were waiting for somewhere to send this message, try it now.
-    if r.pend == nil && r.reqmsg != nil {
-        select {
-        case r.wakeq <- true:
-        default:
-        }
-    }
+    r.pend = pe
     r.Unlock()
+
     go r.receiver(ep)
     r.w.Add()
     go r.sender(pe)
Will ponder this for a while and test. Can send PR if this looks good to you.
So looking at this, the old code was trying to ensure that if a message was waiting for an output pipe (because we tried to send before one was available, or the pipe the original send went out on closed), we immediately send it instead of waiting for the retry timeout.
Your diff loses both that behavior and the semantics of the .pend member. (The .pend member was meant to track that we delivered a message out on the named pipe and have not yet received a reply to it.)
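Roughly, the intent was the following (a toy sketch of the bookkeeping only, not the actual req.go code):

package main

import "fmt"

type pipe struct{ id uint32 }

type reqState struct {
    reqmsg *string // outstanding request, nil if none
    pend   *pipe   // pipe the request went out on, nil if not sent yet
}

func (r *reqState) send(p *pipe) {
    fmt.Printf("OUT => %s on pipe %d\n", *r.reqmsg, p.id)
    r.pend = p // delivered; now awaiting the reply on this pipe
}

func (r *reqState) gotReply() {
    r.reqmsg, r.pend = nil, nil // matched reply: nothing outstanding anymore
}

func (r *reqState) addEndpoint(p *pipe) {
    if r.pend == nil && r.reqmsg != nil {
        // A request was waiting for somewhere to go: send it now instead of
        // waiting for the retry timer (the real code wakes the resender
        // rather than sending directly).
        r.send(p)
    }
}

func main() {
    msg := "eb99b1c8"
    r := &reqState{reqmsg: &msg}
    r.addEndpoint(&pipe{id: 7}) // pipe shows up after the send was attempted
    r.gotReply()
}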
I will need to do my own debug. Sorry.
My above conclusions were premature; please see the comments in #286. The starting point was a hanging receiver, which I tracked down to an infinite receive timeout.
So yes, my above diff is wrong. I am questioning whether the entire fix for #286 was necessary, since it introduced at least two distinct new problems (a panic on a closed channel, and the spurious retransmissions). I am currently testing reverting the fix for #286 entirely, which seems to fix both of these problems. Please let me know your thoughts.
For the record, I would support changing the default queue lengths somewhat -- REQ sockets in particular should not have deep queues unless they are in raw mode. (Raw mode needs them to avoid losing messages in a device configuration.) If it were not for devices, even REP sockets could have short queues.
Retry timeouts need to reflect the worst-case time for completion of a job, multiplied by the sum of all the queue depths in the middle, plus any allowance required to permit multiple REQs to be serviced by a single REP worker. Most of the stuff I've done has assumed "fast" (close to zero) cost of effort on the REP side. If your REP server is doing significant computational work, or is blocking on other things (disk I/O, database lookups, etc.), then more tuning will be required. The documentation could be a bit more helpful here.
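As a rough example of that arithmetic (made-up numbers, and the usual socket options):

package main

import (
    "log"
    "time"

    "github.com/go-mangos/mangos"
    "github.com/go-mangos/mangos/protocol/req"
)

func main() {
    sock, err := req.NewSocket()
    if err != nil {
        log.Fatal(err)
    }
    defer sock.Close()

    worstJob := 200 * time.Millisecond // worst-case time for the REP side to finish one job
    queueSum := 4                      // sum of the queue depths between this REQ and the REP
    slack := 500 * time.Millisecond    // allowance for several REQs sharing one REP worker

    retry := worstJob*time.Duration(queueSum) + slack
    if err := sock.SetOption(mangos.OptionRetryTime, retry); err != nil {
        log.Fatal(err)
    }

    // Keep the cooked-mode REQ queues shallow, per the comment above.
    if err := sock.SetOption(mangos.OptionWriteQLen, 1); err != nil {
        log.Fatal(err)
    }
    if err := sock.SetOption(mangos.OptionReadQLen, 1); err != nil {
        log.Fatal(err)
    }
}

With those numbers the retry lands at 1.3 seconds; the point is only the shape of the calculation, not the particular values.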
But queue length tuning is sort of out of scope for this particular problem, which is that RETRY does not get sent immediately.
I'm leaning more towards backing out the change and refactoring it, before reintegration.
I have a theory that the problem may be the slight buffering of the wakeq -- meaning multiple "wakes" are getting queued, causing the resender to wake up multiple times. Stay tuned.
(What I think is that I can make the wakeq unbuffered, which is probably what I should have done all along. I was trying to emulate a condition variable. I wish sometimes I had not built this entire framework around go channels....)
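A toy illustration of the difference (not the mangos code): a non-blocking poke on a buffered wakeq gets parked even when nobody is waiting and wakes the resender later for no current reason, while on an unbuffered channel the same poke is dropped unless the resender is already blocked in its select -- which is the condition-variable behaviour I was after.

package main

import "fmt"

// poke is the non-blocking wake pattern from AddEndpoint above.
func poke(wakeq chan bool) {
    select {
    case wakeq <- true:
    default:
    }
}

// drain plays the resender catching up later, counting parked wakes.
func drain(name string, wakeq chan bool) {
    n := 0
    for {
        select {
        case <-wakeq:
            n++
        default:
            fmt.Printf("%s: resender woken %d time(s)\n", name, n)
            return
        }
    }
}

func main() {
    buffered := make(chan bool, 1) // assume a capacity of 1 here
    poke(buffered)                 // e.g. from AddEndpoint
    poke(buffered)                 // e.g. from a pipe close; buffer already full
    drain("buffered", buffered)    // woken 1 time(s), although nothing needed it

    unbuffered := make(chan bool)
    poke(unbuffered) // nobody is receiving, so the wake is dropped
    poke(unbuffered)
    drain("unbuffered", unbuffered) // woken 0 time(s)
}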