valyala/fasthttp

The test named TestHostClientMaxConnWaitTimeoutError fails under certain conditions.

newacorn opened this issue · 1 comments

func TestHostClientMaxConnWaitTimeoutSuccess(t *testing.T) {

Reproduce

❯ go test -run=TestHostClientMaxConnWaitTimeoutError -count=2000  -failfast  .
--- FAIL: TestHostClientMaxConnWaitTimeoutError (0.21s)
    client_test.go:3021: connsWait has 1 items remaining
FAIL
FAIL	github.com/valyala/fasthttp	24.859s
FAIL

Cause

fasthttp/client.go

Lines 1692 to 1704 in a1db411

if q := c.connsWait; q != nil && q.len() > 0 {
for q.len() > 0 {
w := q.popFront()
if w.waiting() {
go c.dialConnFor(w)
dialed = true
break
}
}
}
if !dialed {
c.connsCount--
}

fasthttp/client.go

Lines 1747 to 1758 in a1db411

if q := c.connsWait; q != nil && q.len() > 0 {
for q.len() > 0 {
w := q.popFront()
if w.waiting() {
delivered = w.tryDeliver(cc, nil)
break
}
}
}
if !delivered {
c.conns = append(c.conns, cc)
}

These two pieces of code assume that when w.waiting() returns true, we can definitely wake up a valid waiter, and then have that waiter continue to wake up the next one upon exiting.

However, there is a window of time where w.waiting() may return true, but w could time out and set the value of wantConn.err afterward, causing the wake-up chain to be interrupted.

The more serious issue is not whether c.connsWait.len() returns zero after all client calls return, but rather that there are errors in the length of c.conns and changes in the c.connsCount count. This can lead to a situation where, during a connsCount semaphore release, there are still valid waiters in c.connsWait.

Issue 1: The semaphore released for connsCount does not increase the length of c.conns nor does it decrement the c.connsCount count.
Issue 2: Even if the release is reflected in the length of c.conns or the calculation of c.connsCount, the problem lies in that it should have passed the semaphore but instead chose to release it.
Let's first address the connsCount semaphore issue.
Next, resolve the issue with the TestHostClientMaxConnWaitTimeoutError test case.

Thanks for diving into this, and writing two pull requests to fix both issues!