Canceling context randomly breaks everything
Barben360 opened this issue · 11 comments
There seems to be a serious concurrency problem in go-pg.
When there are concurrent calls and a context gets cancelled in one of them, the whole connection pool appears to break.
To reproduce it, I have written the following test:
func TestConcurrencyAndUserCancel(t *testing.T) {
    const concurrency = 10
    const iterations = 1000
    const cancelProbability = 0.1

    app, stop := InitScenario(t)
    defer stop()

    ctxBkg := context.Background()
    cancelWaitMin := 1 * time.Microsecond
    cancelWaitMax := 5 * time.Millisecond
    cancelProbabilityScaled := int(cancelProbability * 1000000)

    obj := models.SomeObj{}

    wg := sync.WaitGroup{}
    wg.Add(concurrency)
    for i := 0; i < concurrency; i++ {
        go func(wg *sync.WaitGroup, i int) {
            defer wg.Done()
            for j := 0; j < iterations; j++ {
                // Are we gonna cancel our call ?
                cancelWait := time.Duration(0)
                val := rand.Intn(1000000)
                if val <= cancelProbabilityScaled {
                    cancelWait = time.Duration(rand.Int63n(int64(cancelWaitMax-cancelWaitMin))) + cancelWaitMin
                    t.Logf("Canceling add statement in %v", cancelWait)
                }

                // Adding object
                ctx := ctxBkg
                cancel := func() {}
                if cancelWait != 0 {
                    ctx, cancel = context.WithTimeout(ctx, cancelWait)
                }
                objCopy := obj
                err := app.Obj.Add(ctx, &objCopy)
                cancel()
                if err != nil {
                    if cancelWait == 0 {
                        t.Errorf("An error occurred for concurrent %d - iteration %d - call 1: %v", i, j, err)
                    }
                    continue
                }

                // Editing object
                objCopy.SomeField = "some string"
                err = app.Obj.Edit(ctxBkg, &objCopy)
                if err != nil {
                    t.Errorf("An error occurred for concurrent %d - iteration %d - call 3: %v", i, j, err)
                }

                // Deleting object
                err = app.Obj.Delete(ctxBkg, objCopy.Id)
                if err != nil {
                    t.Error(err)
                }
            }
        }(&wg, i)
    }

    doneChan := make(chan bool)
    go func(doneChan chan bool) {
        wg.Wait()
        close(doneChan)
    }(doneChan)

    select {
    case <-doneChan:
    case <-time.After(10 * time.Minute):
        t.Error("Test timeout")
    }
}
In this test, 10 concurrent goroutines each run 1000 sequential INSERT/EDIT/DELETE operations (all executed inside RunInTransaction()). With a probability of 10%, the context of the INSERT's transaction is cancelled after a random timeout.
I'm using a pool of 10 connections.
- With a context-cancel probability of 0%, this test always passes, whatever the concurrency between 1 and 10.
- With a context-cancel probability of 10%, this test sometimes breaks with a concurrency of 5.
- With a context-cancel probability of 10%, this test breaks nearly every time with a concurrency of 10.
What happens is that after one of the many random context cancels, all subsequent calls from all goroutines break with:
tx.go:97: tx.Rollback failed: pg: Conn is in a bad state: write tcp 172.24.0.4:37304->172.24.0.3:5432: i/o timeout
and the client and server never communicate again.
This is extremely annoying because the whole application may become indefinitely unresponsive until restarted, especially under high load.
Please provide the full program. I need to be able to reproduce this as well.
I can't provide my full corporate app of course, but I'll make you a full program reproducing.
"fun" fact, we are also debugging what seems to be the same problem. Every now and then one process gets a connection in this broken state and it will error constantly.
@Barben360 if you haven't already, I recommend setting MaxConnAge and IdleCheckFrequency to something reasonable. This way the bad connection will be removed from the pool at some point. This is not a solution, but it is definitely better than nothing while this gets fixed :)
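A minimal sketch of that mitigation, assuming illustrative values only; the newDB helper and the connection details are placeholders, not recommendations:

package example

import (
    "time"

    "github.com/go-pg/pg/v10"
)

// newDB configures the pool so that a connection stuck in a bad state is
// eventually retired instead of living forever.
func newDB() *pg.DB {
    return pg.Connect(&pg.Options{
        Addr:               "localhost:5432",
        User:               "user",
        Password:           "password",
        Database:           "db",
        MaxConnAge:         10 * time.Minute, // recycle connections after this age
        IdleCheckFrequency: 30 * time.Second, // how often idle connections are checked and reaped
        IdleTimeout:        time.Minute,      // close connections that stay idle this long
    })
}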
Here is some more evidence from our PG logs:
2021-04-26 07:21:53 UTC:user@db:[7251]:ERROR: canceling statement due to user request
2021-04-26 07:21:53 UTC:user@db:[7251]:STATEMENT: COMMIT
2021-04-26 07:21:53 UTC:user@db:[7251]:ERROR: current transaction is aborted, commands ignored until end of transaction block
2021-04-26 07:21:53 UTC:user@db:[7251]:STATEMENT: INSERT INTO ... RETURNING *
2021-04-26 07:21:53 UTC:user@db:[7251]:ERROR: current transaction is aborted, commands ignored until end of transaction block
7251 is the "guilty" process ID (%p); after this block it returns current transaction is aborted all the time.
I have produced ready-to-run code that reproduces this issue.
Hope this helps with fixing the bug.
@Barben360 it is not clear what makes you think that program reproduces the problem. I just see a bunch of context deadline exceeded errors here and there (because you cancel the context), and tx.Rollback failed: pg: Conn is in a bad state does not indicate a problem. What else am I missing?
@vmihailenco we finally found the problem. We believe the issue is related to the PG cancel request.
With enough bad luck (or traffic/latency), the statement that gets cancelled is not the one causing the slowdown but one that is sent afterwards. If the statement getting cancelled is a COMMIT or a ROLLBACK, the connection goes back to the pool and fails instantly with current transaction is aborted.
@AnatolyRugalev perhaps you can provide some extra information on how we patched this on our fork; not sure if this is PR-worthy, but I am sure sharing more information on the problem will help.
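To make that failure mode concrete, here is a minimal, hypothetical reproduction sketch. The connection details are placeholders, and the race is probabilistic, so the cancel only sometimes lands on the COMMIT rather than on the slow statement:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/go-pg/pg/v10"
)

func main() {
    db := pg.Connect(&pg.Options{
        Addr:     "localhost:5432",
        User:     "user",
        Password: "password",
        Database: "db",
        PoolSize: 1, // a single connection makes the broken state easy to observe
    })
    defer db.Close()

    // Pick a deadline that tends to fire around commit time.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Millisecond)
    defer cancel()

    err := db.RunInTransaction(ctx, func(tx *pg.Tx) error {
        _, err := tx.ExecContext(ctx, "SELECT pg_sleep(0.002)")
        return err
    })
    fmt.Println("transaction error:", err) // usually: context deadline exceeded

    // If the cancel request hit the COMMIT, the connection went back to the
    // pool with an open, aborted transaction, and on affected versions this
    // follow-up query fails with "current transaction is aborted, commands
    // ignored until end of transaction block".
    _, err = db.Exec("SELECT 1")
    fmt.Println("follow-up error:", err)
}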
@tbarbugli thanks - that makes sense. What do you think about #1885?
@vmihailenco Actually, the errors displayed in the log are only those that occur on non-cancelled requests; the errors from cancelled requests (which are legitimate) are hidden. This shows that cancelling queries also has an impact on later, uncancelled queries.
Hey @vmihailenco
As @tbarbugli described, we experienced a problem where, in high-concurrency situations, cancelRequest cancels the wrong query (COMMIT/ROLLBACK), which breaks the connection until pg recycles it. The issue that @Barben360 raised here probably has a different cause.
I did some tests and found that @Barben360's test project indeed triggers some errors that we can't see in our pretty old forked version (v10.7.1 + some backports). Smells like a regression to me although it's unrelated to our issue.
This is how we solved it in our fork:
// closeIfError prevents putting broken transaction-bound connection back into connection pool.
func (tx *Tx) closeIfError(ctx context.Context, err error, op string) error {
    if err != nil {
        sticky := tx.db.pool.(*pool.StickyConnPool)
        sticky.SetConnError(err) // <- we had to introduce this in order for `Reset` to work
        _ = sticky.Reset(ctx)
        err = fmt.Errorf("%w (connection closed on %s)", err, op)
    }
    return err
}

func (tx *Tx) CommitContext(ctx context.Context) error {
    ctx = internal.UndoContext(ctx)
    _, err := tx.ExecContext(ctx, "COMMIT")
    err = tx.closeIfError(ctx, err, "COMMIT")
    tx.close()
    return err
}

func (tx *Tx) RollbackContext(ctx context.Context) error {
    ctx = internal.UndoContext(ctx)
    _, err := tx.ExecContext(ctx, "ROLLBACK")
    err = tx.closeIfError(ctx, err, "ROLLBACK")
    tx.close()
    return err
}
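For context, a hedged caller-side sketch of how these patched paths get exercised; the items table and the insertItem helper are hypothetical, and the comments assume the fork behavior described above:

package example

import (
    "context"

    "github.com/go-pg/pg/v10"
)

// insertItem is a hypothetical caller of the patched CommitContext/RollbackContext.
func insertItem(ctx context.Context, db *pg.DB, name string) error {
    tx, err := db.BeginContext(ctx)
    if err != nil {
        return err
    }
    if _, err := tx.ExecContext(ctx, "INSERT INTO items (name) VALUES (?)", name); err != nil {
        // In the patched fork, a cancelled ROLLBACK closes the connection
        // instead of handing an aborted transaction back to the pool.
        _ = tx.RollbackContext(ctx)
        return err
    }
    // Likewise for a cancelled COMMIT.
    return tx.CommitContext(ctx)
}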
We have a test suite which reproduces this problem pretty consistently; I will backport your commit to see if it mitigates the issue on our end.
Okay, please try v10.9.3. Your fix looks good, but I think it should be applied more generally...