go-pg/pg

Canceling context randomly breaks everything

Barben360 opened this issue · 11 comments

There seems to be a serious concurrency problem in go-pg.

When there are concurrent calls and a context gets cancelled in one of them, the whole connection pool appears to break.

To reproduce it, I have written the following test:

func TestConcurrencyAndUserCancel(t *testing.T) {
	const concurrency = 10
	const iterations = 1000
	const cancelProbability = 0.1

	app, stop := InitScenario(t)
	defer stop()
	
	ctxBkg := context.Background()

	cancelWaitMin := 1 * time.Microsecond
	cancelWaitMax := 5 * time.Millisecond

	cancelProbabilityScaled := int(cancelProbability * 1000000)

	obj := models.SomeObj{}

	wg := sync.WaitGroup{}
	wg.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go func(wg *sync.WaitGroup, i int) {
			defer wg.Done()
			for j := 0; j < iterations; j++ {
				// Are we gonna cancel our call ?
				cancelWait := time.Duration(0)
				val := rand.Intn(1000000)
				if val <= cancelProbabilityScaled {
					cancelWait = time.Duration(rand.Int63n(int64(cancelWaitMax-cancelWaitMin))) + cancelWaitMin
					t.Logf("Canceling add statement in %v", cancelWait)
				}

				// Adding object
				ctx := ctxBkg
				cancel := func() {}
				if cancelWait != 0 {
					ctx, cancel = context.WithTimeout(ctx, cancelWait)
				}
				objCopy := obj
				err := app.Obj.Add(ctx, &objCopy)
				cancel()
				if err != nil {
					if cancelWait == 0 {
						t.Errorf("An error occurred for concurrent %d - iteration %d - call 1: %v", i, j, err)
					}
					continue
				}

				// Editing object
				objCopy.SomeField = "some string"
				err = app.Obj.Edit(ctxBkg, &objCopy)
				if err != nil {
					t.Errorf("An error occurred for concurrent %d - iteration %d - call 3: %v", i, j, err)
				}

				// deleting object
				err = app.Obj.Delete(ctxBkg, objCopy.Id)
				if err != nil {
					t.Error(err)
				}
			}
		}(&wg, i)
	}
	doneChan := make(chan bool)
	go func(doneChan chan bool) {
		wg.Wait()
		close(doneChan)
	}(doneChan)

	select {
	case <-doneChan:
	case <-time.After(10 * time.Minute):
		t.Error("Test timeout")
	}
}

In this test, 10 goroutines each perform 1000 sequential INSERT/EDIT/DELETE operations (all run inside RunInTransaction()). With a probability of 10%, the context of the INSERT's transaction is cancelled after a random timeout.

I'm using a pool of 10 connections.

  • With a context-cancel probability of 0%, this test always passes, whatever the concurrency between 1 and 10.
  • With a context-cancel probability of 10%, this test sometimes fails with a concurrency of 5.
  • With a context-cancel probability of 10%, this test fails nearly every time with a concurrency of 10.

What happens is that after one of the many random context cancels, all subsequent calls from all goroutines fail with:

tx.go:97: tx.Rollback failed: pg: Conn is in a bad state: write tcp 172.24.0.4:37304->172.24.0.3:5432: i/o timeout

and the client and server never communicate again.

This is extremely annoying because the whole application can become indefinitely unresponsive until it is restarted, especially under high load.

Please provide the full program. I need to be able to reproduce this as well.

> Please provide the full program. I need to be able to reproduce this as well.

I can't provide my full corporate app of course, but I'll make you a full program reproducing.

"Fun" fact: we are also debugging what seems to be the same problem. Every now and then one process gets a connection in this broken state, and it errors constantly.

@Barben360 if you did not do it already, I recommend setting MaxConnAge and IdleCheckFrequency to something reasonable. This way the bad connection will eventually be removed from the pool. It is not a solution, but it is definitely better than nothing while this gets fixed :)
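As a sketch of that workaround, assuming go-pg v10's pg.Options (the address, credentials, and durations below are illustrative, not recommendations):

```go
package main

import (
	"time"

	"github.com/go-pg/pg/v10"
)

func main() {
	db := pg.Connect(&pg.Options{
		Addr:     "localhost:5432", // hypothetical connection details
		User:     "user",
		Database: "db",

		PoolSize: 10,
		// Retire connections after some age, so a connection stuck
		// in a bad state cannot poison the pool forever.
		MaxConnAge: 5 * time.Minute,
		// Reap idle connections periodically for the same reason.
		IdleTimeout:        time.Minute,
		IdleCheckFrequency: time.Minute,
	})
	defer db.Close()
}
```

With MaxConnAge set, a broken connection is dropped at the latest when it reaches that age; until then, queries that land on it will still fail.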

Here is some more evidence from our PG logs:

2021-04-26 07:21:53 UTC:user@db:[7251]:ERROR:  canceling statement due to user request
2021-04-26 07:21:53 UTC:user@db:[7251]:STATEMENT:  COMMIT
2021-04-26 07:21:53 UTC:user@db:[7251]:ERROR:  current transaction is aborted, commands ignored until end of transaction block
2021-04-26 07:21:53 UTC:user@db:[7251]:STATEMENT:  INSERT INTO ... RETURNING *
2021-04-26 07:21:53 UTC:user@db:[7251]:ERROR:  current transaction is aborted, commands ignored until end of transaction block

7251 is the "guilty" process ID (%p); after this block it returns current transaction is aborted every time.

@vmihailenco

I produced a ready-to-run program reproducing this issue.

Hope this helps in correcting the bug.

@Barben360 it is not clear what makes you think that program reproduces the problem. I just see a bunch of errors:

  • context deadline exceeded here and there - because you cancel context
  • tx.Rollback failed: pg: Conn is in a bad state - does not indicate a problem

What else am I missing?

@vmihailenco we finally found the problem. We believe the issue is related to the PG cancel request.

With enough bad luck (or traffic/latency), the statement that gets cancelled is not the one causing the slowdown but one that is sent afterwards. If the statement getting cancelled is a COMMIT or a ROLLBACK, the connection goes back to the pool and every later use of it fails instantly with current transaction is aborted.

@AnatolyRugalev perhaps you can provide some extra information on how we patched this in our fork; not sure if it is PR-worthy, but I am sure sharing more information on the problem will help.

@tbarbugli thanks - that makes sense. What do you think about #1885?

> @Barben360 it is not clear what makes you think that program reproduces the problem. I just see a bunch of errors:
>
>   • context deadline exceeded here and there - because you cancel context
>   • tx.Rollback failed: pg: Conn is in a bad state - does not indicate a problem
>
> What else am I missing?

Actually, the errors displayed in the log are only those that occur on non-cancelled requests; errors from cancelled requests (which are legitimate) are hidden. This shows that cancelling queries also has an impact on later, uncancelled queries.

Hey @vmihailenco

As @tbarbugli described, we experienced a problem where, in high-concurrency situations, cancelRequest cancels the wrong query (COMMIT/ROLLBACK), which breaks the connection until pg recycles it. The issue that @Barben360 raised here probably has a different cause.

I did some tests and found that @Barben360's test project indeed triggers some errors that we can't see in our pretty old forked version (v10.7.1 plus some backports). Smells like a regression to me, although it's unrelated to our issue.

This is how we solved it in our fork:

// closeIfError prevents putting broken transaction-bound connection back into connection pool.
func (tx *Tx) closeIfError(ctx context.Context, err error, op string) error {
	if err != nil {
		sticky := tx.db.pool.(*pool.StickyConnPool)
		sticky.SetConnError(err) // <- we had to introduce this in order for `Reset` to work
		_ = sticky.Reset(ctx)
		err = fmt.Errorf("%w (connection closed on %s)", err, op)
	}
	return err
}

func (tx *Tx) CommitContext(ctx context.Context) error {
	ctx = internal.UndoContext(ctx)
	_, err := tx.ExecContext(ctx, "COMMIT")
	err = tx.closeIfError(ctx, err, "COMMIT")
	tx.close()
	return err
}

func (tx *Tx) RollbackContext(ctx context.Context) error {
	ctx = internal.UndoContext(ctx)
	_, err := tx.ExecContext(ctx, "ROLLBACK")
	err = tx.closeIfError(ctx, err, "ROLLBACK")
	tx.close()
	return err
}

We have a test suite which reproduces this problem pretty consistently; I will backport your commit to see if it mitigates the issue on our end.

Okay, please try v10.9.3. Your fix looks good, but I think it should be applied more generally...