porsager/postgres

ECONNRESET errors after a while

karlhorky opened this issue · 17 comments

Hi @porsager !

Similar to #43, I'm encountering ECONNRESET errors with a PostgreSQL service on Render.com, but with a very low number of connections:

Apr 27 10:22:53 PM  node:internal/process/promises:246
Apr 27 10:22:53 PM            triggerUncaughtException(err, true /* fromPromise */);
Apr 27 10:22:53 PM            ^
Apr 27 10:22:53 PM  
Apr 27 10:22:53 PM  Error: read ECONNRESET    at TCP.onStreamRead (node:internal/stream_base_commons:211:20) {
Apr 27 10:22:53 PM    errno: -104,
Apr 27 10:22:53 PM    code: 'ECONNRESET',
Apr 27 10:22:53 PM    syscall: 'read'
Apr 27 10:22:53 PM  }
Apr 27 10:22:54 PM  error Command failed with exit code 1.

It seems like after a certain amount of time, the connection with PostgreSQL goes into a state where the ECONNRESET errors start happening. Once it's in that state, all queries fail for a while (a few minutes?). But then it eventually "wakes up" again (I guess Postgres.js eventually figures it out and reconnects at that point).

I'm using postgres@2.0.0-beta.5 - I'm assuming this beta version isn't unstable enough to cause something like this...? I never had any issues like this with Heroku's PostgreSQL service.

I've also asked for assistance from the Render.com support team over here: https://community.render.com/t/econnreset-with-node-js-and-postgresql/1498

If the connection is being terminated after a while of inactivity, maybe another option would be to try the idle_timeout option...

Seems like Render.com is still on PostgreSQL 11.7 (maybe to be upgraded in the summer), although I guess that would not be the cause...

Ok, some updates:

  1. connect_timeout: 10 had no effect
  2. idle_timeout: 2 has the service working again now 🤯

So I guess Render.com terminates idle PostgreSQL connections on the server side... I'll ask about this on the community page.
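
For reference, a rough sketch of what that configuration looks like (the DATABASE_URL environment variable is an assumption; idle_timeout and connect_timeout are documented Postgres.js options, both in seconds):

const postgres = require('postgres')

// Close idle connections after 2 seconds so the server side never gets
// the chance to terminate them first; give up connecting after 10 seconds.
const sql = postgres(process.env.DATABASE_URL, {
  idle_timeout: 2,
  connect_timeout: 10
})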

Maybe a good thing to add to the Postgres.js docs too, in case this is a more common configuration...

Hi Karl.

Yeah, this is probably a common scenario.

Could be included with the part you already contributed about lambdas etc. 🙂 https://github.com/porsager/postgres#idle-timeout

Ok, I've added a quick note there then: #180

I guess I'll close this issue for now, since there's nothing else that's really actionable unless there's more info from the Render.com support team.

So @joshi4 from the Render.com team helped debug the issue further in the topic I opened on the community site.

Copying here for visibility:

...the server will close the connection when it hasn't seen any activity for approximately 2 hours and 11 minutes. That number comes from a combination of the following kernel settings:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200 

If there’s no activity on a TCP connection for 2 hours, the server will send 9 keepalive probes, one every 75 seconds. If the connection peer (client) doesn’t reply to any of the probes, the server will terminate the connection (7200 s + 9 × 75 s = 7875 s, i.e. roughly 2 hours and 11 minutes). It seems that replying to the empty keepalive ACK is optional in the spec.

To test this out, I issued requests to the db 3 hours apart and noticed a bunch of ECONNRESET in my logs.

May 07 19:49:20Z test-conn-reset worker-95lpp Error Error: read ECONNRESET    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
May 07 19:49:20Z test-conn-reset worker-95lpp   errno: -104,
May 07 19:49:20Z test-conn-reset worker-95lpp }
May 07 19:49:20Z test-conn-reset worker-95lpp Error Error: read ECONNRESET    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
May 07 19:49:20Z test-conn-reset worker-95lpp   code: 'ECONNRESET',
May 07 19:49:20Z test-conn-reset worker-95lpp   errno: -104,
May 07 19:49:20Z test-conn-reset worker-95lpp   code: 'ECONNRESET',
May 07 19:49:20Z test-conn-reset worker-95lpp }
May 07 19:49:20Z test-conn-reset worker-95lpp Error Error: read ECONNRESET    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
May 07 19:49:20Z test-conn-reset worker-95lpp   syscall: 'read'
May 07 19:49:20Z test-conn-reset worker-95lpp   errno: -104,
May 07 19:49:20Z test-conn-reset worker-95lpp   code: 'ECONNRESET',
May 07 19:49:20Z test-conn-reset worker-95lpp   errno: -104,
May 07 19:49:20Z test-conn-reset worker-95lpp   syscall: 'read'
May 07 19:49:20Z test-conn-reset worker-95lpp   syscall: 'read'
May 07 19:49:20Z test-conn-reset worker-95lpp Error Error: read ECONNRESET    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {

I think the best way to handle this scenario is to do what you are already doing, which is to set an idle timeout. Setting the idle timeout to something less than 2 hours and 11 minutes should help you avoid this issue.

What I ended up with was this configuration (2 hour idle timeout), which seems to work so far:

const sql = postgres({ idle_timeout: 7200 });

Updates:

  1. Ok, it just failed again with idle_timeout: 7200 - trying a new, lower value (idle_timeout: 6000)
  2. idle_timeout: 6000 is also intermittently failing, will try 3000

Update:

Just had another issue with 3000, going to reduce this to a low number of seconds like idle_timeout: 100.

It's unfortunate that we cannot keep a persistent connection to the database with Render.com PostgreSQL.

Following up from Karl's original issue.

I wanted to summarize a few of my observations based on the code in two test repositories (postgresjs-conn-test and conn-reset-go):

I created a database in Render's Europe region and deployed both the above repos as background workers.

For the postgresjs-conn-test repo, I was able to consistently trigger the ECONNRESET log by setting the idle timeout to 1 hour and querying the db every 50 minutes. I observed that I got the ECONNRESET on every other run, i.e. once every 100 minutes.

I have yet to see a similar error in the conn-reset-go service. Both apps are pointing at the exact same database and make the same number of requests in the same time intervals.

I also tested the same setup in Render's US region and I haven't been able to replicate the issue there across both Node and Go apps.

Putting everything together, it seems to me that the most robust solution is to handle this on the client side. I notice that Go's standard library does account for such behavior here:

// QueryContext executes a query that returns rows, typically a SELECT.
// The args are for any placeholder parameters in the query.
func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}

This issue might also be related to: #121

Note: I currently work at Render.

I've just released beta.6 which simply sets keepalive from the client to 60 seconds (a postgres db connection is fairly high bandwidth already, so I feel 60 seconds is a good start). I'm curious if something as simple as this will do? Would you mind giving it a go?
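
For context, a client-side keepalive like this corresponds roughly to Node's net.Socket#setKeepAlive; a minimal illustration on a raw socket (the host and port are placeholders, and exactly how Postgres.js wires this up internally is an assumption):

const net = require('net')

// After 60 seconds of inactivity the client starts sending TCP keepalive
// probes, so a silently dropped connection is detected early instead of
// only surfacing as ECONNRESET on the next query.
const socket = net.connect({ host: 'db.example.com', port: 5432 })
socket.setKeepAlive(true, 60000) // delay before the first probe, in milliseconds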

If not we could look into doing like the go library, which appears to retry if this connection failure happens.


Nice! Going to deploy a new version with beta.6 then. Am I understanding correctly that I can remove the idle_timeout config?

It depends on the idle handling on the server you connect to. If there's a load balancer in between (e.g. pgbouncer) it might not respect TCP keepalives, but only care about the timeout after no queries (I don't know, but you could most likely find that information somewhere).

If my understanding of the issue is correct, what happens is that the server closes the TCP socket abruptly, so the connected Postgres.js client doesn't know it's gone. Then, when the next query is executed, an error is returned when trying to write because the connection is closed, and hence the query errors.

Now if the server has a rule that it always closes a connection after a specific amount of time (regardless of idle time) there's not much else we can do but try to reconnect on failure. That would require some logic around backoff + retries, but it's definitely doable.
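
A rough sketch of what such retry logic could look like at the application level (this helper is hypothetical, not a Postgres.js feature, and it assumes the dropped connection surfaces as an error with code 'ECONNRESET'):

// Hypothetical wrapper: retry a query a few times with exponential backoff
// when the failure looks like a dropped connection.
async function queryWithRetry(run, retries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await run()
    } catch (err) {
      if (err.code !== 'ECONNRESET' || attempt >= retries) throw err
      // back off: 100 ms, 200 ms, 400 ms, ...
      await new Promise(resolve => setTimeout(resolve, 100 * 2 ** attempt))
    }
  }
}

// Usage, assuming sql is a Postgres.js instance:
// const users = await queryWithRetry(() => sql`select * from users`)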

Ok, beta.6 is now deployed, and the configuration for idle_timeout has been removed - let's see!

Haven't had any issues since the deploy... I'll keep watching!

That's great @karlhorky - Thanks for testing!

Still no incidents since the deploy (21 days now)

Maybe I'll wait for this feature to land in non-beta v2 and then we can close this.