luckyframework/avram

Lost connections are not retried

Closed this issue · 5 comments

avram/src/avram/database.cr

Lines 188 to 201 in 0b9797b

private def with_connection(&)
key = object_id
connections[key] ||= db.checkout
connection = connections[key]
begin
yield connection
ensure
if !connection._avram_in_transaction?
connection.release
connections.delete(key)
end
end
end

In the lines of code above, Avram::Database#with_connection helper checks out a connection and checks it in after use. But if the connection is lost/broken, the query is never retried, leading to DB::ConnectionLost errors:

(DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'exec_query'
  from /tmp/lucky/lib/avram/src/avram/queryable.cr:291:7 in 'results'
  from /usr/share/crystal/src/indexable.cr:749:5 in 'first?'
  from /tmp/lucky/lib/shield/src/presets/user.cr:34:3 in 'run_operation'
  from /tmp/lucky/src/actions/current_login/create.cr:4:3 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/renderable.cr:130:16 in 'perform_action'
  from /tmp/lucky/lib/lucky/src/lucky/route_handler.cr:10:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /usr/share/crystal/src/time.cr:357:5 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/force_ssl_handler.cr:36:8 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/request_id_handler.cr:24:5 in 'call'
  from /usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'handle_client'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

This affects multiple apps of mine in a staging environment. I use a serverless database provide who suspends the server after a period of inactivity (it boots up again after a connection attempt is made). This breaks the connection if no interaction is made to the database for a while.

I've applied a patch and tested for about 3 days now; no errors yet. I used to get those errors almost daily from both the app server and Mel workers attempting to reach the database. Here is a trace from Mel:

(DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'run'
  from /tmp/lucky/lib/mel/src/worker/instant_task.cr:4:3 in '->'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

My patch looks like this:

  private def with_connection(&)
    key = object_id

    db.retry do # Retries lost connections according to the `retry_*` parameters in the connection URL.
      connections[key] ||= db.checkout
      connection = connections[key]

      begin
        yield connection
      ensure
        if !connection._avram_in_transaction?
          connection.release
          connections.delete(key)
        end
      end
    end
  end

Or maybe it's something else I've missed?

I've also noticed this in my apps, but haven't been able to manually replicate to narrow it down. I guess I thought that crystal-db just did the retry stuff internally, but your patch makes sense.

I'm not sure what edge cases we'd have to consider here, but I'm down to give it a shot.

I guess I thought that crystal-db just did the retry stuff internally,...

From Brian Cardiff's comments here and here, it does until you do an explicit call to #checkout.

I'm not sure what edge cases we'd have to consider here...

I have a PR that passes all existing tests, but I'm not sure how to add a test case for this one. I should probably wait for a few more days to see if the issue does not recur in my staging apps to be really sure.

Ah, ok. It looks like your patch is probably good then.

Oh, what happens if you don't have any retry_ params set? Does it still do a retry, or is it just ignored in that case?

Oh, what happens if you don't have any retry_ params set?

The default retry of 1 is used. If exceeded, DB::PoolRetryAttemptsExceeded exception is raised:

https://github.com/crystal-lang/crystal-db/blob/06df272740fb9141050681ae916c465cc8e70584/src/db/pool.cr#L239

In all cases, the app should work OK if connection is never lost; retry kicks in only after all connections in the pool are broken. I've just rerun the script/test with retry_attempts=0 for TestDatabase and everything checks out.

I've also not received any error report from staging yet, so seems we're good. I'm going to wait for 1 more day.

That's great news! Thanks for investigating this.