zombocom/rack-timeout

Broken ActiveRecord connections after a timeout

collimarco opened this issue · 4 comments

Hello,

I have read many similar issues, but they were all about old versions, so I would like to report what happened to me and maybe get some advice on how to prevent this in the future.

We have a Rails 6 app that runs on Heroku. We use the latest version of this gem and everything went smoothly for months. Today, however, something bad happened: after a timeout, our service went completely down for about 1 hour, until we manually restarted the dyno.

A method (show) that simply looks up an item by ID in the database (a very common query) unexpectedly raised this error:

Rack::Timeout::RequestTimeoutException: Request waited 7ms, then ran for longer than 25000ms

After that, all the queries to the database started returning this error:

ActiveRecord::StatementInvalid: PG::DuplicatePstatement: ERROR: prepared statement "a1" already exists

Basically, without a manual restart, the application would still be down.

This is scary because Heroku doesn't restart the dyno automatically in this case: the Rails process has not crashed, it simply returns 500 to every request because the connection to the database is broken.

  1. Is there any way to fix this inside this gem?
  2. Should we report this to Rails / ActiveRecord? Why was the connection not healed automatically after some time?
  3. Is there any way to tell Heroku to restart the dyno in this case (when it detects high rates of 500 errors)?

This is more likely a Heroku issue that you need to bring up with Heroku Support. They may have better logs to identify the issue. Maybe start a Stack Overflow thread as well to get a wider response from the online community.

Depending on your plan with Heroku, you may be able to get them to assist with restarting the dyno. Alternatively, you can create your own trigger to restart dynos daily: depending on your plan, they may have moved your DB during that period, which could have caused the issue. I have experienced this before, and as we were on a Hobby plan they did not help us, so we added our own daily restart task so that restarts happen in our timezone and not theirs. (Add Heroku Scheduler and add a task "heroku ps:restart --app xxxxx")

rack-timeout threw the error Rack::Timeout::RequestTimeoutException, which confirms it caught the timeout. However, your worker thread, or some other Puma or background task, may have kept retrying or trying to complete the work, but because of the DB connectivity problem it could not commit the transaction or query.

@collimarco Please consider the options that have been suggested in #49 (statement_timeout, RACK_TIMEOUT_TERM_ON_TIMEOUT).

Without term on timeout, not only connections but almost any code can be left completely broken.
I described the details in #169 and am proposing to enable term on timeout by default.
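
To illustrate how an exception raised at an arbitrary point can leave a connection permanently broken, here is a minimal sketch. It is a simplified model, not ActiveRecord or PostgreSQL code: FakeServer, FakeConnection, SimulatedTimeout and the interrupt flag are all hypothetical names, and the assumption is that the timeout lands after the server has created the statement but before the client has recorded it.

```ruby
# Simplified model of the failure; none of these classes exist in Rails or pg.
class DuplicateStatementError < StandardError; end
class SimulatedTimeout < StandardError; end

# Stands in for the database session, which remembers prepared statements by name.
class FakeServer
  def initialize
    @statements = {}
  end

  def prepare(name, sql)
    if @statements.key?(name)
      raise DuplicateStatementError, %(prepared statement "#{name}" already exists)
    end
    @statements[name] = sql
  end
end

# Stands in for the client-side adapter, which keeps its own counter and cache.
class FakeConnection
  def initialize(server)
    @server = server
    @counter = 0
    @cache = {}
  end

  # interrupt: true simulates a timeout exception arriving after the server
  # has created the statement but before the client has recorded it.
  def prepare(sql, interrupt: false)
    return @cache[sql] if @cache.key?(sql)

    name = "a#{@counter + 1}"
    @server.prepare(name, sql)
    raise SimulatedTimeout, "request ran too long" if interrupt

    @counter += 1
    @cache[sql] = name
  end
end

def demo
  conn = FakeConnection.new(FakeServer.new)
  sql = "SELECT * FROM items WHERE id = $1"
  results = []

  begin
    conn.prepare(sql, interrupt: true)
  rescue SimulatedTimeout => e
    results << e.class.name
  end

  # The client never advanced its counter, so every retry asks for "a1" again.
  2.times do
    begin
      conn.prepare(sql)
    rescue DuplicateStatementError => e
      results << e.message
    end
  end

  results
end
```

In this model the client and server disagree about which statement names are taken, and nothing short of discarding the connection (or the whole process) brings them back in sync.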

Is there any way to fix this inside this gem?

No, but you can use the "TERM on timeout" feature.

Should we report this to Rails / ActiveRecord? Why was the connection not healed automatically after some time?

It's been reported, but it's not considered something they can fix. You can opt out of using prepared statements (turning them off is actually recommended by Heroku), which avoids this particular issue, but you can still run into cases where the connection enters a bad state.
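
As a rough sketch of what that opt-out could look like, together with the statement_timeout option mentioned earlier in the thread: the hash keys below mirror the prepared_statements and variables settings that Rails accepts for the PostgreSQL adapter in database.yml. The helper name hardened_db_config and the 5 second margin are assumptions for illustration, and how the hash reaches your actual configuration is left out.

```ruby
# Hypothetical helper: returns a copy of a database config hash with
# prepared statements disabled and a statement_timeout set below the
# rack-timeout limit, so the database cancels a slow query first.
def hardened_db_config(base, rack_timeout_s:, margin_s: 5)
  statement_timeout_ms = (rack_timeout_s - margin_s) * 1000
  unless statement_timeout_ms.positive?
    raise ArgumentError, "margin must be smaller than the rack timeout"
  end

  # Keep any session variables that were already configured.
  variables = (base[:variables] || {}).merge(statement_timeout: statement_timeout_ms)
  base.merge(prepared_statements: false, variables: variables)
end

# 25 seconds matches the timeout reported in the original error message.
config = hardened_db_config({ adapter: "postgresql", pool: 5 }, rack_timeout_s: 25)
```

The idea behind the margin is that a query cancelled by the database raises at a well-defined point, while a rack-timeout exception can arrive anywhere.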

Is there any way to tell Heroku to restart the dyno in this case (when it detects high rates of 500 errors)?

No. There are countless causes for a 500, and while restarting might work in some cases, the solution most apps use is to monitor the 500 rate and page a human to address the root cause of the 500s.
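
For a sense of what monitoring the 500 rate means in practice, here is a minimal sliding-window sketch. ErrorRateMonitor and its thresholds are hypothetical and not part of rack-timeout or Heroku; most apps would rely on an existing monitoring service rather than hand-rolled code like this.

```ruby
# Hypothetical sliding-window monitor for the share of 5xx responses.
class ErrorRateMonitor
  def initialize(window_s: 60, threshold: 0.5, min_requests: 20)
    @window_s = window_s
    @threshold = threshold
    @min_requests = min_requests
    @samples = [] # [timestamp, error?] pairs, oldest first
  end

  # Records one response status and returns true when a human should be paged.
  def record(status, now: Time.now.to_f)
    @samples << [now, status >= 500]
    # Drop samples that have fallen out of the time window.
    @samples.shift while @samples.first[0] <= now - @window_s
    alert?
  end

  def error_rate
    return 0.0 if @samples.empty?

    @samples.count { |_, error| error }.to_f / @samples.size
  end

  # Requires a minimum number of requests so a single failure does not page anyone.
  def alert?
    @samples.size >= @min_requests && error_rate >= @threshold
  end
end
```

A Rack middleware could call record with each response status and notify an on-call channel whenever it returns true.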