que-rb/que

Locker doesn't release locks for processed jobs under some circumstances

sidonath opened this issue · 6 comments

Hi 👋

I'm noticing an unexpected accumulation of locks by our worker processes. It seems that under some circumstances, advisory locks are not being released by the locker. Here's a query that hopefully illustrates the problem:

with
  pid_locks as
    (select pl.pid, count(*) from pg_locks pl inner join que_lockers ql on pl.pid = ql.pid group by pl.pid),
  pid_locked_jobs as
    (select pl.pid, count(*) from pg_locks pl inner join que_jobs qj on pl.objid = qj.id group by pl.pid)
select ql.pid, worker_count, queues, listening, pid_locks.count as active_locks, pid_locked_jobs.count as active_jobs
from que_lockers ql
left join pid_locks on ql.pid = pid_locks.pid
left join pid_locked_jobs on pid_locked_jobs.pid = ql.pid;

(Our job IDs are still in the 32-bit range, so I didn't need to use classid in the pid_locked_jobs join.)
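For anyone whose job IDs have already outgrown 32 bits: if I understand pg_locks correctly, a single bigint advisory key is reported with its high 32 bits in classid and its low 32 bits in objid, so the pid_locked_jobs CTE would presumably need a join along these lines instead (untested sketch):

  -- assumption: Que locks the bigint job ID, so classid holds the high 32 bits
  -- and objid the low 32 bits of that key
  pid_locked_jobs as
    (select pl.pid, count(*)
     from pg_locks pl
     inner join que_jobs qj
       on ((pl.classid::bigint << 32) | pl.objid::bigint) = qj.id  -- reassemble the 64-bit key
     where pl.locktype = 'advisory'                                -- only advisory locks
     group by pl.pid)

Anyway, here's what the original query returns on our production database: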

  pid  | worker_count |  queues   | listening | active_locks | active_jobs
-------+--------------+-----------+-----------+--------------+-------------
 15649 |           12 | {default} | t         |         4255 |           3
 19095 |           12 | {default} | t         |          673 |           4
 16646 |           12 | {default} | t         |         1148 |           6
 31188 |           12 | {default} | t         |         7909 |          10
(4 rows)

As you can see, we currently have 4 Que processes running on Heroku (one per dyno), each with 12 workers. Our throughput averages 400 jobs/minute, occasionally peaking at 800 jobs/minute. Normally our queue is almost empty.

However, all locker processes are holding hundreds or thousands of unreleased locks. I've seen the total number of locks exceed 50k, and I suspect it led to the "out of shared memory" errors we experienced earlier this week. I can see this problem locally as well, but I haven't yet narrowed it down to a simple way to reproduce it.
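For context, the "out of shared memory" errors presumably come from filling up the shared lock table, which as far as I understand holds roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) entries. Something like this gives a rough comparison (the column aliases are just for illustration):

  -- current advisory lock count vs. the approximate size of the shared lock table
  select
    (select count(*) from pg_locks where locktype = 'advisory') as advisory_locks,
    (select setting::int from pg_settings where name = 'max_locks_per_transaction')
      * ((select setting::int from pg_settings where name = 'max_connections')
         + (select setting::int from pg_settings where name = 'max_prepared_transactions'))
      as approx_lock_table_capacity;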

Has anybody seen similar behavior?

Any ideas on what to explore/test/log would be appreciated!

Stack details:

  • Que 1.0.0.beta3 (using the ActiveJob adapter)
  • Rails 5.2.4.3
  • PostgreSQL 12
  • We have a custom job middleware for reporting job metrics, but I can see the problem happening even with the job middleware disabled

Hey @sidonath!

Are we talking about deadlocks under heavy load, or just unreleased locks while the workers are otherwise fine?

Hi @siegy22,

In the case I described above I don't see deadlocks—workers continue processing the jobs just fine. In fact, jobs that appear to be locked have already been deleted from que_jobs.
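In case it helps anyone debugging this, a rough way to count these orphaned locks (advisory locks held by Que lockers that no longer match any row in que_jobs, again assuming job IDs still fit in 32 bits so objid alone identifies the job) would be something like:

  -- advisory locks held by Que locker backends with no matching que_jobs row
  select pl.pid, count(*) as orphaned_locks
  from pg_locks pl
  inner join que_lockers ql on ql.pid = pl.pid
  left join que_jobs qj on qj.id = pl.objid::bigint
  where pl.locktype = 'advisory'
    and qj.id is null
  group by pl.pid;

This should roughly correspond to the gap between active_locks and active_jobs in the table from my first comment.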

Hi, we just hit the same problem. It happens on PG 12 (Heroku), not on earlier versions...

Ruby 2.6.6 / Rails 6.0.3.2
Que 1.0.0.beta4 (using the ActiveJob adapter)

FWIW, we haven't seen this problem for a couple of weeks now. We're still on PG 12 (Heroku), but the number of locks is very small: barely one lock per worker as I'm writing this. It might have been a platform issue.

Experiencing the same problem. @sidonath you apparently stopped seeing "out of memory" errors because Heroku restarts dynos/processes often (every 24 hours?), which clears all the locks. On another PaaS provider (Scalingo) I see this error under minimal load with:

  • PostgreSQL 12
  • Rails 5.2.4.4
  • Ruby 2.6.6
  • Que 1.0.0.beta4

This is a really big problem for me and I'm considering migrating to Sidekiq :/

@kofronpi

you apparently stopped seeing "out of memory" errors because Heroku restarts dynos/processes often (every 24 hours?), which clears all the locks

Heroku does that, indeed, but it was also doing it back when we observed the locks. Since that part didn't change, I don't think it explains why we're no longer seeing the issue.

The only thing we changed between observing the issue and noticing it had stopped appearing is the number of Heroku dynos running Que at peak times: we decreased it from 8 to 6. I don't believe there's a causal link between those two facts.

We're currently on PG 12.5, and as I'm writing this we have only 4 locks across 6 Heroku dynos running Que (the oldest one is already 21 hours old).

The rest of the stack looks exactly like yours.