seomoz/qless

`Qless::Queue#peek` not returning correct results for recurring jobs

Opened this issue · 5 comments

As of 515cdc8 a queue can get in an unexpected state leading to recurring jobs not being returned properly by #peek when no count argument is supplied. At this point I've traced it down to the Lua scripts by working back through the ruby code, I'll post more details as I find them.

[14] pry(main)> queue.peek
=> nil
[15] pry(main)> queue.peek(1)
=> []
[16] pry(main)> queue.peek(2)
=> [<Qless::Job SyncAccount (f5aa9babcd1e420ea38fe537812db03a-9 / power_school_jobs / waiting)>]
[17] pry(main)> queue.peek(3)
=> [<Qless::Job SyncAccount (f5aa9babcd1e420ea38fe537812db03a-9 / power_school_jobs / waiting)>, <Qless::Job SyncAccount (ed6f4e5207454cf3972459990b8eaad3-15 / power_school_jobs / waiting)>]
[18] pry(main)>

Additional data from invoking the Lua 'directly'

18] pry(main)> client.call('peek', 'power_school_jobs', 1)
=> "{}"
[19] pry(main)> client.call('peek', 'power_school_jobs')
Qless::LuaScriptError: user_script:1105: Peek(): Arg "count" missing or not a number: nil
from /home/deploy/pschool/.bundle/gems/ruby/2.1.0/bundler/gems/qless-515cdc862a82/lib/qless/lua_script.rb:30:in `rescue in call'
[20] pry(main)> client.call('peek', 'power_school_jobs', 20)
=> "[{\"jid\":\"f5aa9babcd1e420ea38fe537812db03a-9\",\"retries\":5,\"data\": ...snip..."

Turns out there was a lingering member in the zset ql:q:<queue_name>-work, I'm not sure how it remained after the job was removed from the queue or completed. I'll be investigating the housecleaning functions to see where/how this might have happened.

So far I'm not seeing anything that would point at a root cause, the only set that seems to have had this pollution is the work set.

I just discovered that the completed set has invalid values too, sadly I haven't taken good enough notes to know if there is any overlap between the extras in the work set and ql:completed. I'm going to leave this issue open while I gather more data, this problem has occurred a couple times in the past.

We had more issues with one of the sorted sets being in an inconsistent state this morning, this time it was the locks set on the same queue but the inconsistent state started occurring right after we received notification of an exception of the type Qless::Job::CantCompleteError.