release_stale_locks! might never release a stale lock

Question

release_stale_locks! might never release a stale lock

mfischer-zd opened this issue 10 years ago · 5 comments

The release_stale_locks! function might never actually release stale locks under certain conditions. This is because it's possible for a deadlock to occur in simple_mutex (called by the above function) because the TTL gets reset at each GETSET operation. This is expected per http://redis.io/commands/expire:

The timeout is cleared only when the key is removed using the DEL command or overwritten using the SET or GETSET commands.

    def simple_mutex(key_name, expires = nil)
      key_name = namespaced_key(key_name) if key_name.kind_of? Symbol
      token = @redis.getset(key_name, API_VERSION)  # TTL cleared here if set below

      return false unless token.nil?
      @redis.expire(key_name, expires) unless expires.nil?

      begin
        yield token
      ensure
        @redis.del(key_name)
      end
    end

The expire above is useless because it gets cleared as soon as another candidate attempts to acquire the mutex. Here's an observation of the effect from the Redis CLI; commands were run immediately in order (i.e. there wasn't a 9-second wait between queries):

redis:6379> EXPIRE SEMAPHORE:task:release_locks 10
(integer) 1
redis:6379> TTL SEMAPHORE:task:release_locks
(integer) 9
redis:6379> TTL SEMAPHORE:task:release_locks
(integer) -1

In theory, the ensure clause should prevent this from happening (the DEL operation is supposed to release the simple mutex), but there are conditions like an unexpected SIGKILL where this clause might not be run.

apeschel commented 9 years ago

bump

Answer 1 · 2014-08-05T21:06:31.000Z

A workaround would be to ensure that two clients don't call release_stale_locks! within 10 seconds of each other, but that's not always possible.

Answer 2 · 2015-01-11T16:50:42.000Z

Thanks mfischer-zd for the report! I'll look into it.

Answer 3 · 2015-06-15T19:23:43.000Z

Did you find something?

Answer 4 · 2016-04-17T12:08:12.000Z

Finally I had some time to look into it. I used the Redis' SETNX mechanism to ensure the stale lock gets cleaned up correctly.

Could someone sanity check the new code, and check if it fixes this problem? I'll carve out a new gem version once that's done.