ReplLockManager does not prolongate autoUnlockTime if there is no last known leader

Question

kaxap opened this issue 5 years ago · 3 comments

This happens at cluster start-up, when the nodes are started one by one.
For example, for a cluster of 3 nodes:

1. The first node is up. There is no leader (as expected), so the _autoAcquireThread does not prolong the lock's timestamp.
2. Wait for autoUnlockTime seconds.
3. Start the second node.
4. _ReplLockManagerImpl's acquire function is called on both nodes.
5. Both nodes see an old timestamp for the lock (since it has not been prolonged at this point).
6. Both nodes acquire the lock.
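
The sequence above can be illustrated with a small toy model. This is only a sketch of the timestamp-expiry logic as described in this issue, not pysyncobj's actual code: `ToyLockTable`, its `acquire` signature, and the node names are hypothetical, and the times are plain numbers rather than wall-clock readings.

```python
class ToyLockTable:
    """Toy stand-in for a replicated lock table (hypothetical, not pysyncobj)."""

    def __init__(self, auto_unlock_time):
        self.auto_unlock_time = auto_unlock_time
        self.locks = {}  # lock_id -> (owner, last_refresh_time)

    def acquire(self, lock_id, client_id, current_time):
        existing = self.locks.get(lock_id)
        # Treat the lock as abandoned once its timestamp is too old.
        if existing is not None and current_time - existing[1] > self.auto_unlock_time:
            existing = None
        if existing is None or existing[0] == client_id:
            self.locks[lock_id] = (client_id, current_time)
            return True
        return False


table = ToyLockTable(auto_unlock_time=5)

# Node 1 issues its request at t=0. With no leader, nothing refreshes
# the timestamp afterwards.
node1_got_lock = table.acquire("testLockName", "node1", current_time=0)

# Node 2 starts 6 seconds later. The entry looks expired, so it is granted too.
node2_got_lock = table.acquire("testLockName", "node2", current_time=6)

print(node1_got_lock, node2_got_lock)  # True True
```

Both calls return True because the only thing protecting node 1's lock is a timestamp that was never refreshed.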

Steps to reproduce

Run the following script (run.py) on each node:

import logging
import os

import time
from pysyncobj import SyncObj, SyncObjException
from pysyncobj.batteries import ReplLockManager

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
    level=logging.DEBUG,
    datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger("raft-test")

me = os.getenv("R_ME") 
others = os.getenv("R_OTHERS").split(",") 

logger.info("R_ME=" + me)
logger.info("R_OTHERS=" + str(others))

lockManager = ReplLockManager(autoUnlockTime=5)
syncObj = SyncObj(me, others, consumers=[lockManager])

logger.info("Waiting for the lock")
while True:
  try:
    if lockManager.tryAcquire('testLockName', sync=True):
      logger.info("Acquired the lock")
      time.sleep(15)
      syncObj.destroy()
      time.sleep(1)
      quit()
  except SyncObjException as e:
    logger.error(f"SyncObjException: '{e}'")

1. Run the first node: R_ME=127.0.0.1:5555 R_OTHERS=127.0.0.1:5556,127.0.0.1:5557 python run.py
2. Wait 6+ seconds.
3. Run the second node: R_ME=127.0.0.1:5556 R_OTHERS=127.0.0.1:5555,127.0.0.1:5557 python run.py

Expected behaviour

The lock is acquired by only one of the nodes.

Actual behaviour

The lock is acquired by both nodes, within the same second (00:39:19):
Node 1 logs:

2019-09-01 00:39:13 INFO     R_ME=127.0.0.1:5555
2019-09-01 00:39:13 INFO     R_OTHERS=['127.0.0.1:5556', '127.0.0.1:5557']
2019-09-01 00:39:13 INFO     Waiting for the lock
2019-09-01 00:39:19 INFO     Acquired the lock

Node 2 logs:

2019-09-01 00:39:19 INFO     R_ME=127.0.0.1:5556
2019-09-01 00:39:19 INFO     R_OTHERS=['127.0.0.1:5555', '127.0.0.1:5557']
2019-09-01 00:39:19 INFO     Waiting for the lock
2019-09-01 00:39:19 INFO     Acquired the lock

Answer 1 · 2020-03-24T14:01:29.000Z

_ReplLockManagerImpl should return a tuple: the acquire result and the acquire time. You should check this time manually in two places:

- inside tryAcquire, similar to your current fix;
- inside the callback, in case tryAcquire is called in async mode.
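
A minimal sketch of what this suggestion might look like, using a toy lock table rather than pysyncobj itself. Everything here is an assumption made for illustration: the names `ToyLockTable`, `grant_is_fresh`, `try_acquire_sync` and `checked_callback`, the single-argument callback, and the freshness threshold of half the auto-unlock window.

```python
class ToyLockTable:
    """Toy stand-in for a replicated lock table (hypothetical, not pysyncobj)."""

    def __init__(self, auto_unlock_time):
        self.auto_unlock_time = auto_unlock_time
        self.locks = {}  # lock_id -> (owner, timestamp)

    def acquire(self, lock_id, client_id, current_time):
        """Return (granted, acquire_time) instead of a bare bool."""
        existing = self.locks.get(lock_id)
        if existing is not None and current_time - existing[1] > self.auto_unlock_time:
            existing = None  # stale entry, treat as unlocked
        if existing is None or existing[0] == client_id:
            self.locks[lock_id] = (client_id, current_time)
            return True, current_time
        return False, existing[1]

    def release(self, lock_id, client_id):
        if self.locks.get(lock_id, (None, None))[0] == client_id:
            del self.locks[lock_id]


def grant_is_fresh(granted, acquire_time, now, auto_unlock_time):
    # Assumed threshold: trust a grant only within half the auto-unlock window.
    return granted and now - acquire_time <= auto_unlock_time / 2.0


def try_acquire_sync(table, lock_id, client_id, request_time, now):
    """Sync path: check the returned time before reporting success."""
    granted, acquire_time = table.acquire(lock_id, client_id, request_time)
    if granted and not grant_is_fresh(granted, acquire_time, now, table.auto_unlock_time):
        table.release(lock_id, client_id)  # give back a grant we cannot trust
        return False
    return granted


def checked_callback(user_callback, auto_unlock_time, clock):
    """Async path: wrap the user's callback with the same freshness check."""
    def on_result(result):
        granted, acquire_time = result
        user_callback(grant_is_fresh(granted, acquire_time, clock(), auto_unlock_time))
    return on_result


table = ToyLockTable(auto_unlock_time=5)
# Node 1 asked at t=0 but only hears back at t=6, once a leader exists.
node1_got_lock = try_acquire_sync(table, "testLockName", "node1", request_time=0, now=6)
# Node 2 asks and hears back at t=6.
node2_got_lock = try_acquire_sync(table, "testLockName", "node2", request_time=6, now=6)
print(node1_got_lock, node2_got_lock)  # False True
```

In this toy run, node 1 discards its grant because the returned acquire time is already 6 seconds old, so only node 2 ends up holding the lock.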

BTW, sorry for the late answer. If you are no longer interested, I'll fix it myself later.

Answer 2 · 2020-03-24T17:18:23.000Z

Hey, thanks for the heads-up. However, I don't quite understand how it will help with the concerns you expressed in #109 (comment).

Answer 3 · 2020-03-24T20:39:12.000Z

This is because you no longer check the replicated value inside your class (which can be replicated with a delay). You take it directly from the response. Responses are not replicated in the Raft journal; they are RPC results, so you are guaranteed to get the value returned by the function.