ReplLockManager does not prolongate autoUnlockTime if there is no last known leader
kaxap opened this issue · 3 comments
This happens at the cluster's start time, when the nodes are started one by one.
For example for a cluster of 3 nodes:
- First node is up. There is no leader (as expected). The
_autoAcquireThread
will not prolongate the lock's timestamp since there is no leader - Wait for autoUnlockTime seconds
- Start second node.
_ReplLockManagerImpl
'sacquire
function is called on both nodes- Both nodes have old timestamp for the lock (since it has not been prolongated at this point)
- Both nodes acquire the lock
Steps to reproduce
- Use the following code for a node
import logging
import os
import time
from pysyncobj import SyncObj, SyncObjException
from pysyncobj.batteries import ReplLockManager
logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
level=logging.DEBUG,
datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger("raft-test")
me = os.getenv("R_ME")
others = os.getenv("R_OTHERS").split(",")
logger.info("R_ME=" + me)
logger.info("R_OTHERS=" + str(others))
lockManager = ReplLockManager(autoUnlockTime=5)
syncObj = SyncObj(me, others, consumers=[lockManager])
logger.info("Waiting for the lock")
while True:
try:
if lockManager.tryAcquire('testLockName', sync=True):
logger.info("Acquired the lock")
time.sleep(15)
syncObj.destroy()
time.sleep(1)
quit()
except SyncObjException as e:
logger.error(f"SyncObjException: '{e}'")
- Run first node:
R_ME=127.0.0.1:5555 R_OTHERS=127.0.0.1:5556,127.0.0.1:5557 python run.py
- Wait 6+ seconds
- Run second node
R_ME=127.0.0.1:5556 R_OTHERS=127.0.0.1:5555,127.0.0.1:5557 python run.py
Expected behaviour
Lock is acquired by one of the nodes.
Actual behaviour
Lock is acquired by both nodes:
Node 1 logs:
2019-09-01 00:39:13 INFO R_ME=127.0.0.1:5555
2019-09-01 00:39:13 INFO R_OTHERS=['127.0.0.1:5556', '127.0.0.1:5557']
2019-09-01 00:39:13 INFO Waiting for the lock
2019-09-01 00:39:19 INFO Acquired the lock
Node 2 logs:
2019-09-01 00:39:19 INFO R_ME=127.0.0.1:5556
2019-09-01 00:39:19 INFO R_OTHERS=['127.0.0.1:5555', '127.0.0.1:5557']
2019-09-01 00:39:19 INFO Waiting for the lock
2019-09-01 00:39:19 INFO Acquired the lock
_ReplLockManagerImpl should return tuple - acquire result and acquire time. You should check this time manually in two places:
- inside tryAcquire, simmilar to your current fix
- inside callback, in case tryAcquire called in async mode.
BTW, Sory for long answer, if you no longer interested - I'll fix later myself.
hey, thanks for the heads up. However, I don't quite understand how it'll help with the concerns you've expressed in #109 (comment)
This is because now you don't check the replicated value inside your class (which can be replicated with a delay). You take it directly from a response. Responses are not replicated in raft journal - they are RPC, you are guaranteed to get the result returned by the function.