bakwc/PySyncObj

ReplLockManager does not prolongate autoUnlockTime if there is no last known leader

kaxap opened this issue · 3 comments

kaxap commented

This happens at the cluster's start time, when the nodes are started one by one.
For example for a cluster of 3 nodes:

  1. First node is up. There is no leader (as expected). The _autoAcquireThread will not prolongate the lock's timestamp since there is no leader
  2. Wait for autoUnlockTime seconds
  3. Start second node.
  4. _ReplLockManagerImpl's acquire function is called on both nodes
  5. Both nodes have old timestamp for the lock (since it has not been prolongated at this point)
  6. Both nodes acquire the lock

Steps to reproduce

  1. Use the following code for a node
import logging
import os

import time
from pysyncobj import SyncObj, SyncObjException
from pysyncobj.batteries import ReplLockManager

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
    level=logging.DEBUG,
    datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger("raft-test")

me = os.getenv("R_ME") 
others = os.getenv("R_OTHERS").split(",") 

logger.info("R_ME=" + me)
logger.info("R_OTHERS=" + str(others))

lockManager = ReplLockManager(autoUnlockTime=5)
syncObj = SyncObj(me, others, consumers=[lockManager])

logger.info("Waiting for the lock")
while True:
  try:
    if lockManager.tryAcquire('testLockName', sync=True):
      logger.info("Acquired the lock")
      time.sleep(15)
      syncObj.destroy()
      time.sleep(1)
      quit()
  except SyncObjException as e:
    logger.error(f"SyncObjException: '{e}'")
  1. Run first node: R_ME=127.0.0.1:5555 R_OTHERS=127.0.0.1:5556,127.0.0.1:5557 python run.py
  2. Wait 6+ seconds
  3. Run second node R_ME=127.0.0.1:5556 R_OTHERS=127.0.0.1:5555,127.0.0.1:5557 python run.py

Expected behaviour

Lock is acquired by one of the nodes.

Actual behaviour

Lock is acquired by both nodes:
Node 1 logs:

2019-09-01 00:39:13 INFO     R_ME=127.0.0.1:5555
2019-09-01 00:39:13 INFO     R_OTHERS=['127.0.0.1:5556', '127.0.0.1:5557']
2019-09-01 00:39:13 INFO     Waiting for the lock
2019-09-01 00:39:19 INFO     Acquired the lock

Node 2 logs:

2019-09-01 00:39:19 INFO     R_ME=127.0.0.1:5556
2019-09-01 00:39:19 INFO     R_OTHERS=['127.0.0.1:5555', '127.0.0.1:5557']
2019-09-01 00:39:19 INFO     Waiting for the lock
2019-09-01 00:39:19 INFO     Acquired the lock
bakwc commented

_ReplLockManagerImpl should return tuple - acquire result and acquire time. You should check this time manually in two places:

  1. inside tryAcquire, simmilar to your current fix
  2. inside callback, in case tryAcquire called in async mode.

BTW, Sory for long answer, if you no longer interested - I'll fix later myself.

kaxap commented

hey, thanks for the heads up. However, I don't quite understand how it'll help with the concerns you've expressed in #109 (comment)

bakwc commented

This is because now you don't check the replicated value inside your class (which can be replicated with a delay). You take it directly from a response. Responses are not replicated in raft journal - they are RPC, you are guaranteed to get the result returned by the function.