tkem/cachetools

cachedmethod not working as expected with locks

npezolano opened this issue · 4 comments

The example below runs, but it doesn't behave as expected: calc gets called multiple times for the same n, when it should be computed once under the lock and then served from the cache. I tried both Lock and RLock and got the same result. The cache itself does work: if I rerun res = obj.multithreaded_run() afterwards, everything is served from the cache. The problem shows up when the function being cached takes up to a few seconds to run and is needed by multiple threads at once; the lock doesn't prevent the duplicate calls, and it gets worse with more threads. This is a follow-up on the earlier suggestion in #274.

from multiprocessing.pool import ThreadPool as Pool

from threading import Lock, RLock
import cachetools
import operator
import time
import numpy as np

cores = 4

class DataObj(object):
    def __init__(self, enable_cache=True, **kwargs):
        self.enable_cache = enable_cache
        self.lock = RLock()
        self.cache = cachetools.LRUCache(maxsize=200)

    @cachetools.cachedmethod(operator.attrgetter('cache'), lock=operator.attrgetter('lock'))
    def calc(self, n): # Should only be called once for each n
        print(f'running n : {n}')
        sleeptime = np.random.uniform(0, 1) 
        time.sleep(sleeptime) # Sleep timer to simulate loading some data
        return n

class RunnerObj(object):
    def __init__(self, data_obj=None, window=60, size=100, **kwargs):
        self.data_obj = data_obj
        self.window = window
        self.size = size

    def run(self, data):
        n = 0
        for i in data:
            n += self.data_obj.calc(i)
        return n

    def multithreaded_run(self):
        seq = list(range(self.size))
        window_size = self.window
        data = []
        # Create example window of data to run
        for i in range(len(seq) - window_size + 1):
            data.append(seq[i: i + window_size])

        with Pool(cores) as pool:
            res = pool.map(self.run, data)
        return res


obj = RunnerObj(data_obj=DataObj())
res = obj.multithreaded_run()

See the attached screenshot for an example; each n should only print once.

7677e50 looks to fix the issues I'm facing above, but causes a small performance impact.

tkem commented

@npezolano: I have to admit the title of this issue, "not working as expected with locks", is probably accurate, sorry.
However, it is AFAIK the same behavior the standard library's @lru_cache decorator shows, and any PRs trying to fix this have been rejected due to added complexity and runtime overhead. See #224 for a somewhat thorough discussion.
So, sorry, unless someone comes up with some truly ingenious idea, this will be left open.
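
For context, this is roughly the pattern both @cachedmethod and @lru_cache follow (a simplified sketch, not the actual cachetools or CPython code): the lock only guards the cache lookup and store, so two threads that miss at the same time will both call the wrapped function.

def cached_call(cache, lock, func, key, *args):
    with lock:
        try:
            return cache[key]      # hit: return the cached value
        except KeyError:
            pass                   # miss: fall through and compute
    value = func(*args)            # called OUTSIDE the lock, so two threads
                                   # that miss concurrently both run func
    with lock:
        cache[key] = value         # whichever thread finishes last wins
    return value

Holding the lock across the call would prevent the duplicates, but it would also serialize all cache misses, which is exactly the overhead the rejected PRs ran into.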

tkem commented

On a final note: yes, it's an issue, and probably one that should at least be stated in the docs.
However, if this is a real issue for your individual use case, it's probably also an issue in your back-end service, i.e. this would arise even without caching (probably more so), and maybe should be handled there...
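
If you really need compute-once semantics for your use case, one possible workaround on your side is a per-key lock around the expensive call. A rough sketch follows; the class name, the lock bookkeeping, and the _load placeholder are all illustrative, not something cachetools provides:

from collections import defaultdict
from threading import Lock

import cachetools

class OncePerKeyDataObj:
    def __init__(self):
        self.cache = cachetools.LRUCache(maxsize=200)
        self.cache_lock = Lock()            # guards the cache and the lock table
        self.key_locks = defaultdict(Lock)  # one lock per key (grows unbounded here)

    def calc(self, n):
        with self.cache_lock:
            key_lock = self.key_locks[n]    # pick (or create) the per-key lock
        with key_lock:                      # only one thread computes a given n
            with self.cache_lock:
                if n in self.cache:
                    return self.cache[n]    # another thread already computed it
            value = self._load(n)           # slow work happens outside cache_lock
            with self.cache_lock:
                self.cache[n] = value
            return value

    def _load(self, n):
        # placeholder for the real back-end call / data load
        return n

Each key still gets computed at most once, while misses for different keys can run in parallel.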

I'm sure removing thread safety everywhere in the documentation will fix the issue @tkem