tkem/cachetools

TTLCache - Odd crash on cache item set

Closed this issue · 9 comments

Describe the bug
Just got the following crash log in production (cannot reproduce in dev), where a cache set for a specific key has become permanently "stuck", raising an unhandled exception every time cache[key] = value is executed.

Expected result
No exception, or documented exception.

Actual result
Exception

Reproduction steps
Code is more or less like this:

from cachetools import TTLCache

CACHE_KEY = "prefix:%s"
CACHE_TIME = 10

cache = TTLCache(maxsize=256, ttl=CACHE_TIME)

def some_fun():
    # ... some code
    cache_key = CACHE_KEY % f"{key_1}:{key_2}:{key_3}"
    exists = cache.get(cache_key)

    if exists:
        return exists
    else:
        # grab some info from somewhere
        exists = my_expensive_data
        cache[cache_key] = (exists, other_data)

The cache set call fails with the following:

Traceback (most recent call last):
  ...redacted....
  File "/var/app/current/file.py", line 80, in some_fun
    cache[cache_key] = (my_expensive_data, other_data)
  File "/var/app/venv/staging-LQM1lest/lib/python3.7/site-packages/cachetools/__init__.py", line 426, in __setitem__
    self.expire(time)
  File "/var/app/venv/staging-LQM1lest/lib/python3.7/site-packages/cachetools/__init__.py", line 480, in expire
    cache_delitem(self, curr.key)
  File "/var/app/venv/staging-LQM1lest/lib/python3.7/site-packages/cachetools/__init__.py", line 94, in __delitem__
    del self.__data[key]
KeyError: 'prefix:Token:XXX:YYY'

Any thoughts on why this would fail? This code has been running like this for years and it's the first time I've observed this behaviour.

tkem commented

@cristianoccazinsp: Any chance that you are accessing the cache from multiple threads concurrently without proper locking?

So far, I've seen these kinds of "strange" and hard-to-reproduce errors only when using multithreading without proper synchronization. Of course, there might be some undiscovered bugs lurking within TTLCache, but please check your threads and locks before we investigate this further.

@tkem thanks for the reply. You are probably correct; this is being used in a Django project for authentication caching, so I guess we have threads competing for cache access.

I really thought the cache was thread-safe, though. Any advice for making it thread-safe? Using a lock on read/write will effectively serialize all requests, possibly making it worse than not using a cache at all.

Further, any chance the lines the crash comes from can be reviewed? It looks like the code always assumes there's an item to delete and never catches an error there. It's hard to believe an error like that renders the whole cache useless until it's re-created.

tkem commented

https://cachetools.readthedocs.io/en/latest/#cache-implementations:

Please be aware that all these classes are not thread-safe. Access to a shared cache from multiple threads must be properly synchronized, e.g. by using one of the memoizing decorators with a suitable lock object.

In cachetools, caches are just "mutable mappings", and these are just not thread-safe in Python (unless stated otherwise).

Note that in CPython, you might get away with using dict unsynchronized, but usually only as long as you are not using custom hash keys. So this is more a feature of the implementation than of the language spec.
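
For reference, a rough sketch of the decorator approach the docs mention, adapted to the example above (the key-building lambda and the fetch_data helper are placeholders, not from the original code):

from threading import Lock

from cachetools import TTLCache, cached
from cachetools.keys import hashkey

cache = TTLCache(maxsize=256, ttl=10)
lock = Lock()

# The decorator acquires the lock around every access to the shared cache,
# so concurrent calls cannot corrupt TTLCache's internal bookkeeping.
@cached(cache, key=lambda k1, k2, k3: hashkey(f"prefix:{k1}:{k2}:{k3}"), lock=lock)
def get_info(k1, k2, k3):
    # placeholder for the expensive lookup
    return fetch_data(k1, k2, k3)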

Well, this code is using CPython and is similar to the above; wouldn't the GIL prevent any unexpected issues due to concurrency? I still agree that the code should probably add synchronization, but the bug may still be there regardless and is worth reviewing.

Are decorators the only way to provide a lock object?

tkem commented

The GIL basically prevents context switches in native code, but may switch contexts in your Python code. That's why you usually get away with using dict unsynchronized from multiple threads, at least if you're only using standard types as keys. If you provide your own key types (with custom hash functions), or start iterating over dict items while the dict is modified by another thread, you'll experience similar problems.

How to deal with this - efficiently - depends on your use case. There's no general recipe for making a mutable mapping "thread-safe" (whatever that means in your context), at least none that I'm aware of, that's also efficient and takes e.g. iter() into account. The best thing cachetools has to offer is indeed the decorator approach, because that only needs to take care of limited ways to access the underlying cache.
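
If you'd rather not use the decorators, rolling your own locking looks roughly like this (compute_value is a made-up placeholder); the point is to hold the lock for every read and write of the shared cache while keeping the expensive work outside of it:

import threading

from cachetools import TTLCache

cache = TTLCache(maxsize=256, ttl=10)
lock = threading.Lock()

def lookup(cache_key):
    # Hold the lock while reading so no other thread mutates the cache mid-access.
    with lock:
        cached_value = cache.get(cache_key)
    if cached_value is not None:
        return cached_value
    # Do the expensive work outside the lock to avoid serializing all requests.
    value = compute_value(cache_key)
    with lock:
        cache[cache_key] = value
    return value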

'nuff said ;-)

Besides synchronization, shouldn't the following line at least fail gracefully or check that the key to expire is indeed there?

  File "/var/app/venv/staging-LQM1lest/lib/python3.7/site-packages/cachetools/__init__.py", line 94, in __delitem__
    del self.__data[key]

Sorry for the formatting, writing from a phone.

tkem commented

This could be (and has been) discussed at length (please search the closed issues for further information), but my ultimate answer is: no.

This would - maybe - cure the symptom (KeyError being raised), but not the cause (multithreaded access without synchronization).

By the time you encounter this, chances are that TTLCache's underlying data structures are already corrupted, which may result in other, even harder-to-detect errors, e.g. memory leaks.

Thanks, I will look into adding synchronization and report back if the error ever happens again. It's funny, though, that the server had been running without corruption for a year or so and this only happened recently!

tkem commented

That's why they call it "race condition" ;-)

Please feel free to re-open if this error persists.