python/cpython

functools.lru_cache keeps objects alive forever

thesheep opened this issue · 13 comments

BPO 19859
Nosy @rhettinger, @ncoghlan, @pitrou, @serhiy-storchaka
Files
  • 66c1c9f32567.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2013-12-04.21:24:57.035>
    created_at = <Date 2013-12-02.10:27:47.237>
    labels = ['type-feature', 'library']
    title = 'functools.lru_cache keeps objects alive forever'
    updated_at = <Date 2013-12-04.21:24:57.033>
    user = 'https://bugs.python.org/thesheep'

    bugs.python.org fields:

    activity = <Date 2013-12-04.21:24:57.033>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2013-12-04.21:24:57.035>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2013-12-02.10:27:47.237>
    creator = 'thesheep'
    dependencies = []
    files = ['32950']
    hgrepos = ['215']
    issue_num = 19859
    keywords = ['patch']
    message_count = 13.0
    messages = ['204995', '205120', '205146', '205147', '205148', '205175', '205178', '205208', '205213', '205218', '205220', '205223', '205248']
    nosy_count = 5.0
    nosy_names = ['rhettinger', 'ncoghlan', 'pitrou', 'serhiy.storchaka', 'thesheep']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = None
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue19859'
    versions = ['Python 3.5']

    Like most naïve "memoized" decorator implementations, lru_cache keeps references to all the argument values of the decorated function in its cache. That means that if we call such a decorated function with an object as a parameter, that object will be kept alive in memory forever -- that is, until the program ends. This is an obvious waste: once we no longer hold any other reference to that object, we can never call the function with the same parameter again, so the entry just wastes cache space.

    This is a very common case when we decorate a method -- the first parameter is "self". One solution for this particular case is a dedicated "memoized_method" decorator that stores the cache on the "self" object itself, so that the cache can be released together with the object.

    A more general solution uses weakrefs where possible in the cache key, maybe even with an additional callback that removes the cache entry when any of its parameters is dead. Obviously it adds some overhead and makes the caching decorator even slower, but it can let us save a lot of memory, especially in long-running applications.

    To better illustrate what I mean, here is an example of such an improved @memoized decorator that I wrote: https://review.openstack.org/#/c/54117/5/horizon/utils/memoized.py

    It would be great to have an option to do something similar with lru_cache, and if there is interest, I would like to work on that.
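
    For illustration, here is a minimal sketch of such a weakref-based memoizing decorator. It is not the linked OpenStack implementation; the names weak_memoized and _purge are purely illustrative. Arguments are held through weak references where possible (falling back to strong references for objects that cannot be weakly referenced), and a callback drops every cache entry that refers to an argument once that argument dies:

    import weakref
    from functools import wraps

    def weak_memoized(func):
        """Memoize func, holding its arguments weakly where possible (sketch)."""
        cache = {}

        def _purge(dead_ref):
            # Drop every cached entry whose key contains the dead reference.
            for key in [k for k in cache if dead_ref in k]:
                del cache[key]

        @wraps(func)
        def wrapper(*args):
            key = []
            for arg in args:
                try:
                    key.append(weakref.ref(arg, _purge))
                except TypeError:
                    # Ints, strings, tuples, ... cannot be weakly referenced.
                    key.append(arg)
            key = tuple(key)
            try:
                return cache[key]
            except KeyError:
                result = cache[key] = func(*args)
                return result

        return wrapper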

    Weak references make no sense in most cases, where the arguments disappear just after the function call. For the self argument of a method it makes more sense. Perhaps lru_cache could return not a function but a descriptor; that would allow special processing of the first argument.

    Or a new decorator, lru_cache_method, could be added.

    Perhaps the following technique can be used to avoid prolonging the object's lifetime:

    from functools import lru_cache

    def spam(self, *args, **kwargs):
        # On the first call, build a per-instance cached function; the cache
        # is reachable only through self, so it dies with the instance.
        @lru_cache(maxsize=20)
        def spam(foo, bar=1, *, baz=None):
            ...
        self.spam = spam  # shadow the method with the cached function
        return self.spam(*args, **kwargs)
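
    For example, wrapped in a toy class (the class name and return value are illustrative), the per-instance cache can be seen to disappear together with the instance:

    import gc
    import weakref
    from functools import lru_cache

    class Ham:
        def spam(self, *args, **kwargs):
            @lru_cache(maxsize=20)
            def spam(foo, bar=1, *, baz=None):
                return (foo, bar, baz)  # stand-in for the real work
            self.spam = spam            # shadow the method
            return self.spam(*args, **kwargs)

    obj = Ham()
    obj.spam(42)
    probe = weakref.ref(obj)
    del obj
    gc.collect()
    print(probe())  # None: the instance, and its cache, have been freed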

    I don't think this is an appropriate use of an LRU cache. There are other ways to "freeze" a method return value (typically by storing the result on the instance).

    Here's one way of doing it (taken from the source code for Itty https://pypi.python.org/pypi/itty/0.8.2 ):

    class lazyproperty(object):
        """A property whose value is computed only once."""
        def __init__(self, function):
            self._function = function

        def __get__(self, obj, _=None):
            if obj is None:
                return self
            value = self._function(obj)
            # Store the result on the instance; it shadows the descriptor,
            # so later accesses skip __get__ entirely.  (The original
            # Python 2 source uses func_name instead of __name__.)
            setattr(obj, self._function.__name__, value)
            return value

    Here is how it is used:

    class Request(object):
        """An object to wrap the environ bits in a friendlier way."""
        ...

        @lazyproperty
        def POST(self):
            return self.build_complex_dict()
        ...

    The method example is just the most common case where this problem can be easily seen, but not the only one. We indeed use the @cached_property decorator on properties (similar to https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/utils.py#L35), and I've actually written a @memoized_method decorator for methods that does pretty much what Serhiy suggests (except it's not repeated all over the place, a la copy-pasta, but kept in one decorator). That is fine, and @cached_property is actually superior, as it avoids a lookup and a check once the value has been calculated.
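
    A sketch of such a @memoized_method decorator might look like the following (the names are illustrative only, not the actual horizon code); the cache lives in an attribute of self, so it is released together with the instance:

    import functools

    def memoized_method(method):
        """Cache a method's results on the instance itself (sketch)."""
        @functools.wraps(method)
        def wrapper(self, *args):
            try:
                cache = self._memoized_method_cache
            except AttributeError:
                cache = self._memoized_method_cache = {}
            key = (method.__name__, args)
            try:
                return cache[key]
            except KeyError:
                result = cache[key] = method(self, *args)
                return result
        return wrapper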

    However, this still doesn't solve the problems that are encountered in practice in actual code, like here: https://github.com/openstack/horizon/blob/master/openstack_dashboard/api/neutron.py#L735

    Here we have a normal function, not a method, that calls a remote API through HTTP (the call is relatively slow, so we definitely want to cache it for multiple invocations). The function takes a request parameter, because it needs it for authentication with the remote service. The problem we had is that this will keep every single request in memory, because it's referenced by the cache.

    Somehow it feels wrong to store the cache on an arbitrary argument of the function, like the request in this case, and it's easy to imagine a function that takes two such critical arguments.

    This is the code that actually made me write the weakref version of the @memoized decorator that I linked initially, and I thought that it could also be useful to have that in Python's caching decorator as an option.

    I can understand if you think that this is too much, and that in such tricky situations the programmer should either write their own caching, or rewrite the code to avoid a memory leak. But I am sure that most programmers will not even think about this caveat. I think that we should at least add a note to lru_cache's documentation warning about this scenario and advising programmers to write their own caching decorators.
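
    The caveat itself is easy to demonstrate (a toy example, with Request standing in for a short-lived per-request object):

    import gc
    import weakref
    from functools import lru_cache

    class Request:
        """Stand-in for a short-lived per-request object."""

    @lru_cache(maxsize=None)
    def expensive_lookup(request):
        return id(request)  # stand-in for a slow remote API call

    req = Request()
    probe = weakref.ref(req)
    expensive_lookup(req)
    del req
    gc.collect()
    print(probe() is not None)  # True: the cache key still holds the request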

    Actually, after looking closer, my @memoize_method decorator does something completely different from what Serhiy suggested. Still, it only solves the common case of methods, and does nothing if you pass your short-lived objects as parameters other than self.

    Limiting the cache size is also not a solution in the practical example with request that I linked to in the previous comment, because we can't know in advance how many times per request the function is going to be called, picking an arbitrary number feels wrong and may lead to unexpected behaviors when other fragments of code change (like a sudden slowdown every N calls).

    Limiting the cache size is also not a solution in the
    practical example with request that I linked to in the
    previous comment, because we can't know in advance how
    many times per request the function is going to be called,
    picking an arbitrary number feels wrong and may lead to
    unexpected behaviors

    This suggests that you don't really want an LRU cache, which is specifically designed to limit the cache size by expelling the least recently used entry.

    At its heart, the cache decorator is all about mapping fixed inputs to fixed outputs. The memory conservation comes from the replacement strategy and an option to clear the cache entirely.

    The reason that my answer and Serhiy's answer don't fit your needs is that it isn't clear what you really want to do. I think you should move this discussion to StackOverflow so others can help you tease out your actual needs and suggest appropriate solutions. Ideally, you should start with real use cases rather than focusing on hacking up the LRU cache implementation.

    Thank you for your attention. I'm actually quite happy with the solution we have; it works well. That's actually why I thought it might be worthwhile to try to push it upstream into Python. I can totally understand why you don't want to add too much to the standard library; after all, everything you add there has to stay forever. So please consider this patch abandoned.

    But I think it would still be worthwhile to add a note to lru_cache's documentation, saying something like:

    """
    Warning! lru_cache keeps references to all the arguments for which it holds cached values, which prevents them from being freed from memory when there are no other references. This can lead to memory leaks when you call an lru_cache-decorated function on a lot of short-lived objects.
    """

    I suppose you can come up with a nicer phrasing.

    On 4 December 2013 20:15, Radomir Dopieralski <report@bugs.python.org> wrote:

    But I think it would still be worthwhile to add a note to lru_cache's documentation, saying something like:

    """
    Warning! lru_cache keeps references to all the arguments for which it holds cached values, which prevents them from being freed from memory when there are no other references. This can lead to memory leaks when you call an lru_cache-decorated function on a lot of short-lived objects.
    """

    Umm, that's part of the operational definition of a value-based cache -- it needs to keep things alive, so that if a different instance shows up with the same value, it will still get a cache hit.

    We're willing to put warnings in the docs for cases where it's easy to inadvertently introduce a security vulnerability, but not for situations like this, where using a container inappropriately may cause it to keep objects alive that you didn't intend to keep alive.

    Umm, that's part of the operational definition of a value-based cache -- it needs to keep things alive, so that if a different instance shows up with the same value, it will still get a cache hit.

    If it only kept the return value alive, that wouldn't be a problem; it's indeed intuitively obvious that it has to do that in order to work. But what many people fail to notice is that it also keeps alive any arguments that were passed to the function. This is not intuitive and, as I demonstrated with my patch, not even necessary, so I think it might be worthwhile to at least mention this little implementation quirk.

    Just adding a note in the documentation sounds like enough.

    I'm not in favor of filling the docs with warnings like this. It tends to make everything sound dangerous even when the tools are doing exactly what they are supposed to do.

    Every container (except for the weakref containers) keeps its references alive. That is how Python works. The LRU cache is no more special in this regard than a dictionary, list, or set. In addition, older entries get flushed out and freed as the LRU cache gains newer entries.
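
    For example (a toy sketch), once an entry is expelled from a bounded cache, the arguments it referenced become collectable again:

    import gc
    import weakref
    from functools import lru_cache

    class Obj:
        pass

    @lru_cache(maxsize=2)
    def f(x):
        return id(x)

    first = Obj()
    probe = weakref.ref(first)
    f(first)
    del first
    f(Obj())
    f(Obj())        # the third distinct call evicts the oldest entry
    gc.collect()
    print(probe())  # None: `first` was freed when its entry was expelled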

    [Radomir Dopieralski]

    So please consider this patch abandoned.

    Marking this as closed.