googleapis/python-ndb

Memory leak on GAE Standard Python 2.7

remko opened this issue · 11 comments

remko commented

I am experiencing what seems to be a memory leak when using google-cloud-ndb 1.7.3 (and older) on GAE Standard Python 2.7.

The simplest reproducer I could find was this: I have a (webapp2) handler that does the following:

class MyModel(ndb.Model):
  _use_global_cache = False
  _use_cache = False

class DBHandler(webapp2.RequestHandler):
  def get(self):
    global key
    for i in range(500):
      with client.context():
        if key is None:
          key = MyModel.query().get().key
        else:
          key.get()

(I'm creating multiple contexts to get the most out of one request; in reality, I have only one context per request.)
I have deployed this to an F4_1G instance with max_instances set to 1 (so it doesn't scale), and hit the handler endpoint in a serial loop (so there is no parallelism). After about an hour of querying this endpoint (each call takes about 10s), I get the following error:

 Exceeded soft memory limit of 2048 MB with 2051 MB after servicing 469 requests total. Consider setting a larger instance class in app.yaml.

This sounds like a memory leak?
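
(For completeness, the serial loop hitting the endpoint is roughly the following; the URL is a placeholder, not my real endpoint:)

import urllib2

ENDPOINT = 'https://myapp.appspot.com/db'  # placeholder URL

for i in range(1000):
  # One request at a time, so there is no parallelism.
  urllib2.urlopen(ENDPOINT).read()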

In another reproducer, I only have 1 context per request (as normal). After hitting this endpoint 1000 times, I used pympler to go through all strings that stay behind in memory (pympler does a GC when walking memory) and counted the occurrences of identical strings. This is what I get (in the form of <count>: <string>):

  6944:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/tasklets.py", line 315, in _advance_tasklet
            yielded = self.generator.send(send_value)
        6944:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/tasklets.py", line 494, in tasklet_wrapper
            future._advance_tasklet()
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
            rv = self.router.dispatch(request, response)
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_query.py", line 340, in has_next_async
            yield self._next_batch()  # First time
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
            return route.handler_adapter(request, response)
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/query.py", line 1214, in wrapper
            return wrapped(self, *dummy_args, _options=query_options)
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_api.py", line 107, in make_call
            return rpc_call()
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_retry.py", line 80, in retry_wrapper
            result = callback(*args, **kwargs)
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_query.py", line 113, in fetch
            while (yield results.has_next_async()):
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/utils.py", line 146, in positional_wrapper
            return wrapped(*args, **kwds)
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_query.py", line 370, in _next_batch
            response = yield _datastore_run_query(query)
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/grpcio-1.0.0/grpc/_channel.py", line 696, in _run_channel_spin_thread
            channel_spin_thread.start()
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_dist/lib/python2.7/threading.py", line 749, in start
            _start_new_thread(self.__bootstrap, ())
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
            return method(*args, **kwargs)
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
            return handler.dispatch()
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/1/google/appengine/runtime/runtime.py", line 165, in HandleRequest
            error)
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/grpcio-1.0.0/grpc/_channel.py", line 502, in future
            None, 0, self._method, None, deadline_timespec)
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 339, in HandleRequest
            return WsgiRequest(environ, handler_name, url, post_data, error).Handle()
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/1/google/appengine/runtime/runtime.py", line 81, in StartNewThread
            return base_start_new_thread(Run, ())
        992:   File "/base/data/home/apps/e~myapp/reproducer/main.py", line 14, in get
            Student.query().get()
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_api.py", line 88, in rpc_call
            call = method.future(request, timeout=timeout)
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/query.py", line 2102, in get_async
            results = yield _datastore_query.fetch(options)
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 271, in Handle
            result = handler(dict(self._environ), self._StartResponse)
        992:   File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/grpcio-1.0.0/grpc/_channel.py", line 728, in create_channel_managed_call
            _run_channel_spin_thread(state, key)
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/query.py", line 2068, in get
            return self.get_async(_options=kwargs["_options"]).result()
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_query.py", line 971, in _datastore_run_query
            "RunQuery", request, timeout=query.timeout
        992:   File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/utils.py", line 114, in wrapper
            return wrapped(*args, **new_kwargs)

(i.e. most of these strings occur 992 times after hitting the endpoint 1000 times).
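
(For context, counts like these can be gathered with a small muppy sketch along the following lines; this is illustrative rather than the exact script I used:)

from collections import Counter
from pympler import muppy

def count_leaked_strings(top=30):
  # Snapshot all live objects (muppy collects garbage first), keep only the
  # strings, and count identical values.
  strings = [o for o in muppy.get_objects() if isinstance(o, str)]
  for text, count in Counter(strings).most_common(top):
    print '%d: %s' % (count, text[:120])
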
These repeated strings all seem to come from backtraces being kept in memory on every NDB operation. Searching for the bottom of the call stack in those strings points to grpcio creating threads, which in turn seemed to point to this issue. However, I am already on the latest version of all the libraries, I confirmed that the gc.collect() workaround does not help (as expected, since pympler already does a gc.collect()), and AFAICT the number of threads is not increasing either.

When looking for backlinks of a string in that list, I get the following:

    <type 'str'>-+-['__setattr__', '__reduce_ex__', '__new__', '__reduce__', '_--<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
                 +-['__setattr__', '__reduce_ex__', '__new__', '__reduce__', '_--<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
                 +-[set([]), "        permanent = kwargs.pop('_permanent', True--<listiterator object at 0x3e714d8cf8d0>
                 +-<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
                 +-['  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7-+-['__setattr__', '__reduce_ex__', '__new__', '__reduce__', '_--<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
                                                                                +-(['  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c--['__setattr__', '__reduce_ex__', '__new__', '__reduce__', '_--<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>

AFAICT, all paths except one come from the same frame object. The other path comes from a listiterator object. It seems that the information about the stack trace is staying behind, but not the thread itself.
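
(A backlink tree like the one above can be produced with pympler's reference browser, roughly like this, where leaked is one of the repeated traceback strings:)

from pympler import refbrowser

def print_backlinks(leaked):
  # Walk the referrers of the leaked string up to a small depth and print a tree.
  refbrowser.ConsoleBrowser(leaked, maxdepth=3).print_tree()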

crwilcox commented

Perhaps @chrisrossi has further thoughts, but I wonder if you have code showing how you are creating ndb client instances?

remko commented

@crwilcox The reproducer just uses the default constructor. This is what was omitted:

import webapp2
from google.cloud import ndb

class MyModel(ndb.Model):
  _use_global_cache = False
  _use_cache = False

client = None
key = None

class DBHandler(webapp2.RequestHandler):
  def get(self):
    global key
    global client

    if client is None:
      client = ndb.Client()

    for i in range(500):
      with client.context():
        if key is None:
          key = MyModel.query().get().key
        else:
          key.get()

lidizheng commented

@remko Thanks for reporting this issue. I created a reproducer script with some functionality to print memory usage and dump the difference in Python objects: https://gist.github.com/lidizheng/8224837fa83dff617f4b0c09b18b17bf

The result shows that every run of the DBHandler consumes 10MB of actual (resident) memory but no additional virtual memory.

However, the virtual memory size is 1.2G, which seems too big for a simple script. An idle Python process uses 24MB of virtual memory.

I suspect the memory usage will plateau once it has used all of its virtual memory, which would be after about 120 runs of the handler. I guess we will need to look into our library to see which statements claim that much memory.

remko commented

@lidizheng Thanks for looking into this.

Unfortunately, I don't see the plateauing happening. The reproducer illustrates that it crashes the F4_1G instance after a while.

We use this in a Python 2.7 GAE Standard app in production, and were experiencing constant crashes due to out-of-memory errors. We had to move from F2 instances to F4_1G instances before the crashes stopped (and they probably only stopped because the instances are shut down for other reasons before hitting the memory limit, as the reproducer shows that F4_1G instances can still crash). However, this is a very expensive workaround, which is why I tracked the problem down and reported it here.

Looking forward to your findings.

chrisrossi commented

Hi @lidizheng

I was able to reproduce the memory leak on the GAE Python 2.7 environment. I don't think you would be able to reproduce it locally, and strongly suspect it is something in the runtime. The email that @crwilcox forwarded to you about stack traces piling up in memory is from my investigation of this issue. Due to the stack traces piling up, usage of real memory does increase monotonically with time. I can share my test code and methodology with you if you like.

The over-claiming of virtual memory is an interesting finding and definitely worth investigating, but it may be a separate issue.

lidizheng commented

@remko To rule out other possibilities, I wonder if it would be possible for you to instrument the problematic GAE deployment to dump its memory usage and Python objects. The dump can be triggered periodically or via a handler, like in my reproducer script.

It requires two external dependencies:

pip install -U pympler psutil

Below is a minimal version of the memory tracing functions I like to use. You can invoke print_memory() to print the actual (resident) memory used, the virtual memory, and a diff of Python objects.

import os
import gc

import psutil
from pympler import muppy, summary, tracker, classtracker

tr = tracker.SummaryTracker()
process = psutil.Process(os.getpid())
start_memory = process.memory_info()
last_memory = start_memory
print 'Start Memory Usage:', start_memory

def format_memory(name, used_memory, last_used):
  # Print the absolute usage and the delta since the previous snapshot
  # (float division so the MB values are not truncated).
  diff_memory = used_memory - last_used
  print name, ":", used_memory, "%.2fMB" % (used_memory / 1024.0 / 1024), "%+.2f" % diff_memory, "%+.2fMB" % (diff_memory / 1024.0 / 1024)

def print_memory():
  gc.collect()
  global last_memory
  current_memory = process.memory_info()
  format_memory('RSS', current_memory.rss, last_memory.rss)
  format_memory('VMS', current_memory.vms, last_memory.vms)
  last_memory = current_memory
  # Print the diff of Python objects since the previous tracker snapshot.
  sum1 = tr.diff()
  summary.print_(sum1)
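
For example, the dump could be triggered on demand via a handler along these lines (the handler class and route below are illustrative, not part of your app):

import webapp2

class MemoryDumpHandler(webapp2.RequestHandler):
  def get(self):
    print_memory()
    self.response.write('memory stats written to the logs\n')

app = webapp2.WSGIApplication([('/debug/memory', MemoryDumpHandler)])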

In my local run, the memory usage plateaued after 11 runs. Here is the log of 145 runs: https://gist.github.com/lidizheng/17c75b87915f502498f59d9b2d367775

After 11 runs, the actual memory usage swings between 170MB and 200MB and does not increase indefinitely. The most common causes of Python memory leaks are threads and global variables.

Based on the tracemalloc result provided by @chrisrossi, threads might be leaked by ndb or grpcio. What I observe is that after 145 runs there are only 16 threads. The increase in threads at the beginning might be a side effect of the ThreadPoolExecutor being hydrated.

    PID NLWP COMMAND
 580082   16 python a.py
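
For reference, the Python-level thread count can also be checked from inside the process (a small sketch, not part of the reproducer script; note that ps counts native threads too, so the numbers may differ):

import threading

def print_thread_count():
  # threading.enumerate() lists the live Python-level threads, including
  # gRPC's channel spin thread started via threading.py in the traceback above.
  threads = threading.enumerate()
  print 'threads:', len(threads)
  for t in threads:
    print ' -', t.name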

In the provided snippet I see some use of global variables. I want to rule out the possibility of leaking global containers, hence my hope that you can instrument the deployment.


I will run this on GAE Python 2.7 as well; the results above were gathered on Python 2 on Linux.

chrisrossi commented

Hi @lidizheng

I just wanted to make sure that you saw my note above. I'm reasonably certain that I've already found the objects that are leaking. Let me know how I can help.

Chris

remko commented

@lidizheng Given @chrisrossi's comment, I assume I don't need to do the experiment anymore, and that you can get more information from him?

I doubt the experiment would lead to any new insights: I used pympler to get my original findings and also tracked memory usage, which did not plateau (as I mentioned, an instance running just the code above went over 2GB of memory; our real production code never plateaus either).

lidizheng commented

@chrisrossi and I discussed offline what the leaked objects are. Our finding aligns with your description:

These repeated strings all seem to come from backtraces being kept in memory on every NDB operation. Searching for the bottom of the call stack in those strings points to grpcio creating threads, which in turn seemed to point to this issue. However, I am already on the latest version of all the libraries, I confirmed that the gc.collect() workaround does not help (as expected, since pympler already does a gc.collect()), and AFAICT the number of threads is not increasing either.

There will only be one spin thread moving information between C++ and Python, and none of us observed an ever-increasing number of threads. pympler tracks all Python objects; I was hoping to learn what is accumulating.

My local environment uses the latest gRPC Python 1.37.x, but according to the stacktrace above, GAE is using grpcio 1.0.0, which is very, very old. This could be the reason we are seeing different results. I'm not entirely sure why GAE is using 1.0.0, but it could be a backward-compatibility measure for GAE + Python 2. CC @crwilcox, do you know if we can upgrade the version of grpcio in GAE with Python 2?
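
(For reference, the grpcio version the runtime actually loads can be confirmed from any handler with something like this; the output goes to the request logs:)

import grpc

# Print the version of the grpcio package bundled with the runtime.
print 'grpcio version:', grpc.__version__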

+1 I'd also greatly appreciate it if grpcio could be upgraded from 1.0.0 to a more recent version on GAE Python 2.

remko commented

FYI, I myself am no longer in need of a fix for this, as I was finally able to migrate away from Python 2.7. So, I won't be able to verify any fixes made (and I will not raise hell if this is closed 😉)