Memory leak on GAE Standard Python 2.7
remko opened this issue · 11 comments
I am experiencing what seems to be a memory leak when using google-cloud-ndb 1.7.3 (and older) on GAE Standard Python 2.7.
The simplest reproducer I could find was this: I have a (webapp2) handler that does the following:
class MyModel(ndb.Model):
    _use_global_cache = False
    _use_cache = False

class DBHandler(webapp2.RequestHandler):
    def get(self):
        global key
        for i in range(500):
            with client.context():
                if key is None:
                    key = MyModel.query().get().key
                else:
                    key.get()
(I'm creating multiple contexts to get the most out of one request; in reality, I have only one context per request.)
I have deployed this to an F4_1G instance with max_instances set to 1 (so it doesn't scale), and hit the handler endpoint in a serial loop (so no parallelism). After about an hour of querying this endpoint (each call takes about 10s), I get the following error:
Exceeded soft memory limit of 2048 MB with 2051 MB after servicing 469 requests total. Consider setting a larger instance class in app.yaml.
This sounds like a memory leak?
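For reference, the serial loop driving the endpoint is nothing special; a minimal sketch of it (the app URL is a placeholder):

import time
import urllib2  # plain HTTP client; the loop runs outside GAE

URL = "https://myapp.appspot.com/db"  # placeholder for the DBHandler route

while True:
    start = time.time()
    urllib2.urlopen(URL).read()   # one request at a time, no parallelism
    print "request took %.1fs" % (time.time() - start)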
In another reproducer, I only have one context per request (as normal). After hitting this endpoint 1000 times, I used pympler to go through all strings staying behind in memory (pympler does a GC when walking through memory) and counted the occurrences of identical strings. This is what I get (in the form <count>: <string>):
6944: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/tasklets.py", line 315, in _advance_tasklet
yielded = self.generator.send(send_value)
6944: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/tasklets.py", line 494, in tasklet_wrapper
future._advance_tasklet()
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_query.py", line 340, in has_next_async
yield self._next_batch() # First time
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/query.py", line 1214, in wrapper
return wrapped(self, *dummy_args, _options=query_options)
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_api.py", line 107, in make_call
return rpc_call()
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_retry.py", line 80, in retry_wrapper
result = callback(*args, **kwargs)
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_query.py", line 113, in fetch
while (yield results.has_next_async()):
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/utils.py", line 146, in positional_wrapper
return wrapped(*args, **kwds)
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_query.py", line 370, in _next_batch
response = yield _datastore_run_query(query)
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/grpcio-1.0.0/grpc/_channel.py", line 696, in _run_channel_spin_thread
channel_spin_thread.start()
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_dist/lib/python2.7/threading.py", line 749, in start
_start_new_thread(self.__bootstrap, ())
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/1/google/appengine/runtime/runtime.py", line 165, in HandleRequest
error)
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/grpcio-1.0.0/grpc/_channel.py", line 502, in future
None, 0, self._method, None, deadline_timespec)
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 339, in HandleRequest
return WsgiRequest(environ, handler_name, url, post_data, error).Handle()
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/1/google/appengine/runtime/runtime.py", line 81, in StartNewThread
return base_start_new_thread(Run, ())
992: File "/base/data/home/apps/e~myapp/reproducer/main.py", line 14, in get
Student.query().get()
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_api.py", line 88, in rpc_call
call = method.future(request, timeout=timeout)
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/query.py", line 2102, in get_async
results = yield _datastore_query.fetch(options)
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 271, in Handle
result = handler(dict(self._environ), self._StartResponse)
992: File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7186b0a736e8/python27/python27_lib/versions/third_party/grpcio-1.0.0/grpc/_channel.py", line 728, in create_channel_managed_call
_run_channel_spin_thread(state, key)
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/query.py", line 2068, in get
return self.get_async(_options=kwargs["_options"]).result()
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/_datastore_query.py", line 971, in _datastore_run_query
"RunQuery", request, timeout=query.timeout
992: File "/base/data/home/apps/e~myapp/reproducer/lib/google/cloud/ndb/utils.py", line 114, in wrapper
return wrapped(*args, **new_kwargs)
(i.e. most of these strings occur 992 times after hitting the endpoint 1000 times).
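For reference, the counts above were gathered roughly like this, by walking all live objects with pympler and tallying identical strings (a sketch, not the exact script I ran):

from collections import Counter
from pympler import muppy

all_objects = muppy.get_objects()   # pympler performs a GC while gathering objects
strings = [o for o in all_objects if isinstance(o, str)]
for text, count in Counter(strings).most_common(40):
    print "%d: %s" % (count, text.strip()[:120])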
These repeated strings all seem to come from backtraces being tracked in memory on every NDB operation. Looking at the bottom of the call stack in those strings, this seems to point to grpcio creating threads, which in turn seemed to point to this issue. However, I am already on the latest version of all libraries, I confirmed that the gc.collect() workaround does not help (as expected, since I expect pympler to do a gc.collect() itself), and AFAICT the number of threads is not increasing either.
When looking for backlinks of a string in that list, I get the following:
<type 'str'>-+-['__setattr__', '__reduce_ex__', '__new__', '__reduce__', '_--<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
+-['__setattr__', '__reduce_ex__', '__new__', '__reduce__', '_--<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
+-[set([]), " permanent = kwargs.pop('_permanent', True--<listiterator object at 0x3e714d8cf8d0>
+-<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
+-[' File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c7-+-['__setattr__', '__reduce_ex__', '__new__', '__reduce__', '_--<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
+-([' File "/base/alloc/tmpfs/dynamic_runtimes/python27g/f26c--['__setattr__', '__reduce_ex__', '__new__', '__reduce__', '_--<frame object at 0x3e7100cfa200>--<frame object at 0x3e714d9cc650>
AFAICT, all paths except one come from the same frame object; the other path comes from a listiterator object. It seems that the information about the stack trace is staying behind, but not the thread itself.
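The backlink dump above came from pympler's reference browser, roughly like this (a sketch; the string matched here is just one of the leaked traceback lines):

from pympler import muppy, refbrowser

all_objects = muppy.get_objects()
# pick one of the repeated traceback strings as the root to trace referrers from
target = next(o for o in all_objects
              if isinstance(o, str) and "_advance_tasklet" in o)
refbrowser.ConsoleBrowser(target, maxdepth=3).print_tree()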
Perhaps @chrisrossi has further thoughts, but I wonder if you have code showing how you are creating ndb client instances?
@crwilcox The reproducer just uses the default constructor. This is what was omitted:
class MyModel(ndb.Model):
    _use_global_cache = False
    _use_cache = False

client = None

class DBHandler(webapp2.RequestHandler):
    def get(self):
        global key
        global client
        if client is None:
            client = ndb.Client()
        for i in range(500):
            with client.context():
                if key is None:
                    key = MyModel.query().get().key
                else:
                    key.get()
@remko Thanks for reporting this issue. I created a reproduction script with some functionality to print memory usage and dump the Python object diff: https://gist.github.com/lidizheng/8224837fa83dff617f4b0c09b18b17bf
The result shows that every run of the DBHandler consumes 10MB of actual memory but no additional virtual memory:
- https://gist.github.com/lidizheng/3f68726c987e8abb758df2d6e74f26b8
- https://gist.github.com/lidizheng/37f6600ad4448803b355fa976bddf310
However, the virtual memory size is 1.2G, which seems too big for a simple script. An idle Python process uses 24MB of virtual memory.
I suspect the memory usage will plateau after it uses all of its virtual memory, which is after 120 runs of the handler. I guess we will need to look into our library to see which statements claim that amount of memory.
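A quick way to start narrowing that down is to snapshot VMS around the imports; a rough sketch (assuming psutil and the client libraries are installed locally):

import os
import psutil

proc = psutil.Process(os.getpid())

def vms_mb():
    return proc.memory_info().vms / 1024.0 / 1024.0

baseline = vms_mb()
import grpc
print "grpc import: +%.1f MB VMS" % (vms_mb() - baseline)

before_ndb = vms_mb()
from google.cloud import ndb
print "ndb import: +%.1f MB VMS" % (vms_mb() - before_ndb)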
@lidizheng Thanks for looking into this.
Unfortunately, I don't see the plateauing happening. The reproducer illustrates that it crashes the F4_1G instance after a while.
We use this on a Python 2.7 GAE Standard app in production and were experiencing constant crashes due to out-of-memory errors. We had to move from F2 instances to F4_1G instances before the crashes started disappearing (and they probably only disappear because the instances are shut down for other reasons before hitting a memory problem; as the reproducer shows, it can still crash F4_1G instances). However, this is a very expensive workaround, which is why I tracked it down and reported it here.
Looking forward to your findings.
Hi @lidizheng
I was able to reproduce the memory leak on the GAE Python 2.7 environment. I don't think you would be able to reproduce it locally, and strongly suspect it is something in the runtime. The email that @crwilcox forwarded to you about stack traces piling up in memory is from my investigation of this issue. Due to the stack traces piling up, usage of real memory does increase monotonically with time. I can share my test code and methodology with you if you like.
The over-claiming of virtual memory is an interesting finding and definitely worth investigating, but may be a separate issue.
@remko To rule out other possibilities, could you instrument the problematic GAE deployment to dump its memory usage and Python objects? The info dump can be triggered either periodically or via a handler, like in my reproduction script.
It requires two external dependencies:
pip install -U pympler psutil
Below is a minimal version of the memory tracing functions I like to use. You can invoke the print_memory() function to print the actual memory used, the virtual memory, and the Python object diff.
import os
import gc
import psutil
from pympler import muppy, summary, tracker, classtracker

tr = tracker.SummaryTracker()            # tracks the diff of live objects between calls
process = psutil.Process(os.getpid())
start_memory = process.memory_info()     # baseline snapshot taken at import time
last_memory = start_memory
print 'Start Memory Usage:', start_memory

def format_memory(name, used_memory, last_memory):
    diff_memory = used_memory - last_memory
    print name, ":", used_memory, "%.2fMB" % (used_memory / 1024.0 / 1024.0), "%+.2f" % diff_memory, "%+.2fMB" % (diff_memory / 1024.0 / 1024.0)

def print_memory():
    # force a collection so only genuinely reachable objects are counted
    gc.collect()
    global last_memory
    current_memory = process.memory_info()
    format_memory('RSS', current_memory.rss, last_memory.rss)
    format_memory('VMS', current_memory.vms, last_memory.vms)
    last_memory = current_memory
    sum1 = tr.diff()
    summary.print_(sum1)
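A minimal sketch of wiring this into the reproducer's handler (assuming the snippet above lives in a module called memtrace; the name is hypothetical):

import webapp2
from memtrace import print_memory   # hypothetical module holding the snippet above

class DBHandler(webapp2.RequestHandler):
    def get(self):
        # ... existing NDB query code ...
        print_memory()   # dump RSS/VMS deltas and the pympler object diff per request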
In my local run, the memory usage plateaued after 11 runs. Here is the log of 145 runs: https://gist.github.com/lidizheng/17c75b87915f502498f59d9b2d367775
After 11 runs, the actual memory usage swings between 170 MB and 200 MB and doesn't increase indefinitely. The common causes of Python memory leaks are threads and global variables.
Based on the tracemalloc result provided by @chrisrossi, threads might be leaked by ndb or grpcio. What I observe is that after 145 runs there are only 16 threads. The increasing number of threads at the beginning might be a side effect of the ThreadPoolExecutor being hydrated.
PID NLWP COMMAND
580082 16 python a.py
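If ps isn't available (as on GAE), the same check can be done from inside the process; a sketch:

import threading

def log_thread_count():
    # counts Python-level threads; the gRPC channel spin thread is started via
    # threading (see the stack trace above), so it shows up here
    print "threads: %d" % threading.active_count()
    for t in threading.enumerate():
        print "  %s (daemon=%s)" % (t.name, t.daemon)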
In the provided snippet, I see some usage of global variables. I want to rule out the possibility of leaking global containers, hence I hope you can instrument the deployment.
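To be concrete, this is the kind of leaking global container pattern I'd like to rule out (a hypothetical example, not taken from your code):

import webapp2
# `client` and `MyModel` as defined in the reproducer above

_seen_results = []   # module-level container that is never cleared

class LeakyHandler(webapp2.RequestHandler):
    def get(self):
        with client.context():
            # every request appends an entity (and everything it references)
            # to a global list, so memory grows without bound
            _seen_results.append(MyModel.query().get())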
I will run this in GAE Python 2.7. The above results were gathered on Python 2 on Linux.
Hi @lidizheng
I just wanted to make sure that you saw my note above. I'm reasonably certain that I've already found the objects that are leaking. Let me know how I can help.
Chris
@lidizheng Given @chrisrossi 's comment, I assume I don't need to do the experiment anymore, and you can get more information from him?
I doubt the experiment would lead to any insights: I used pympler to get my original findings, and also tracked the memory usage, and the memory usage did not plateau (as I mentioned, an instance doing just the code above went over 2 GB of memory; our real production code also never plateaus).
@chrisrossi and I discussed offline what the leaked objects are. Our finding aligns with your description:
These repeated strings all seem to come from backtraces being tracked in memory on every NDB operation. Looking at the bottom of the call stack in those strings, this seems to point to grpcio creating threads, which in turn seemed to point to this issue. However, I am already on the latest version of all libraries, I confirmed that the gc.collect() workaround does not help (as expected, since I expect pympler to do a gc.collect() itself), and AFAICT the number of threads is not increasing either.
There will be only one spin thread moving information between C++ and Python, and none of us observed an ever-increasing number of threads. pympler tracks all Python objects, so I was hoping to learn what is accumulating.
My local environment uses the latest gRPC Python 1.37.x, but according to the stack trace above, GAE is using grpcio 1.0.0, which is very, very old. This could be the reason we are seeing different results. I'm not fully sure why GAE is using 1.0.0, but it could be a backward-compatibility mode for GAE + Python 2. CC @crwilcox, do you know if we can upgrade the version of grpcio in GAE with Python 2?
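In the meantime, a quick way to confirm which grpcio the deployed app actually loads is a throwaway handler like this (a sketch):

import grpc
import webapp2

class VersionHandler(webapp2.RequestHandler):
    def get(self):
        # shows the loaded grpcio version and where it was imported from
        self.response.write("grpcio %s (%s)" % (grpc.__version__, grpc.__file__))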
+1 I'd also greatly appreciate it if grpcio could be upgraded from 1.0.0 to a more recent version in GAE Python 2.
FYI, I myself am no longer in need of a fix for this, as I was finally able to migrate away from Python 2.7. So, I won't be able to verify any fixes made (and I will not raise hell if this is closed 😉)