stackimpact/stackimpact-python

BufferError with Tornado and asyncio

adamrothman opened this issue · 4 comments

Problem

Since adding the StackImpact agent to a Tornado/asyncio application, we started seeing BufferErrors like this:

BufferError: Existing exports of data: object cannot be re-sized
  File "tornado/web.py", line 1513, in _execute
    self.finish()
  File "tornado/web.py", line 991, in finish
    self.flush(include_footers=True)
  File "tornado/web.py", line 947, in flush
    start_line, self._headers, chunk, callback=callback)
  File "tornado/http1connection.py", line 400, in write_headers
    self._pending_write = self.stream.write(data)
  File "tornado/iostream.py", line 406, in write
    self._handle_write()
  File "tornado/iostream.py", line 872, in _handle_write
    del self._write_buffer[:self._write_buffer_pos]

Environment

Our application runs inside a Docker container. We use:

  • Python 3.6.1
  • tornado 4.5.1
  • stackimpact 1.0.0

We start it like this (simplified for brevity):

import asyncio

from tornado.platform.asyncio import AsyncIOMainLoop
from tornado.httpserver import HTTPServer
from tornado.web import Application


if __name__ == '__main__':
    AsyncIOMainLoop().install()
    loop = asyncio.get_event_loop()

    app = Application(...)
    server = HTTPServer(app)
    server.listen(8080)

    internal_app = Application(...)
    internal_server = HTTPServer(app)
    internal_server.listen(8081)

    loop.run_forever()

Notes

Cursory googling turned up the same error happening in the same place, as reported in this comment on a Tornado pull request. I don't know enough about Tornado's internals to guess at what's going on here but maybe @bdarnell can provide more insight.

In any case, I'm happy to provide more details as needed.

Thanks for submitting the issue with the detailed info!

I've tried with a simple asyncio/tornado app, based on the code you provided, running for some time under high load. I get no errors and the agent reports everything it supposed to.

Is there any specific functionality in your app, which you think might be related and I need to try to reproduce? Any other modules?

One thing to try, if possible in your case, is to disable memory profiler by passing allocation_profiler_disabled = True on agent initialization. Just to make sure it is not caused by python's own allocation profiler, which the agent uses. Other profilers can be disabled as well setting the following flags to True: cpu_profiler_disabled, block_profiler_disabled, error_profiler_disabled.

Please note, we haven't yet worked on a full support of asyncio (only gevent). Especially time and server profilers might be affected.

I deployed a build of our app earlier this morning with allocation_profiler_disabled=True but unfortunately we still ran into the BufferError.

Our app is fairly complex with many 3rd-party modules but it's hard for me to say whether any of them might be contributing. From our requirements.txt:

aiobotocore==0.4.0
aiohttp==2.2.0
aioredis==1.0.0b1
bcrypt==3.1.3
botocore==1.5.70
hiredis==0.2.0
neo4j-driver==1.3.0
otherxyz-awslib==0.4.6
otherxyz-flowlib==0.4.1
otherxyz-srvlib==0.4.9
otherxyz-utils==0.2.21
phonenumbers==8.5.2
prometheus_client==0.0.19
PyStaticConfiguration==0.10.3
PyYAML==3.12
raven==6.1.0
stackimpact==1.0.0
tenacity==4.2.0
tornado==4.5.1

Most of the app is asynchronous. It communicates with clients via REST and WebSocket. We use asyncio-native libraries to interface with upstream services where possible (AWS via aiobotocore, Redis via aioredis). Without a good, native asyncio Neo4j driver available we run synchronous calls to neo4j-driver in a ThreadPoolExecutor like this (again, simplified for brevity):

import asyncio

from neo4j.v1 import GraphDatabase


class Neo4j:

    _driver = None

    @classmethod
    def get_driver(cls):
        if cls._driver is None:
            cls._driver = GraphDatabase.driver(...)
        return cls._driver

    async def run(cls, statement, *args, **kwargs):
        driver = cls.get_driver()

        def run():
            cls.pending_metric.inc()

            start = time()
            with driver.session() as session:
                result = session.run(statement, *args, **kwargs)
            latency = time() - start

            cls.pending_metric.dec()
            cls.latency_metric.observe(latency)

            return result

        loop = asyncio.get_event_loop()
        # Passing None uses the default executor
        return await loop.run_in_executor(None, run)

Apart from the above (which has never given us trouble), we had no other use of threading in the app prior to adding the StackImpact agent.

I was able to reproduce the issue, with and without the agent. What happens is that if we do current_frames = sys._current_frames() inside of a signal handler, the current_frames local variable isn't cleaned up soon enough, which causes the memoryview in iostream.py to complain, probably because there are some references kept in the thread frames longer than expected.

Cleaning up local variables explicitly in signal handler fixes the problem. The fix is released in agent v1.0.3.

Awesome, thank you! I'll try it out again today or Monday and report back.