BufferError with Tornado and asyncio
adamrothman opened this issue · 4 comments
Problem
Since adding the StackImpact agent to a Tornado/asyncio application, we started seeing BufferErrors like this:
BufferError: Existing exports of data: object cannot be re-sized
  File "tornado/web.py", line 1513, in _execute
    self.finish()
  File "tornado/web.py", line 991, in finish
    self.flush(include_footers=True)
  File "tornado/web.py", line 947, in flush
    start_line, self._headers, chunk, callback=callback)
  File "tornado/http1connection.py", line 400, in write_headers
    self._pending_write = self.stream.write(data)
  File "tornado/iostream.py", line 406, in write
    self._handle_write()
  File "tornado/iostream.py", line 872, in _handle_write
    del self._write_buffer[:self._write_buffer_pos]
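For context, this is the error CPython raises when a bytearray is resized while a memoryview export of its memory is still alive, which is what Tornado's del self._write_buffer[:...] does to its write buffer. A minimal reproduction, independent of Tornado:

```python
buf = bytearray(b"0123456789")
view = memoryview(buf)  # creates a live export of buf's memory

try:
    del buf[:4]  # resizing buf while the export exists is forbidden
except BufferError as e:
    print(e)  # Existing exports of data: object cannot be re-sized

view.release()     # drop the export
del buf[:4]        # now the resize succeeds
print(bytes(buf))  # b'456789'
```

So anything that keeps a memoryview of the write buffer alive longer than Tornado expects can trigger this error.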
Environment
Our application runs inside a Docker container. We use:
- Python 3.6.1
- tornado 4.5.1
- stackimpact 1.0.0
We start it like this (simplified for brevity):
import asyncio

from tornado.platform.asyncio import AsyncIOMainLoop
from tornado.httpserver import HTTPServer
from tornado.web import Application

if __name__ == '__main__':
    AsyncIOMainLoop().install()
    loop = asyncio.get_event_loop()

    app = Application(...)
    server = HTTPServer(app)
    server.listen(8080)

    internal_app = Application(...)
    internal_server = HTTPServer(internal_app)
    internal_server.listen(8081)

    loop.run_forever()
Notes
Cursory googling turned up the same error happening in the same place, as reported in this comment on a Tornado pull request. I don't know enough about Tornado's internals to guess at what's going on here but maybe @bdarnell can provide more insight.
In any case, I'm happy to provide more details as needed.
Thanks for submitting the issue with the detailed info!
I've tried a simple asyncio/Tornado app based on the code you provided, running for some time under high load. I get no errors, and the agent reports everything it's supposed to.
Is there any specific functionality in your app which you think might be related and which I should try to reproduce? Any other modules?
One thing to try, if possible in your case, is to disable the memory profiler by passing allocation_profiler_disabled=True on agent initialization, just to make sure the error is not caused by Python's own allocation profiler, which the agent uses. The other profilers can be disabled as well by setting the following flags to True: cpu_profiler_disabled, block_profiler_disabled, error_profiler_disabled.
Please note, we haven't yet worked on full support for asyncio (only gevent). The time and server profilers especially might be affected.
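For reference, disabling the profilers at initialization would look something like the following sketch (the agent_key and app_name values are placeholders):

```python
import stackimpact

agent = stackimpact.start(
    agent_key='agent key here',
    app_name='MyApp',
    allocation_profiler_disabled=True,  # Python allocation profiler off
    cpu_profiler_disabled=True,
    block_profiler_disabled=True,
    error_profiler_disabled=True,
)
```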
I deployed a build of our app earlier this morning with allocation_profiler_disabled=True, but unfortunately we still ran into the BufferError.
Our app is fairly complex with many 3rd-party modules, but it's hard for me to say whether any of them might be contributing. From our requirements.txt:
aiobotocore==0.4.0
aiohttp==2.2.0
aioredis==1.0.0b1
bcrypt==3.1.3
botocore==1.5.70
hiredis==0.2.0
neo4j-driver==1.3.0
otherxyz-awslib==0.4.6
otherxyz-flowlib==0.4.1
otherxyz-srvlib==0.4.9
otherxyz-utils==0.2.21
phonenumbers==8.5.2
prometheus_client==0.0.19
PyStaticConfiguration==0.10.3
PyYAML==3.12
raven==6.1.0
stackimpact==1.0.0
tenacity==4.2.0
tornado==4.5.1
Most of the app is asynchronous. It communicates with clients via REST and WebSocket. We use asyncio-native libraries to interface with upstream services where possible (AWS via aiobotocore, Redis via aioredis). Without a good native asyncio Neo4j driver available, we run synchronous calls to neo4j-driver in a ThreadPoolExecutor like this (again, simplified for brevity):
import asyncio
from time import time

from neo4j.v1 import GraphDatabase

class Neo4j:
    _driver = None

    @classmethod
    def get_driver(cls):
        if cls._driver is None:
            cls._driver = GraphDatabase.driver(...)
        return cls._driver

    @classmethod
    async def run(cls, statement, *args, **kwargs):
        driver = cls.get_driver()

        def run():
            cls.pending_metric.inc()
            start = time()
            with driver.session() as session:
                result = session.run(statement, *args, **kwargs)
            latency = time() - start
            cls.pending_metric.dec()
            cls.latency_metric.observe(latency)
            return result

        loop = asyncio.get_event_loop()
        # Passing None uses the default executor
        return await loop.run_in_executor(None, run)
Apart from the above (which has never given us trouble), we had no other use of threading in the app prior to adding the StackImpact agent.
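For illustration, the offloading pattern above reduces to this self-contained sketch, where blocking_query is a hypothetical stand-in for a synchronous driver call:

```python
import asyncio
import time

def blocking_query():
    # Hypothetical stand-in for a synchronous neo4j-driver call
    time.sleep(0.01)
    return 'result'

async def main():
    loop = asyncio.get_running_loop()
    # Passing None runs the callable in the loop's default ThreadPoolExecutor
    return await loop.run_in_executor(None, blocking_query)

print(asyncio.run(main()))  # result
```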
I was able to reproduce the issue, with and without the agent. What happens is that if we call current_frames = sys._current_frames() inside of a signal handler, the current_frames local variable isn't cleaned up soon enough, which causes the memoryview in iostream.py to complain, probably because some references are kept in the thread frames longer than expected.
Cleaning up the local variables explicitly in the signal handler fixes the problem. The fix is released in agent v1.0.3.
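The fix amounts to something like the following sketch (the handler body is illustrative, not the agent's actual code):

```python
import signal
import sys

def sample_handler(signum, frame):
    # Snapshot every thread's current frame, as a sampling profiler does
    current_frames = sys._current_frames()
    sample_handler.thread_count = len(current_frames)
    # Drop the references explicitly; otherwise the frame objects (and any
    # buffer exports reachable from their locals) can outlive the handler
    del current_frames, frame

signal.signal(signal.SIGUSR1, sample_handler)
signal.raise_signal(signal.SIGUSR1)  # deliver the signal synchronously
print(sample_handler.thread_count >= 1)  # True
```

Note that SIGUSR1 and signal.raise_signal (Python 3.8+) are used here only to make the sketch self-contained on POSIX systems.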
Awesome, thank you! I'll try it out again today or Monday and report back.