Port #426 Autobahn WebSocket server from ros1 to ros2
zflat opened this issue · 11 comments
Description
The issue described in #425 and fixed in #426 still persists in ros2 because the fix was not ported to ros2. There was a request made in #426 (comment) but I do not see any issue tracking the progress of porting the changes to ros2.
- Library Version: 1.1.2-1focal.20220126.193440
- ROS Version: Rolling
- Platform / OS: Ubuntu 20.04
Steps To Reproduce
I can reproduce issues by performing rapid connect/disconnect while topics are actively being published. I launch the websocket server while other nodes are publishing. Then I have a browser web page that opens a websocket connection and subscribes to topics using https://github.com/RobotWebTools/roslibjs. I can refresh the page rapidly a few times and then observe the websocket server print errors and then no longer accept new websocket connections.
Expected Behavior
The server should recover from websocket disconnect errors mid-write without locking the entire websocket server.
Actual Behavior
The server becomes locked due to the _write_lock never being released because of exception handling logic in the prewrite_message method.
Here is an example trace of the exception:
ERROR:tornado.application:Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f75d8f4d700>, <Future finished exception=WebSocketClosedError()>)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tornado/websocket.py", line 867, in write_message
    fut = self._write_frame(True, opcode, message, flags=flags)
  File "/usr/lib/python3/dist-packages/tornado/websocket.py", line 846, in _write_frame
    return self.stream.write(frame)
  File "/usr/lib/python3/dist-packages/tornado/iostream.py", line 570, in write
    self._check_closed()
  File "/usr/lib/python3/dist-packages/tornado/iostream.py", line 1112, in _check_closed
    raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/usr/lib/python3/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 779, in _discard_future_result
    future.result()
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/colcon_ws/install/rosbridge_server/lib/python3.8/site-packages/rosbridge_server/websocket_handler.py", line 197, in prewrite_message
    future = self.write_message(message, binary)
  File "/usr/lib/python3/dist-packages/tornado/websocket.py", line 262, in write_message
    return self.ws_connection.write_message(message, binary=binary)
  File "/usr/lib/python3/dist-packages/tornado/websocket.py", line 869, in write_message
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
ERROR:tornado.application:Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f75d8f2d0d0>, <Future finished exception=WebSocketClosedError()>)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/usr/lib/python3/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 779, in _discard_future_result
    future.result()
  File "/usr/lib/python3/dist-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/colcon_ws/install/rosbridge_server/lib/python3.8/site-packages/rosbridge_server/websocket_handler.py", line 197, in prewrite_message
    future = self.write_message(message, binary)
  File "/usr/lib/python3/dist-packages/tornado/websocket.py", line 259, in write_message
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
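The locking failure described above can be illustrated with a minimal, stdlib-only sketch. This is not the rosbridge_server code: `ClosedError`, `FakeConnection`, and `Writer` are hypothetical stand-ins, and a plain `threading.Lock` stands in for `_write_lock`. It only shows how a write that raises can leave a lock held, and how a `with` block avoids that.

```python
import threading


class ClosedError(Exception):
    """Hypothetical stand-in for tornado.websocket.WebSocketClosedError."""


class FakeConnection:
    """Hypothetical connection that raises once it has been closed."""

    def __init__(self):
        self.closed = False
        self.sent = []

    def write_message(self, message):
        if self.closed:
            raise ClosedError()
        self.sent.append(message)


class Writer:
    """Hypothetical writer guarding a connection with a lock."""

    def __init__(self, conn):
        self._conn = conn
        self._write_lock = threading.Lock()

    def write_leaky(self, message):
        # If write_message() raises, release() is never reached and the
        # lock stays held, so every later writer blocks forever.
        self._write_lock.acquire()
        self._conn.write_message(message)
        self._write_lock.release()

    def write_safe(self, message):
        # The with-block releases the lock on every exit path,
        # including when the connection is already closed.
        with self._write_lock:
            try:
                self._conn.write_message(message)
                return True
            except ClosedError:
                return False
```

With a closed connection, `write_leaky` propagates the error and leaves the lock held, while `write_safe` reports the failure and leaves the lock free for the next writer.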
Hey @zflat,
I had exactly the same problems that you mentioned. I improved things a little with #741, but it seems there are still some unsolved problems. On top of that, I had a mysterious bug with CycloneDDS on Foxy that caused a segmentation fault in destroy_subscription(). When switching to FastRTPS, the bug you described occurs less often for me.
All in all: I would totally support the switch to Autobahn server!
I think a good start for this task would be to create an integration test that rapidly creates new clients to demonstrate the problem. If the issue happens fairly reliably then we should be able to reproduce it in a test.
@jtbandes I just opened #745 to show a test that reproduces this issue. I chose the rate and message size and number of subscribers as a starting point that showed the issue on my machine. A smaller message size, fewer subscribers and slower rate may still show the error but I did not try to see where the limit was on my machine.
I don't know if we want to actually keep the test in #745 but it at least illustrates the issue I described.
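The general shape of such a test (churn through many short-lived clients, then check that a fresh client is still served) can be sketched as follows. This is not the test from #745: it uses a plain asyncio TCP echo server as a stand-in for the websocket server, and `handle`, `churn`, `still_accepting`, and `run_stress` are hypothetical names chosen for illustration.

```python
import asyncio


async def handle(reader, writer):
    """Echo handler standing in for the real websocket server."""
    try:
        while data := await reader.read(1024):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # the client went away mid-write; nothing else to do
    finally:
        writer.close()


async def churn(port, clients):
    """Open and immediately drop many connections, like rapid page refreshes."""
    for _ in range(clients):
        _, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"subscribe")
        writer.close()  # disconnect without waiting for any reply
        await writer.wait_closed()


async def still_accepting(port):
    """Return True if a fresh client can still connect and get an echo."""
    try:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"ping")
        await writer.drain()
        reply = await asyncio.wait_for(reader.readexactly(4), timeout=5)
        writer.close()
        await writer.wait_closed()
        return reply == b"ping"
    except (OSError, asyncio.TimeoutError, asyncio.IncompleteReadError):
        return False


async def run_stress(clients=50):
    # Port 0 lets the OS pick a free port for the test server.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        await churn(port, clients)
        return await still_accepting(port)
```

Calling `asyncio.run(run_stress())` returns a boolean that a test can assert on. A real integration test would replace the echo server with the launched rosbridge websocket node and the TCP clients with websocket clients that subscribe to actively published topics.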
@jtbandes I have tested the changes in #741, and on that branch I am unable to reproduce the issue I observed. I think that in addition to merging in the fixes from #741, we should still consider converting to the Autobahn WebSocket server. The reason is that the bug fixed in #741 was pretty subtle and hard to reproduce in testing. I think the websocket server should be able to recover from exceptions being thrown. Even if a specific client or request is not handled due to a bug like the one found in #741, the server should still be able to provide functionality to other clients.
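The isolation property asked for here, where one failing client does not take the others down, can be sketched with the standard library alone. `Broadcaster` and its methods are hypothetical names for illustration and do not correspond to rosbridge, Tornado, or Autobahn APIs.

```python
import logging

logger = logging.getLogger("broadcaster_sketch")


class Broadcaster:
    """Hypothetical fan-out that isolates failures per client."""

    def __init__(self):
        self._clients = {}

    def add(self, client_id, send):
        # `send` is any callable that delivers one message to one client.
        self._clients[client_id] = send

    def publish(self, message):
        """Send to every client; drop and report the ones that raise."""
        failed = []
        for client_id, send in list(self._clients.items()):
            try:
                send(message)
            except Exception:
                # Log and continue, so the remaining clients are still served.
                logger.exception("dropping client %s", client_id)
                failed.append(client_id)
        for client_id in failed:
            self._clients.pop(client_id, None)
        return failed
```

The design choice is to treat any exception from a single client as that client's problem: it is logged, the client is removed, and the loop carries on, so errors are not repeated on every later message.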
Has someone attempted a port already? Otherwise I will probably take this up
I don't know of any recent attempt. Maybe @jtbandes or another maintainer would know otherwise?
I've done some testing of the branch on #741 (with galactic and Fast-DDS) and it seems to make things MUUUCH more reliable. Unfortunately with enough very fast refreshing of clients it occurs again (it usually takes me about 30s of very fast refreshing to trigger it).
The server then throws the following error on each message it tries to send.
[rosbridge_websocket-3] ERROR:tornado.application:Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7ff49212a0d0>, <Future finished exception=WebSocketClosedError()>)
[rosbridge_websocket-3] Traceback (most recent call last):
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 758, in _run_callback
[rosbridge_websocket-3]     ret = callback()
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
[rosbridge_websocket-3]     return fn(*args, **kwargs)
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 779, in _discard_future_result
[rosbridge_websocket-3]     future.result()
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/gen.py", line 326, in wrapper
[rosbridge_websocket-3]     yielded = next(result)
[rosbridge_websocket-3]   File "/opt/greenroom/whiskey_ros_gs_external/lib/python3.8/site-packages/rosbridge_server/websocket_handler.py", line 197, in prewrite_message
[rosbridge_websocket-3]     future = self.write_message(message, binary)
[rosbridge_websocket-3]   File "/usr/lib/python3/dist-packages/tornado/websocket.py", line 259, in write_message
[rosbridge_websocket-3]     raise WebSocketClosedError()
[rosbridge_websocket-3] tornado.websocket.WebSocketClosedError
Interestingly I have not yet seen the server become unresponsive - it is just littered with these errors.
It looks like you are writing a fix, achim-k? If I can help you test, please let me know.
I think it makes sense to get #741 in first, before trying to migrate to autobahn. I will look into that
I tried unsuccessfully to get Rosbridge to work stably with ROS2 and ended up writing a bridge in C++. It is ROS2 only (tested on Galactic):
https://github.com/v-kiniv/rws
Give it a try if you have problems with Rosbridge and ROS2.
P.S. Sorry for the off-topic comment, but I think it might be useful for others who come here looking for a solution to the Rosbridge+ROS2 issues.
Similarly, Foxglove has also been working on a C++ bridge: https://github.com/foxglove/ros-foxglove-bridge It uses the Foxglove WebSocket protocol rather than the rosbridge protocol.
This issue has been marked as stale because there has been no activity in the past 12 months. Please add a comment to keep it open.