open-rmf/rmf-web

[Bug]: server fastapi stalls in production environment


Before proceeding, is there an existing issue or discussion for this?

OS and version

Ubuntu 22.04

Open-RMF installation type

Source build

Other Open-RMF installation methods

No response

Open-RMF version or commit hash

main, deploy/hammer

ROS distribution

Humble

ROS installation type

Docker

Other ROS installation methods

No response

Package or library, if applicable

No response

Description of the bug

This happens rarely, but becomes more likely as network traffic increases (more tasks ongoing, hence more task updates over the websocket).

Observations

  • dashboard becomes unusable and returns a 404 when refreshed; all REST calls show as pending in the browser's network tab (Inspect)
  • fleet adapter logs show the broadcast client unable to connect to the server URI; it starts disconnecting and reconnecting continuously
  • server logs show spurious connections on the internal websocket route (from the fleet adapter broadcast client) without matching disconnections, so the count of internal websockets keeps going up (see the sketch after this list)
  • server logs start showing token expiries
  • server performance: fastapi appears to be the component stalling, consistent with the pending REST calls from the dashboard
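
For illustration only (this is not the actual rmf-web api-server code): a minimal FastAPI sketch of an internal websocket route that counts its connections and guarantees the count is decremented even when a client drops abruptly. If cleanup like the finally block below is skipped on an abnormal disconnect, the connection count grows without bound, which matches the ever-increasing internal websocket count in the server logs.

```python
# Minimal sketch only -- not the actual rmf-web api-server implementation.
# Shows how an internal websocket route can track its connection count and
# guarantee cleanup whether the client disconnects cleanly or drops abruptly.
import logging

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
logger = logging.getLogger("internal_ws")

connected: set[int] = set()  # ids of live internal websocket connections


@app.websocket("/_internal")
async def internal_ws(websocket: WebSocket):
    await websocket.accept()
    conn_id = id(websocket)
    connected.add(conn_id)
    logger.info("internal ws connected, total=%d", len(connected))
    try:
        while True:
            # Fleet adapters push task/fleet state updates as JSON.
            update = await websocket.receive_json()
            # ... handle the update ...
    except WebSocketDisconnect:
        logger.info("internal ws disconnected cleanly")
    finally:
        # Without this, an abrupt drop leaves the counter inflated,
        # matching the ever-growing connection count in the logs.
        connected.discard(conn_id)
        logger.info("internal ws closed, total=%d", len(connected))
```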

Current solution

  • restarting the api-server resets everything and connections become healthy again

Steps to reproduce the bug

I have not been able to reproduce it myself, but the following steps are from @koonpeng:

Begin quote from @koonpeng

  1. Change packages/api-server/api_server/default_config.py host to 192.168.25.1
  2. On term1, run sudo ip addr add 192.168.25.1/24 dev lo
  3. On term2, cd to packages/dashboard and start the api-server with pnpm run start:rmf-server
  4. On term3, start rmf demos with limited CPU: ros2 launch rmf_demos_gz office.launch.xml headless:=true server_uri:=ws://192.168.25.1:8000/_internal
  5. On term1, send a patrol task: ros2 launch rmf_demos office_patrol.launch.xml, then wait a few secs
  6. Remove the IP to simulate the network going down: sudo ip addr del 192.168.25.1/24 dev lo, and wait a few secs
  7. Add the IP back to simulate the network recovering: sudo ip addr add 192.168.25.1/24 dev lo

After the last step, the logs get spammed with a lot of

[fleet_adapter-15] [ERROR] [1707276604.567910352] [tinyRobot_fleet_adapter]: BroadcastClient unable to publish message: invalid state

which is finally followed by

[fleet_adapter-15] [WARN] [1707276605.301806431] [tinyRobot_fleet_adapter]: BroadcastClient unable to connect to [ws://192.168.25.1:8000/_internal]. Please make sure server is running. Error msg: invalid state
[fleet_adapter-15] [INFO] [1707276605.304396331] [tinyRobot_fleet_adapter]: BroadcastClient successfully connected to uri: [ws://192.168.25.1:8000/_internal]

Sometimes it stops after one reconnect, but sometimes it gets stuck in a loop like we see in prod.

I think we can say that the broadcast client cannot recover from a disconnect, but the question remains: what caused the initial disconnect and the token expiry?

End quote
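
Regarding the reconnection point in the quote: the real BroadcastClient is C++ code inside rmf_fleet_adapter, so the following is only a Python analogue (using the websockets package) of the recovery behaviour it appears to lack, namely discarding the failed connection object on any error instead of publishing over a socket that is already in an invalid state, and reconnecting with a capped backoff. SERVER_URI and the outgoing message queue are placeholders.

```python
# Python analogue of a self-recovering broadcast client; sketch only.
# The actual BroadcastClient is C++ inside rmf_fleet_adapter.
import asyncio
import json
import logging

import websockets  # pip install websockets

SERVER_URI = "ws://192.168.25.1:8000/_internal"  # placeholder
logger = logging.getLogger("broadcast_client")


async def broadcast(queue: asyncio.Queue) -> None:
    backoff = 1.0
    while True:
        try:
            # A fresh connection object on every attempt: never publish
            # over a socket that is already in an invalid state.
            async with websockets.connect(SERVER_URI) as ws:
                logger.info("connected to %s", SERVER_URI)
                backoff = 1.0  # reset only after a successful connect
                while True:
                    msg = await queue.get()
                    await ws.send(json.dumps(msg))
        except (OSError, websockets.WebSocketException) as err:
            logger.warning("connection lost (%s), retrying in %.1fs", err, backoff)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30.0)  # capped exponential backoff
```

A producer coroutine would put fleet and task state updates onto the queue while broadcast() runs; on any network drop the client simply sleeps and reconnects instead of looping on a dead socket.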

Expected behavior

  • server continues to serve REST requests without stalling
  • internal websocket connections remain at the expected number (1 or 2, depending on whether server_uri was provided to the task_dispatcher)

Actual behavior

  • no known way to reproduce at the moment (currently investigating a bad network as the cause)
  • spurious connections from the BroadcastClient on the internal websocket route without proper disconnections, causing the websocket count to increase
  • fastapi stalls; all REST calls from the dashboard show as pending in the browser's network tab (see the probe sketch after this list)
  • dashboard becomes unusable
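
To confirm the stall from outside the browser, a small probe can poll a REST route with a short timeout and report when responses stop arriving. A minimal sketch, assuming a reachable GET route such as /tasks on the api-server; the URL, route, and any auth headers would need to be adjusted for the actual deployment.

```python
# Minimal stall probe, sketch only. Assumes the api-server exposes a GET
# route such as /tasks; adjust the URL, route and auth for your deployment.
import time

import requests  # pip install requests

API_URL = "http://localhost:8000/tasks"  # placeholder


def probe(interval: float = 5.0, timeout: float = 3.0) -> None:
    while True:
        start = time.monotonic()
        try:
            resp = requests.get(API_URL, timeout=timeout)
            latency = time.monotonic() - start
            print(f"ok status={resp.status_code} latency={latency:.2f}s")
        except requests.Timeout:
            # Matches the "pending" REST calls in the browser: the server
            # accepts the connection but never sends a response.
            print(f"STALLED: no response within {timeout:.1f}s")
        except requests.RequestException as err:
            print(f"request failed: {err}")
        time.sleep(interval)


if __name__ == "__main__":
    probe()
```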

Additional information or screenshots

No response

The preliminary observation is that this is due to appending events/logs to the task phases when a task alert is acknowledged.

Removing this feature seems to make the server much more stable.

Will keep observing the performance before closing this