[202405][dualtor] Orchagent is going down during switchover
Description
When performing a switchover (say active to standby or vice versa), we observe the orchagent process going down, leaving the mux status in an inconsistent state.
Based on observations from the debug logs, we suspected that using the bulker to program routes/neighbors during switchover (introduced by PR #3148) is the problem, and confirmed this by re-running the tests after reverting that PR.
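For context, the bulker referenced above batches route/neighbor programming instead of issuing one call per entry. The following is a purely conceptual Python sketch of that pattern (hypothetical names; not the actual muxorch/SAI `EntityBulker` code):

```python
# Conceptual sketch only -- hypothetical names, not the muxorch/SAI code.
class RouteBulker:
    """Queue route entries and program them in a single batched call."""

    def __init__(self, program_bulk):
        # program_bulk(entries) -> list of per-entry statuses (assumption)
        self._program_bulk = program_bulk
        self._pending = []

    def create_entry(self, prefix, nexthop):
        # Instead of programming the route immediately, queue it.
        self._pending.append((prefix, nexthop))

    def flush(self):
        # Send all queued entries at once and wait for the bulk response;
        # the caller must handle any per-entry failures in the statuses.
        statuses = self._program_bulk(self._pending)
        self._pending.clear()
        return statuses
```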
Steps to reproduce the issue:
- Run any sonic-mgmt test (e.g. `tests/dualtor_io/test_link_failure.py`) that performs a switchover, for example using the `toggle_all_simulator_ports_to_rand_selected_tor` fixture or a similar fixture that performs a switchover during test setup (a minimal test sketch is shown below).
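For reference, here is a minimal sketch of a test that triggers the switchover via the fixture named above. Everything apart from the fixture name `toggle_all_simulator_ports_to_rand_selected_tor` is an illustrative assumption about a standard sonic-mgmt dualtor setup:

```python
import logging

import pytest

logger = logging.getLogger(__name__)


@pytest.mark.topology("dualtor")
def test_switchover_smoke(rand_selected_dut,
                          toggle_all_simulator_ports_to_rand_selected_tor):
    """Sketch: the toggle fixture performs the switchover during test setup.

    If orchagent crashes during the switchover, setup fails with
    "Failed to toggle all ports to ... from mux simulator" before this
    test body ever runs.
    """
    logger.info("Switchover completed on %s", rand_selected_dut.hostname)
```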
Describe the results you received:
- Tests fail with `Failed to toggle all ports to <tor_device> from mux simulator` because the mux status is left in an inconsistent state:
```
def _toggle_all_simulator_ports_to_target_dut(target_dut_hostname, duthosts, mux_server_url, tbinfo):
    """Helper function to toggle all ports to active on the target DUT."""
    ...
    if not is_toggle_done and \
        not utilities.wait_until(120, 10, 0, _check_toggle_done, duthosts, target_dut_hostname, probe=True):
>       pytest_assert(False, "Failed to toggle all ports to {} from mux simulator".format(target_dut_hostname))
E       Failed: Failed to toggle all ports to ld301 from mux simulator
```
- The orchagent process in the swss docker container is down (can be verified with `ps aux` inside the swss container; a helper for this check is sketched below).
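A hedged helper for running that check from a sonic-mgmt test, assuming an Ansible-backed `duthost` object with a `shell()` method (the helper name is hypothetical):

```python
def orchagent_is_running(duthost):
    """Sketch: return True if orchagent is running inside the swss container.

    Equivalent to manually running `docker exec swss ps aux` and looking for
    the orchagent process; `module_ignore_errors` returns a non-zero rc
    instead of raising when the process is missing.
    """
    result = duthost.shell(
        "docker exec swss ps aux | grep -v grep | grep orchagent",
        module_ignore_errors=True,
    )
    return result["rc"] == 0 and "orchagent" in result["stdout"]
```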
Describe the results you expected:
Switchover should have completed without any failures.
Additional information you deem important:
Below are some of the debug logs captured during the switchover:
```
2024 Sep 18 17:47:12.847339 gd377 NOTICE swss#orchagent: :- nbrHandler: Processing neighbors for mux Ethernet200, enable 0, state 2
2024 Sep 18 17:47:12.847339 gd377 INFO swss#orchagent: :- updateRoutes: Updating routes pointing to multiple mux nexthops
...
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- addRoutes: Adding route entry 192.168.0.44, nh 400000000167a to bulker
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- create_entry: EntityBulker.create_entry 1, 2, 1
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- addRoutes: Adding route entry fc02:1000::2c, nh 400000000167a to bulker
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- create_entry: EntityBulker.create_entry 2, 2, 1
2024 Sep 18 17:47:12.851834 gd377 DEBUG swss#orchagent: :> redis_bulk_create_route_entry: enter
2024 Sep 18 17:47:12.851834 gd377 DEBUG swss#orchagent: :> bulkCreate: enter
...
...
2024 Sep 18 17:47:12.881418 gd377 DEBUG swss#orchagent: :> waitForBulkResponse: enter
...
...
2024 Sep 18 17:47:12.886416 gd377 DEBUG swss#orchagent: :- processReply: got message: ["switch_shutdown_request","{\"switch_id\":\"oid:0x21000000000000\"}"]
...
...
2024 Sep 18 17:48:12.935572 gd377 DEBUG swss#orchagent: :> on_switch_shutdown_request: enter
2024 Sep 18 17:48:12.935597 gd377 ERR swss#orchagent: :- on_switch_shutdown_request: Syncd stopped
2024 Sep 18 17:48:12.946670 gd377 INFO swss#supervisord 2024-09-18 17:48:12,945 WARN exited: orchagent (exit status 1; not expected)
```
Based on the debug logs captured across multiple test runs, we suspect that the use of the bulker entity is causing orchagent to go down. We re-ran the tests after reverting PR #3148 ([muxorch] Using bulker to program routes/neighbors during switchover), and the tests pass.
@prsunny @Ndancejic can you assess this issue?
@bingwang-ms FYI