sonic-net/sonic-sairedis

[202205] FDB learning caused orchagent exited when reboot

shuaishang opened this issue · 7 comments

On SONiC 202205, orchagent (portsorch) occasionally crashes during initialization.
The reason is that an FDB entry is learned during reboot, which adds a reference count to the default bridge port ID.
portsorch then calls removeDefaultBridgePorts, which fails because of the reference count.

2023 Jul 25 19:16:59.229376 NOTICE swss#orchagent: removeDefaultVlanMembers:801: Remove 34 VLAN members from default VLAN
2023 Jul 25 19:16:59.254058 ERR swss#orchagent: meta_generic_validation_remove:2978: object 0x3a000000000082 reference count is 1, can't remove
2023 Jul 25 19:16:59.254058 ERR swss#orchagent: removeDefaultBridgePorts:851: Failed to remove bridge port, rv:-17
2023 Jul 25 19:16:59.255135 INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
2023 Jul 25 19:16:59.255135 INFO swss#supervisord: orchagent   what():  PortsOrch initialization failure
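
The failure mode described above can be illustrated with a toy model. This is only a sketch, not the sairedis metadata code: the class `ToyMeta` and its methods are hypothetical names, and mapping the logged `rv:-17` to `SAI_STATUS_OBJECT_IN_USE` is an assumption based on the SAI status code values.

```python
# Toy model of the reference-count check; NOT the real sairedis metadata layer.
SAI_STATUS_SUCCESS = 0
SAI_STATUS_OBJECT_IN_USE = -17  # assumed to be the meaning of rv:-17 in the log


class ToyMeta:
    """Tracks how many objects reference each OID (hypothetical helper)."""

    def __init__(self):
        self.refcount = {}  # OID -> number of objects pointing at it
        self.fdb = {}       # (bvid, mac) -> bridge port OID

    def create(self, oid):
        self.refcount[oid] = 0

    def fdb_learned(self, bvid, mac, bridge_port):
        # A learned FDB entry points at its bridge port, so it takes a reference.
        self.fdb[(bvid, mac)] = bridge_port
        self.refcount[bridge_port] += 1

    def fdb_remove(self, bvid, mac):
        bridge_port = self.fdb.pop((bvid, mac))
        self.refcount[bridge_port] -= 1

    def remove(self, oid):
        # Removal is refused while anything still references the object.
        if self.refcount[oid] > 0:
            return SAI_STATUS_OBJECT_IN_USE
        del self.refcount[oid]
        return SAI_STATUS_SUCCESS


meta = ToyMeta()
bridge_port = 0x3A000000000082
meta.create(bridge_port)
meta.fdb_learned(0x26000000000031, "52:54:00:A1:C3:B0", bridge_port)
first_try = meta.remove(bridge_port)   # blocked by the learned entry
meta.fdb_remove(0x26000000000031, "52:54:00:A1:C3:B0")
second_try = meta.remove(bridge_port)  # allowed once the reference is gone
```

In this model the first removal is refused and the second succeeds, which is why an FDB entry learned before removeDefaultBridgePorts runs is enough to abort PortsOrch initialization.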

SONiC 201911 fixed this issue before:
#572

But this fix was removed in 202205 and master.

@stephenxs Do you have any idea about this issue? Why was the fix deleted from the 202205 and master branches?

FDB learning is disabled before reboot, so there should be no learning message unless this was a race condition. Do you have the syslog and sairedis.rec from that timestamp?

When the system boots up, the default FDB learning behavior depends on the vendor SAI/SDK.
Orchagent has no chance to disable it before "PortsOrch::PortsOrch" calls "removeDefaultVlanMembers".
On our system, we did see an FDB event immediately after the switch was created:

2023-07-25.19:16:53.783625|A|SAI_STATUS_SUCCESS
2023-07-25.19:16:53.786185|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x55c0a7627db0|SAI_SWITCH_ATTR_FDB_UNICAST_MISS_PACKET_ACTION=SAI_PACKET_ACTION_DROP|SAI_SWITCH_ATTR_FDB_BROADCAST_MISS_PACKET_ACTION=SAI_PACKET_ACTION_DROP|SAI_SWITCH_ATTR_FDB_MULTICAST_MISS_PACKET_ACTION=SAI_PACKET_ACTION_DROP|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x55c0a7627dc0|SAI_SWITCH_ATTR_BFD_SESSION_STATE_CHANGE_NOTIFY=0x55c0a7627f30|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x55c0a7627de0|SAI_SWITCH_ATTR_QUEUE_PFC_DEADLOCK_NOTIFY=0x55c0a7627e50|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=00:A0:C9:12:34:56|SAI_SWITCH_ATTR_CAPABILITY_EXTENSION=1:3204448703
2023-07-25.19:16:53.786632|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2023-07-25.19:16:59.154007|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x3000000000024
2023-07-25.19:16:59.154041|n|port_state_change|[{"port_id":"oid:0x1000000000012","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2023-07-25.19:16:59.154623|n|port_state_change|[{"port_id":"oid:0x1000000000004","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2023-07-25.19:16:59.154660|c|SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x6000000000649|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x3000000000024|SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_LOOPBACK|SAI_ROUTER_INTERFACE_ATTR_MTU=9100
2023-07-25.19:16:59.154724|n|port_state_change|[{"port_id":"oid:0x1000000000023","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2023-07-25.19:16:59.154755|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000031\",\"mac\":\"52:54:00:A1:C3:B0\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000082"}]}]|
2023-07-25.19:16:59.154837|n|port_state_change|[{"port_id":"oid:0x1000000000012","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2023-07-25.19:16:59.154851|n|port_state_change|[{"port_id":"oid:0x1000000000023","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2023-07-25.19:16:59.156049|q|attribute_capability|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|OBJECT_TYPE=SAI_OBJECT_TYPE_QUEUE|ATTR_ID=SAI_QUEUE_ATTR_PFC_DLR_INIT
2023-07-25.19:16:59.157095|Q|attribute_capability|SAI_STATUS_SUCCESS|OBJECT_TYPE=SAI_OBJECT_TYPE_QUEUE|ATTR_ID=SAI_QUEUE_ATTR_PFC_DLR_INIT|CREATE_IMP=false|SET_IMP=true|GET_IMP=false
2023-07-25.19:16:59.157154|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_MAX_NUMBER_OF_TEMP_SENSORS=0
2023-07-25.19:16:59.157355|G|SAI_STATUS_NOT_SUPPORTED|
2023-07-25.19:16:59.157422|q|attribute_capability|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|OBJECT_TYPE=SAI_OBJECT_TYPE_PORT|ATTR_ID=SAI_PORT_ATTR_TPID
2023-07-25.19:16:59.157612|Q|attribute_capability|SAI_STATUS_SUCCESS|OBJECT_TYPE=SAI_OBJECT_TYPE_PORT|ATTR_ID=SAI_PORT_ATTR_TPID|CREATE_IMP=true|SET_IMP=true|GET_IMP=true
2023-07-25.19:16:59.157644|q|attribute_capability|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|OBJECT_TYPE=SAI_OBJECT_TYPE_LAG|ATTR_ID=SAI_LAG_ATTR_TPID
2023-07-25.19:16:59.157836|Q|attribute_capability|SAI_STATUS_SUCCESS|OBJECT_TYPE=SAI_OBJECT_TYPE_LAG|ATTR_ID=SAI_LAG_ATTR_TPID|CREATE_IMP=false|SET_IMP=false|GET_IMP=false
2023-07-25.19:16:59.163658|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_CPU_PORT=oid:0x0
2023-07-25.19:16:59.164116|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_CPU_PORT=oid:0x1000000000034
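
For anyone who wants to check their own recording for the same symptom, here is a minimal sketch that pulls learned FDB events out of sairedis.rec-style records. It assumes the pipe-separated layout seen in the excerpt above (timestamp, operation code, name, JSON payload); the function name `learned_fdb_events` is hypothetical and not part of any SONiC tool.

```python
import json


def learned_fdb_events(lines):
    """Yield (timestamp, mac, bridge_port_oid) for each SAI_FDB_EVENT_LEARNED."""
    for line in lines:
        parts = line.strip().split("|", 3)
        # Only notification records ("n") named "fdb_event" are of interest.
        if len(parts) < 4 or parts[1] != "n" or parts[2] != "fdb_event":
            continue
        payload = parts[3].rstrip("|")  # records end with a trailing "|"
        for event in json.loads(payload):
            if event["fdb_event"] != "SAI_FDB_EVENT_LEARNED":
                continue
            entry = json.loads(event["fdb_entry"])  # fdb_entry is nested JSON text
            attrs = {a["id"]: a["value"] for a in event["list"]}
            yield parts[0], entry["mac"], attrs.get("SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID")


# Two records copied from the excerpt above.
sample = [
    '2023-07-25.19:16:59.154724|n|port_state_change|[{"port_id":"oid:0x1000000000023","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    r'2023-07-25.19:16:59.154755|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000031\",\"mac\":\"52:54:00:A1:C3:B0\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000082"}]}]|',
]
events = list(learned_fdb_events(sample))
```

The bridge port OID extracted this way (0x3a000000000082) is the same object that the syslog reports as having reference count 1.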

OA explicitly turns off FDB learning before reboot, so this situation should not happen. Maybe this is some other scenario rather than a reboot? Maybe an unexpected reboot?

Please attach the full syslog and sairedis log from that day/event. Taking a look at your paste:

19:16:53.786185|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000
19:16:59.154041|n|port_state_change|[{"port_id":"oid:0x1000000000012","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
19:16:59.154623|n|port_state_change|[{"port_id":"oid:0x1000000000004","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
19:16:59.154724|n|port_state_change|[{"port_id":"oid:0x1000000000023","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
19:16:59.154755|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000031\",\"mac\":\"52:54:00:A1:C3:B0\",\"
  1. switch is created
  2. some ports come up
  3. an FDB event is learned - this means that FDB learning was not disabled by OA in the first place, which would suggest that the switch was not shut down cleanly or that OA crashed; that's why we need the syslog to confirm
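
The three-step sequence above can be reduced from a recording mechanically. The following is a rough sketch only: `milestones` is a hypothetical helper, and it relies on plain substring checks against the record layout shown in the paste rather than on any official sairedis parser.

```python
def milestones(lines):
    """Return (timestamp, kind) for switch create, port up, and FDB learned records."""
    found = []
    for line in lines:
        parts = line.strip().split("|", 3)
        if len(parts) < 3:
            continue
        ts, op, what = parts[0], parts[1], parts[2]
        if op == "c" and what.startswith("SAI_OBJECT_TYPE_SWITCH:"):
            found.append((ts, "switch_create"))
        elif op == "n" and what == "port_state_change" and "SAI_PORT_OPER_STATUS_UP" in line:
            found.append((ts, "port_up"))
        elif op == "n" and what == "fdb_event" and "SAI_FDB_EVENT_LEARNED" in line:
            found.append((ts, "fdb_learned"))
    return found


# Records taken from the thread, with the date prefix dropped as in the paste.
records = [
    "19:16:53.786185|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000",
    '19:16:59.154041|n|port_state_change|[{"port_id":"oid:0x1000000000012","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    '19:16:59.154623|n|port_state_change|[{"port_id":"oid:0x1000000000004","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    '19:16:59.154724|n|port_state_change|[{"port_id":"oid:0x1000000000023","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    r'19:16:59.154755|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000031\",\"mac\":\"52:54:00:A1:C3:B0\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000082"}]}]|',
]
kinds = [kind for _, kind in milestones(records)]
# True when at least one port came up before the first learned FDB entry.
port_up_before_learn = "port_up" in kinds[: kinds.index("fdb_learned")]
```

Running the same reduction over the full sairedis.rec, together with the syslog for the same window, would show whether the learn event always follows a port up notification.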

@prsunny maybe we need special-case handling in swss for this kind of behavior

Hi @kcudnik ,

Thanks for your comments.
Whatever learning mode OA configured, on a cold reboot the vendor SAI/SDK does not retain the previous setting.
The SDK initializes the switch from scratch when OA creates the switch.

Thanks

If it's a cold boot, then all ports should be down by default, but from the sairedis recording it seems you got port up notifications, so the ports were administratively UP, which should not be the case in a cold boot scenario.

Again, please attach the syslog around this boot, plus or minus a few extra minutes, so we can analyze what happened.

Agree with Kamil. Also, orchagent removes all ports from the bridge and from the default VLAN member association. MAC learning is not expected to happen in a normal cold boot.