sonic-net/sonic-sairedis

Warmboot fails for FDB attributes

tbgowda opened this issue · 3 comments

Hi,

I am noticing a behavior that I am struggling to explain. The test case is test_advanced_reboot.py::test_warm_reboot_sad. The scenario is generally as follows:

a. Flood DUT with FDB notifications (about 524 entries)
b. Execute warm-reboot
c. Watch warm reboot reconcile fail with the following error:

Aug 30 02:21:27.876402 mathilda-4 ERR syncd#syncd: :- asic_process_event: field: SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID, value: oid:0x3a000000000c4a
Aug 30 02:21:27.876402 mathilda-4 ERR syncd#syncd: :- asic_process_event: failed to execute api: set, key: SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x26000000000c3d","mac":"72:06:00:01:00:12","switch_id":"oid:0x21000000000000"}, status: SAI_STATUS_ITEM_NOT_FOUND
Aug 30 02:21:27.876421 mathilda-4 NOTICE syncd#syncd: :- executeOperationsOnAsic: asic apply took 0.177875 sec
Aug 30 02:21:27.876421 mathilda-4 ERR syncd#syncd: :- executeOperationsOnAsic: Error while executing asic operations, ASIC is in inconsistent state: :- asic_process_event: failed to execute api: set, key: SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x26000000000c3d","mac":"72:06:00:01:00:12","switch_id":"oid:0x21000000000000"}, status: SAI_STATUS_ITEM_NOT_FOUND

The issue however doesn't always happen. I have logs from both a successful case and a failure case. When I looked at the logs, I noticed that in case of failure, the SAI_OBJECT_TYPE_FDB_ENTRY is being set on the ASIC while in case of success no operation is done on the ASIC with regards to the aforementioned object.

Looking closer, it looks like the syncd decides to "set" because it finds the attribute SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID has changed in the temp view compared to the current view.

Now, the immediate question that comes to mind is, is it possible that SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID can change before warm-reboot has reconciled? I understand that syncd stops FDB learning on the ASIC before going down.

Also, if that is indeed a possibility, shouldn't there be a create for the attribute first before a set since the port ID changed?

The end result of the comparison logic is this:

Successful case:

Aug 30 02:04:30.671871 mathilda-4 NOTICE syncd#syncd: :- compareViews: ASIC operations to execute: 44

Failure case:

Aug 30 02:21:27.692765 mathilda-4 NOTICE syncd#syncd: :- compareViews: ASIC operations to execute: 533

I am attaching the logs for both the success and the failure case, any insight will really help. Thanks.

Version: azure/202012

failure.zip
success.zip

im not sure about this particular scenario, but on warm boot scenario FDBs should not change, since fdb learning is explicitly disabled on all ports, to avoid scenarios like mac move or mac age/learn on any port and avoid syncd confusion. you can explicitly see this in recording as "SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE=SAI_BRIDGE_PORT_FDB_LEARNING_MODE_DISABLE"

i took a look at failure scenarion and indeed disable happens for mac addres taht failed:

2021-08-30.02:17:54.195524|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000c3d\",\"mac\":\"72:06:00:01:00:12\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000c4c"}]}]|
2021-08-30.02:20:08.988478|s|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000c4c|SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE=SAI_BRIDGE_PORT_FDB_LEARNING_MODE_DISABLE

then later on that mac address becomes explicitly created:

2021-08-30.02:21:21.958780|c|SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x260000000010dc","mac":"72:06:00:01:00:12","switch_id":"oid:0x21000000000000"}|SAI_FDB_ENTRY_ATTR_TYPE=SAI_FDB_ENTRY_TYPE_DYNAMIC|SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID=oid:0x3a0000000010e9

but here may be the problem:

2021-08-30.02:21:20.106504|c|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a0000000010dd|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT|SAI_BRIDGE_PORT_ATTR_PORT_ID=oid:0x1000000000003|SAI_BRIDGE_PORT_ATTR_ADMIN_STATE=true|SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE=SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW
...
2021-08-30.02:21:20.159809|c|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a00000000110b|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT|SAI_BRIDGE_PORT_ATTR_PORT_ID=oid:0x100000000000b|SAI_BRIDGE_PORT_ATTR_ADMIN_STATE=true|SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE=SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW

mac learning is enabled before explicit create of MAC will happen, but this should not matter since all of those are in the initial segment

during apply view transition we see some errors from vendor sdk:

626 Aug 30 02:21:27.867369 mathilda-4 ERR syncd#syncd: SAI_API_FDB: New MAC installed:  40a6b736af0f
627 Aug 30 02:21:27.871695 mathilda-4 ERR syncd#syncd: SAI_API_FDB: New MAC entry, not yet installed:  720600010012

i have not idea what they mean and how to interpret them
but probably cause this issue:

Aug 30 02:21:27.876402 mathilda-4 ERR syncd#syncd: :- asic_process_event: failed to execute api: set, key: SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x26000000000c3d","mac":"72:06:00:01:00:12","switch_id":"oid:0x21000000000000"}, status: SAI_STATUS_ITEM_NOT_FOUND

which crash syncd, while wanting to change bridge port on existing fdb entry with mac address, but that entry should exists already since is has been learned before warm boot, so maybe this is vendor issue

and that problem you are seeing this sometimes, probably related to some timing issue, maybe asic sync, who knows :/

Thanks for your comments.

Perhaps, what I am trying to get at is, what can cause a set operation on the SAI_OBJECT_TYPE_FDB_ENTRY object during warmboot ?

And since it is not consistent, I am looking for any signs in syncd to indicate under what circumstances can a set operation is called for the FDB object. As you can see from the success log as well, there is no change in sequence of events.

From success log

2021-08-30.02:04:24.703692|c|SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x26000000000c3d","mac":"72:06:00:01:00:12","switch_id":"oid:0x21000000000000"}|SAI_FDB_ENTRY_ATTR_TYPE=SAI_FDB_ENTRY_TYPE_DYNAMIC|SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID=oid:0x3a000000000c4a
2021-08-30.02:04:22.935809|c|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000c3e|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT|SAI_BRIDGE_PORT_ATTR_PORT_ID=oid:0x1000000000003|SAI_BRIDGE_PORT_ATTR_ADMIN_STATE=true|SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE=SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW
...
2021-08-30.02:04:22.991256|c|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000c6c|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT|SAI_BRIDGE_PORT_ATTR_PORT_ID=oid:0x100000000000b|SAI_BRIDGE_PORT_ATTR_ADMIN_STATE=true|SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE=SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW

And as you say, this is all happening in the init view, which I presume means no commands are sent to the ASIC till apply transition.

Yet, in the success case, no FDB operations are done during the apply transition. I am unable to explain why there were FDB operations in failure case.

Perhaps, what I am trying to get at is, what can cause a set operation on the SAI_OBJECT_TYPE_FDB_ENTRY object during warmboot ?

libsai bug or hardware configuration bug

And as you say, this is all happening in the init view, which I presume means no commands are sent to the ASIC till apply transition.
yes no commands are sent, but fail is happening at the apply view stage, where diff changes are pushed to the asic