telstra/open-kilda

Revert is not functioning correctly after the endpoint swap failed due to links being down at the source switch

izadorozhna opened this issue · 0 comments

The issue was found due to the failed test "Unable to swap endpoints for two flows when one of them is inactive", please see its code for additional details.

Steps to reproduce:

Repeating the steps of "Unable to swap endpoints for two flows when one of them is inactive":

  1. Select 2 pairs of neighboring switches with the same destination switch (e.g. (Switch_1, Switch_2) and (Switch_3, Switch_2)).
  2. Create 2 flows with the same dst switch (e.g. Flow1 with Switch_1 -> Switch_2, and Flow2 with Switch_3 -> Switch_2).
  3. Break all ISLs of the Flow1 source switch (in our example it is Switch_1).
  4. Try to swap endpoints for these 2 flows.
  5. Catch the expected HttpServerErrorException: the flows cannot be swapped because the Flow1 source switch has its links down.
  6. Check the flows: their src and dst after the swap failed.
  7. Validate the flows: whether they have any discrepancies.
  8. Validate the involved switches: whether they have any discrepancies.

Expected result:

  • The switches and flows validation should have no discrepancies.
  • The flow1 and flow2 sources should not be the same.
  • Recommended expected behavior for this corner case, confirmed with @pzakatov: the system should not allow the swap endpoints operation if one of the flows is in DOWN status. If Flow1 is down, there is a reason for it, so we cannot be sure that the swap will succeed. It is better for the user to fix the flow and bring it to UP before the swap. Since a flow may also be down for other reasons (e.g. latency), we may need to forbid the swap only if the link is down AND status_info is “No path found. Switch … doesn’t have links with enough bandwidth”, but this should be discussed with Pavel Z.
  • If the flows are UP, the swap is started, and something goes wrong during the swap (e.g. the ISLs break, a switch goes down, etc.), a revert should happen. The revert must try to restore the flows to their initial configuration: the same src/dst they had before the swap.
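The two expectations above (refuse the swap when a flow is DOWN, and restore the original endpoints when a started swap fails) can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the open-kilda implementation: `Endpoint`, `Flow`, `swap_sources`, `SwapRejected`, `SwapFailed` and the `isolated_switches` parameter are all hypothetical names, and a switch with all ISLs down is modeled simply as membership in a set.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Endpoint:
    switch: str
    port: int
    vlan: int


@dataclass
class Flow:
    flow_id: str
    src: Endpoint
    dst: Endpoint
    status: str = "UP"  # hypothetical status values: "UP" or "DOWN"


class SwapRejected(Exception):
    """The swap was refused before anything was changed."""


class SwapFailed(Exception):
    """The swap started, failed, and the original endpoints were restored."""


def swap_sources(flows: Dict[str, Flow], first_id: str, second_id: str,
                 isolated_switches: Set[str]) -> None:
    """Swap the source endpoints of two flows (toy model, not Kilda code)."""
    pair = (flows[first_id], flows[second_id])

    # Guard: do not even start the swap if one of the flows is not UP.
    for flow in pair:
        if flow.status != "UP":
            raise SwapRejected(f"flow {flow.flow_id} is {flow.status}")

    # Remember the initial configuration so a failed swap can be reverted.
    snapshot = {flow.flow_id: (flow.src, flow.dst) for flow in pair}
    try:
        pair[0].src, pair[1].src = pair[1].src, pair[0].src
        for flow in pair:
            # Assumption: a flow cannot start on a switch with no live ISLs.
            if flow.src.switch in isolated_switches:
                raise RuntimeError(f"no path from {flow.src.switch}")
    except RuntimeError as error:
        # Revert: put both flows back exactly as they were before the swap.
        for flow in pair:
            flow.src, flow.dst = snapshot[flow.flow_id]
        raise SwapFailed(str(error)) from error
```

In this model a DOWN Flow1 leads to `SwapRejected` with nothing touched, while ISLs breaking mid-swap lead to `SwapFailed` with both flows back on their original sources, so the two flows never end up sharing a source endpoint.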

Actual result:

After the swap endpoints action is started, the flow-related rules are deleted from the switches. But when the swap operation fails (because flow2 cannot use switch_1 with its links down), the revert cannot be completed either, because Flow1 cannot be restored on switch_1 while its links are down.

In this case, the system comes to the incorrect state: both flow1 and flow2 finally have the same source (the same switch, VLAN, port, inner VLAN), for example:

flow1 = `11Apr154551_769_darkchocolate1676`
*00:00:00:00:00:00:00:03, port 9, vlan 3703* --> 00:00:00:00:00:00:00:02, port 8, vlan 1212

flow2 = `11Apr154551_843_quarkquinc6438`  // did not change, expectedly
*00:00:00:00:00:00:00:03, port 9, vlan 3703* --> 00:00:00:00:00:00:00:02, port 10, vlan 1088
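The broken state shown above can be detected with a simple uniqueness check on source endpoints. The following is a minimal sketch, assuming the flow sources have already been fetched into a plain dictionary; `find_shared_sources` is a hypothetical helper, and the endpoint tuple omits the inner VLAN because the example above does not list it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (switch id, port, vlan); the inner VLAN is left out in this sketch.
Endpoint = Tuple[str, int, int]


def find_shared_sources(flow_sources: Dict[str, Endpoint]) -> Dict[Endpoint, List[str]]:
    """Return every source endpoint that is used by more than one flow."""
    by_endpoint: Dict[Endpoint, List[str]] = defaultdict(list)
    for flow_id, endpoint in flow_sources.items():
        by_endpoint[endpoint].append(flow_id)
    return {ep: ids for ep, ids in by_endpoint.items() if len(ids) > 1}


# Sources of the two flows as reported after the failed swap.
sources = {
    "11Apr154551_769_darkchocolate1676": ("00:00:00:00:00:00:00:03", 9, 3703),
    "11Apr154551_843_quarkquinc6438": ("00:00:00:00:00:00:00:03", 9, 3703),
}
conflicts = find_shared_sources(sources)
```

With the values from this issue, `conflicts` contains a single entry for switch 00:00:00:00:00:00:00:03, port 9, vlan 3703, listing both flow IDs. A check like this could be asserted at the end of the test to catch the duplicate source.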

Also, the rule is missing on one of the involved switches:
Flow1 has the second rule installed ONLY on switch_2; it is absent from both switch_3 and switch_1:

===== ofsw1 =====
===== ofsw3 =====
===== ofsw2 =====
 cookie=**0x4000000000014764**, duration=79660.425s, table=4, n_packets=0, n_bytes=0, priority=24576,in_port=1,dl_vlan=213 actions=set_field:5308->vlan_vid,output:8
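To compare such dumps quickly, the cookies can be grouped per switch. This is a rough sketch that assumes the dump uses the `===== name =====` section headers and `cookie=0x...` fields shown above; `cookies_by_switch` is a hypothetical helper, and the sample string is the dump from this issue with the bold markers removed.

```python
import re
from typing import Dict, List

SECTION = re.compile(r"^=+\s*(\S+)\s*=+$")       # e.g. "===== ofsw1 ====="
COOKIE = re.compile(r"cookie=(0x[0-9a-fA-F]+)")  # e.g. "cookie=0x4000..."


def cookies_by_switch(dump: str) -> Dict[str, List[str]]:
    """Map each switch section in the dump to the rule cookies listed under it."""
    result: Dict[str, List[str]] = {}
    current = None
    for raw_line in dump.splitlines():
        line = raw_line.strip()
        header = SECTION.match(line)
        if header:
            current = header.group(1)
            result[current] = []
            continue
        cookie = COOKIE.search(line)
        if cookie and current is not None:
            result[current].append(cookie.group(1))
    return result


dump = """\
===== ofsw1 =====
===== ofsw3 =====
===== ofsw2 =====
 cookie=0x4000000000014764, duration=79660.425s, table=4, n_packets=0, n_bytes=0, priority=24576,in_port=1,dl_vlan=213 actions=set_field:5308->vlan_vid,output:8
"""
rules = cookies_by_switch(dump)
# Switches that do not carry the Flow1 cookie at all.
missing = [switch for switch, cookies in rules.items()
           if "0x4000000000014764" not in cookies]
```

For the dump above, `missing` lists `ofsw1` and `ofsw3`, which matches the observation that the rule exists only on switch_2.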

At this point, Flow2 has all its rules installed on switch_2 and switch_3. If you synchronize Flow1, its rules are fixed and installed on both switches, but then Flow2 has the missing rule. Synchronizing Flow2 reverses the situation, and synchronizing Flow1 again reverses it once more.

Switch_3 validation shows this missing rule as well.

Also, attaching my investigation of the real case, with its IDs, times, and logs.
Investigation_with_ids_and_time.pdf