threefoldtech/mycelium

Sporadic loss of routes

Closed this issue · 3 comments

It seems that routes can be lost sporadically for some time. One observed instance is when 2 nodes were connected to all 6 public nodes. At times, 1 node lost its selected route for the other node. This node is physically close to the 2 Belgian peers (< 1ms latency), in case that matters. In the other direction the subnet was not lost (or at least it is lost a lot less often).

Note that babel does have instances where valid routes exist which aren't immediately selected; usually a seqno request is sent in that case.

After cross-referencing logs on the public nodes, it seems no retractions are sent from there. So that means the route loss is in fact happening in the node itself.

Code to send a seqno request in this case was already added some time ago, though it has not been released yet.

After some more debugging, what seems to be happening is the following:

  • There is a selected route for some subnet, with a given metric.
  • The metric decreases as a result of an update. The route entry is updated with the new metric but no update is triggered.
  • A new update comes in for the route with a higher metric. The update is validated against the source table and found to be feasible.
  • After this, but before the route entry is updated in the route table, an update is sent out for the subnet. This uses the last recorded metric, which is lower than the announced metric in the source table. As a result, the metric in the source table decreases.
  • The route entry is updated as part of the update, and route selection is run.
  • Due to the update we sent in the meantime, the route entry is now unfeasible, and the route is lost.
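
To make the interleaving above concrete, here is a minimal single-threaded sketch that replays the steps in order. It is not the actual mycelium code: the type `FeasibilityDistance`, the functions `is_feasible`, `record_sent_update` and `replay_race`, and the metric values 100, 50 and 80 are all made up for illustration. The feasibility check is the babel rule in simplified form (seqno wrap-around and link cost are ignored).

```rust
/// Hypothetical source table entry for one subnet: the best (seqno, metric)
/// pair this node has advertised so far.
#[derive(Clone, Copy, Debug, PartialEq)]
struct FeasibilityDistance {
    seqno: u16,
    metric: u16,
}

/// Simplified babel feasibility check: a newer seqno is always feasible,
/// an equal seqno requires a strictly lower metric.
/// Assumption: seqno wrap-around is ignored to keep the sketch short.
fn is_feasible(fd: FeasibilityDistance, seqno: u16, metric: u16) -> bool {
    seqno > fd.seqno || (seqno == fd.seqno && metric < fd.metric)
}

/// Sending an update records the advertised metric in the source table
/// if it is better than what was recorded before.
fn record_sent_update(fd: &mut FeasibilityDistance, seqno: u16, metric: u16) {
    if is_feasible(*fd, seqno, metric) {
        *fd = FeasibilityDistance { seqno, metric };
    }
}

/// Replays the steps from the list above with illustrative numbers.
/// Returns (feasible when validated, feasible at route selection).
fn replay_race() -> (bool, bool) {
    let seqno = 1;
    let mut source_fd = FeasibilityDistance { seqno, metric: u16::MAX };

    // Step 1: a selected route exists and its metric has been advertised.
    let mut route_metric: u16 = 100;
    record_sent_update(&mut source_fd, seqno, route_metric);

    // Step 2: the metric decreases; the route entry changes, nothing is sent,
    // so the source table still holds 100.
    route_metric = 50;

    // Step 3: an update with a higher metric arrives and is validated.
    let incoming_metric: u16 = 80;
    let feasible_when_validated = is_feasible(source_fd, seqno, incoming_metric);

    // Step 4: before the route entry is written, an update goes out using
    // the last recorded metric (50), which lowers the source table entry.
    record_sent_update(&mut source_fd, seqno, route_metric);

    // Step 5: the route entry is finally updated and route selection runs.
    route_metric = incoming_metric;
    let feasible_at_selection = is_feasible(source_fd, seqno, route_metric);

    (feasible_when_validated, feasible_at_selection)
}

fn main() {
    let (validated, selected) = replay_race();
    println!("feasible when validated: {validated}, feasible at route selection: {selected}");
}
```

With these numbers the update passes validation (80 < 100) but fails at route selection (80 is not below 50), which matches the observed loss of the route.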

All in all, the problem is that the source table is not locked for the duration of the update call. Locking it for that long would, however, create additional contention on the source table. For now we will not change anything and will see how the additional seqno requests work out.