openshift/origin

HostSubnet handling is subject to multi-master race conditions


@liggitt points out that the timeline in #11628 (comment) is not actually guaranteed safe, because it's possible that the HostSubnets().Get() call at t=4 will hit an etcd that hasn't yet observed the deletion that occurred at t=3, which will then cause us to do the wrong thing. The fix for this is to keep a cache and act based on whether the HostSubnet is still in the cache, rather than whether it's known to the API server.
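
Roughly, the fix would look something like the sketch below. This is only an illustration with made-up type and function names, not the actual plugin code: the key point is that add/delete decisions consult a local cache built from the watch events we have already processed, rather than a HostSubnets().Get() call that may hit a lagging etcd.

```go
package sdnsketch

// hostSubnet is a stand-in for the real HostSubnet type.
type hostSubnet struct {
	Host   string // node name
	Subnet string // CIDR assigned to that node
}

type eventType int

const (
	added eventType = iota
	modified
	deleted
)

type hostSubnetWatcher struct {
	// Local cache of the HostSubnets we have acted on, keyed by node name.
	subnets map[string]hostSubnet
}

func (w *hostSubnetWatcher) handleEvent(ev eventType, hs hostSubnet) {
	switch ev {
	case added, modified:
		if old, ok := w.subnets[hs.Host]; ok && old.Subnet != hs.Subnet {
			// The node was deleted and re-created with a different subnet;
			// tear down the stale state before programming the new one.
			w.removeNode(old)
		}
		w.subnets[hs.Host] = hs
		w.addNode(hs)
	case deleted:
		// Decide based on what the cache says we set up, not on whether a
		// (possibly stale) read from the API server still shows the object.
		if old, ok := w.subnets[hs.Host]; ok {
			delete(w.subnets, hs.Host)
			w.removeNode(old)
		}
	}
}

func (w *hostSubnetWatcher) addNode(hs hostSubnet)    {} // program flows/routes for the node
func (w *hostSubnetWatcher) removeNode(hs hostSubnet) {} // tear down flows/routes for the node
```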

We may also be making this same mistake in other places.

In fact, EventQueue already keeps a cache internally; we just need to tweak RunEventQueue() to expose it, and then make use of it. (See also node.go:watchServices(), where we are manually keeping a cache that is redundant with the EventQueue's.)

dcbw commented

@danwinship instead of the internal cache, use NewEventQueueForStore()
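
For illustration, that has the same shape as the standard client-go pattern of handing the queue a "knownObjects" store and having the handlers read from that same store, which also removes the kind of redundant hand-maintained cache that watchServices() keeps today. The sketch below uses plain client-go types rather than origin's EventQueue wrapper, and the function names are hypothetical:

```go
package sdnsketch

import (
	kcache "k8s.io/client-go/tools/cache"
)

// In this sketch, NewDeltaFIFO plays the role NewEventQueueForStore() is
// being suggested for: the queue and the event handlers share one store,
// so handlers never need a fresh Get() against the API server.
func buildQueue() (kcache.Store, *kcache.DeltaFIFO) {
	store := kcache.NewStore(kcache.MetaNamespaceKeyFunc)
	queue := kcache.NewDeltaFIFO(kcache.MetaNamespaceKeyFunc, store)
	return store, queue
}

// handleHostSubnetDeleted (hypothetical) decides what to tear down based on
// the shared store, i.e. on what we actually programmed, not on whether a
// possibly stale API read still returns the object.
func handleHostSubnetDeleted(store kcache.Store, nodeName string) {
	if obj, exists, err := store.GetByKey(nodeName); err == nil && exists {
		_ = obj // tear down the state recorded for this HostSubnet
	}
}
```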

To clarify, this only causes problems if you delete a node and then immediately re-create it faster than the etcds can sync up (which is not something you'd really have any reason to do), and AFAIK the bug has been around since 1.0.

Tracking this with https://trello.com/c/B56OdzdS

This has been the behavior since 1.0 and is too risky to fix at the moment, so dropping the priority.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

oh, hello random old bug that probably explains the weird HostSubnet behavior we were seeing on Online (#18617).

cc @pravisankar because I think you were digging into this?

/lifecycle frozen

No idea if this is still an issue, but migrated to openshift/sdn#25