openshift/origin

HostSubnet handling is subject to multi-master race conditions


@liggitt points out that the timeline in #11628 (comment) is not actually guaranteed safe, because it's possible that the HostSubnets().Get() call at t=4 will hit an etcd that hasn't yet observed the deletion that occurred at t=3, which will then cause us to do the wrong thing. The fix for this is to keep a cache and act based on whether the HostSubnet is still in the cache, rather than whether it's known to the API server.
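
Roughly, the fix would look something like the sketch below. This is only an illustration with made-up type and function names, not the actual plugin code: the key point is that add/delete decisions consult a local cache built from the watch events we have already processed, rather than a HostSubnets().Get() call that may hit a lagging etcd.

```go
package sdnsketch

// hostSubnet is a stand-in for the real HostSubnet type.
type hostSubnet struct {
	Host   string // node name
	Subnet string // CIDR assigned to that node
}

type eventType int

const (
	added eventType = iota
	modified
	deleted
)

type hostSubnetWatcher struct {
	// Local cache of the HostSubnets we have acted on, keyed by node name.
	subnets map[string]hostSubnet
}

func (w *hostSubnetWatcher) handleEvent(ev eventType, hs hostSubnet) {
	switch ev {
	case added, modified:
		if old, ok := w.subnets[hs.Host]; ok && old.Subnet != hs.Subnet {
			// The node was deleted and re-created with a different subnet;
			// tear down the stale state before programming the new one.
			w.removeNode(old)
		}
		w.subnets[hs.Host] = hs
		w.addNode(hs)
	case deleted:
		// Decide based on what the cache says we set up, not on whether a
		// (possibly stale) read from the API server still shows the object.
		if old, ok := w.subnets[hs.Host]; ok {
			delete(w.subnets, hs.Host)
			w.removeNode(old)
		}
	}
}

func (w *hostSubnetWatcher) addNode(hs hostSubnet)    {} // program flows/routes for the node
func (w *hostSubnetWatcher) removeNode(hs hostSubnet) {} // tear down flows/routes for the node
```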

We may also be making this same mistake in other places.

In fact, EventQueue already keeps a cache internally; we just need to tweak RunEventQueue() to expose it, and then make use of it. (See also node.go:watchServices(), where we are manually keeping a cache that is redundant with the EventQueue's.)

dcbw commented

@danwinship instead of the internal cache, use NewEventQueueForStore()
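
For illustration, that has the same shape as the standard client-go pattern of handing the queue a "knownObjects" store and having the handlers read from that same store, which also removes the kind of redundant hand-maintained cache that watchServices() keeps today. The sketch below uses plain client-go types rather than origin's EventQueue wrapper, and the function names are hypothetical:

```go
package sdnsketch

import (
	kcache "k8s.io/client-go/tools/cache"
)

// In this sketch, NewDeltaFIFO plays the role NewEventQueueForStore() is
// being suggested for: the queue and the event handlers share one store,
// so handlers never need a fresh Get() against the API server.
func buildQueue() (kcache.Store, *kcache.DeltaFIFO) {
	store := kcache.NewStore(kcache.MetaNamespaceKeyFunc)
	queue := kcache.NewDeltaFIFO(kcache.MetaNamespaceKeyFunc, store)
	return store, queue
}

// handleHostSubnetDeleted (hypothetical) decides what to tear down based on
// the shared store, i.e. on what we actually programmed, not on whether a
// possibly stale API read still returns the object.
func handleHostSubnetDeleted(store kcache.Store, nodeName string) {
	if obj, exists, err := store.GetByKey(nodeName); err == nil && exists {
		_ = obj // tear down the state recorded for this HostSubnet
	}
}
```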

To clarify, this only causes problems if you delete a node and then immediately re-create it faster than the etcds can sync up (which is not something you'd really have any reason to do), and AFAIK the bug has been around since 1.0.

Tracking this with https://trello.com/c/B56OdzdS

This has been the behavior since 1.0 and is too risky to fix at the moment, so dropping the priority.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

oh, hello random old bug that probably explains the weird HostSubnet behavior we were seeing on Online (#18617).

cc @pravisankar because I think you were digging into this?

/lifecycle frozen

No idea if this is still an issue, but migrated to openshift/sdn#25