Encountered: "BUG! Probe cycle finished without running its full course"
jeromegn opened this issue · 6 comments
I figured I might bring this up since it was labeled "BUG".
I'm not sure what can cause this, I've been restarting a few instances and encountered it. It seems to have recovered from it.
From my logs, I see this sequence of things happened:
Notification::Rejoin
Notification::Active
and then:
ERROR foca: error handling timer: BUG! Probe cycle finished without running its full course
Oooh this is cool! The notification log lines are super useful, thanks a lot!
This error is meant to catch mistakes wiring timer events - I put it in place because I knew I'd mess something up when adding auto rejoin. I wonder if this happened because the rejoin happened right in the middle of a probe cycle.. don't think I've mapped this scenario to a test.
What I'm trying to catch there are incomplete cycles that may lead to the instance suspecting a member is down needlessly- If the cluster is working fine, this just means potential extra noise from false positive suspicions (a few extra messages being exchanged so that the suspected member can remain alive in the cluster), but if the whole cluster is experiencing issues, this added noise may become problematic.
I'll try to write a test that reproduces this as soon as free time comes- It's probably (hopefully 😅 ) a missing cleanup somewhere.
Landed the fix, thanks again for the report! There are more details in the PR/commits, but it's as I guessed previously: missing cleanup, could've made reasoning difficult during cluster issues but a healthy cluster would barely notice the added noise caused by this scenario.
I'll see if I can get the time to wrap #3 up before Monday EOD then ship it all together, otherwise I'll release a 0.1.1
just with the bugfix by then.
I've started seeing these again:
foca: error handling timer: BUG! Probe cycle finished without running its full course
However this time I don't think I have any interesting logs surrounding it! Mostly because I'm running this in production with a not-too-verbose log level.
Ugh this log line should've at least contained the event / timer_token used but I messed up defining the span on debug level :( Not quite sure what could cause this yet (my hunch is that this is again related to auto rejoin) - I'll save some time Monday to see if I can map it out properly. At the very least I'll fix the traces to include more info when warning
Thanks for the report! Reopening
Found a bug which I expect is what triggered this new occurrence. Pretty similar to the previous case: In the previous the issue was missing cleanup when the cluster declared the instance down; This one was missing a cleanup when the instance perceives the cluster as empty (A cluster-wide restart could've triggered it, for example)
One thing that I realized as I fixed is that while foca would keep working Ok after emitting this error, it would never resume probing. I'm fixing that too on #13. Will merge and cut a new release soon.
As for the tracing thing I mentioned in the previous comment: I thought the error log was coming from a trace, but it isn't. I've opened #14 to address this properly sometime in the future.
Thank you! Just upgraded, will let you know if I encounter something like this again :)