[BUG] OnionBalancedService periodically stops working, resulting in Onion Service not being found
conneryn opened this issue · 3 comments
Describe the bug
After running an OnionBalancedService
for a period of time, eventually the onion address is no longer resolvable.
Attempting to reach my onion service via the tor browser returns:
Onionsite Not Found
An error occurred during a connection to [redacted].onion.
Details: 0xF0 — The requested onion service descriptor can't be found on the hashring and therefore the service is not reachable by the client.
All "obb" pods appear to be working as expected, but the "daemon" pod potentially has deadlocked after a restart (see below for details). Deleting the daemon pod, and allowing it to be recreated/restarted resolves the issue.
To Reproduce
I have not figured out specific steps to reproduce this yet, other than waiting long enough. Although, I have a suspicion it happens when the pod restarts itself (I will continue to try and narrow down more specific repro steps).
Expected behavior
The onion service should always be available as long as the daemon and obb pods are running.
Additional information
Logs from the onionbalance
container of the daemon
pod:
time="2023-01-06T23:08:33Z" level=info msg="Listening for events"
time="2023-01-06T23:08:33Z" level=info msg="Running event controller"
time="2023-01-06T23:08:33Z" level=info msg="Starting controller"
W0106 23:08:33.805173 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
time="2023-01-06T23:08:33Z" level=info msg="Added onionBalancedService: ingress/tor-service"
time="2023-01-06T23:08:35Z" level=info msg="Getting key ingress/tor-service"
NOTE: the actual time is now 8 hours later, so onionbalance
has not logged any additional activity for quite some time (deadlock?).
On a successful launch, I see something along the lines of:
[...]
time="2023-01-07T10:50:04Z" level=info msg="Getting key ingress/tor-service"
time="2023-01-07T10:50:04Z" level=info msg="Updating onionbalance config for ingress/tor-service"
reloading onionbalance...
starting onionbalance...
2023-01-07 10:50:15,789 [WARNING]: Initializing onionbalance (version: 0.2.2)...
[...]
System (please complete the following information):
- Platform: amd64
- Version: v1.25.5-k3s1
Additional context
This does not happen often, but it has occurred 4 or 5 times over the past ~3 months. Anecdotally, I believe the last few times this has happened was after/around performing system upgrades on my cluster (ex: upgrading Kubernetes, or restarting nodes), where lots of pods are bouncing around.
The remedy is simple (manually restart the daemon pod), but an automated fix would be preferred. If actually resolving the deadlock (if that's truly the issue...) is overly complex to diagnose at this time, I wonder if an easier fix might be to simply add a probe that can properly detect this condition? Any thoughts on how I could do this?
Hi @conneryn! Thanks again for your detailed bug (& fix!). Gonna review it in a bit
PR #29 is good to go / merged. I'll prepare a new release after the build finishes
Sorry for the delay. I've just published the fix under 0.7.1 (helm chart tor-controller-0.1.7)