[BUG] OnionBalancedService periodically stops working, resulting in Onion Service not being found

Question

[BUG] OnionBalancedService periodically stops working, resulting in Onion Service not being found

conneryn opened this issue 2 years ago · 3 comments

Describe the bug
After running an OnionBalancedService for a period of time, eventually the onion address is no longer resolvable.

Attempting to reach my onion service via the tor browser returns:

Onionsite Not Found

An error occurred during a connection to [redacted].onion. 

Details: 0xF0 — The requested onion service descriptor can't be found on the hashring and therefore the service is not reachable by the client.

All "obb" pods appear to be working as expected, but the "daemon" pod potentially has deadlocked after a restart (see below for details). Deleting the daemon pod, and allowing it to be recreated/restarted resolves the issue.

To Reproduce
I have not figured out specific steps to reproduce this yet, other than waiting long enough. Although, I have a suspicion it happens when the pod restarts itself (I will continue to try and narrow down more specific repro steps).

Expected behavior
The onion service should always be available as long as the daemon and obb pods are running.

Additional information

Logs from the onionbalance container of the daemon pod:

time="2023-01-06T23:08:33Z" level=info msg="Listening for events"
time="2023-01-06T23:08:33Z" level=info msg="Running event controller"
time="2023-01-06T23:08:33Z" level=info msg="Starting controller"
W0106 23:08:33.805173       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
time="2023-01-06T23:08:33Z" level=info msg="Added onionBalancedService: ingress/tor-service"
time="2023-01-06T23:08:35Z" level=info msg="Getting key ingress/tor-service"

NOTE: the actual time is now 8 hours later, so onionbalance has not logged any additional activity for quite some time (deadlock?).

On a successful launch, I see something along the lines of:

[...]
time="2023-01-07T10:50:04Z" level=info msg="Getting key ingress/tor-service"
time="2023-01-07T10:50:04Z" level=info msg="Updating onionbalance config for ingress/tor-service"
reloading onionbalance...
starting onionbalance...
2023-01-07 10:50:15,789 [WARNING]: Initializing onionbalance (version: 0.2.2)...
[...]

System (please complete the following information):

Platform: amd64
Version: v1.25.5-k3s1

Additional context
This does not happen often, but it has occurred 4 or 5 times over the past ~3 months. Anecdotally, I believe the last few times this has happened was after/around performing system upgrades on my cluster (ex: upgrading Kubernetes, or restarting nodes), where lots of pods are bouncing around.

The remedy is simple (manually restart the daemon pod), but an automated fix would be preferred. If actually resolving the deadlock (if that's truly the issue...) is overly complex to diagnose at this time, I wonder if an easier fix might be to simply add a probe that can properly detect this condition? Any thoughts on how I could do this?

Answer 1 · 2023-01-09T12:40:07.000Z

Hi @conneryn! Thanks again for your detailed bug (& fix!). Gonna review it in a bit

Answer 2 · 2023-01-09T12:41:59.000Z

PR #29 is good to go / merged. I'll prepare a new release after the build finishes

Answer 3 · 2023-01-15T22:02:10.000Z

Sorry for the delay. I've just published the fix under 0.7.1 (helm chart tor-controller-0.1.7)