pambrose/prometheus-proxy

Agent cannot reconnect to proxy due to invalid agentId

danvatca opened this issue · 8 comments

I am running into an issue where, after some time, the proxy agent disconnects from the proxy and cannot reconnect.

This is what I get as logs from the agent container:

09:00:24.203 INFO  [Agent.kt:194] - Waited 3.00s to reconnect [Agent Unnamed-cf3b11648a1e]
09:00:24.203 INFO  [AgentGrpcService.kt:122] - Creating gRPC stubs [Agent Unnamed-cf3b11648a1e]
09:00:24.203 INFO  [GrpcDsl.kt:48] - Creating connection for gRPC server at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext [Agent Unnamed-cf3b11648a1e]
09:00:24.204 INFO  [Agent.kt:133] - Resetting agentId [Agent Unnamed-cf3b11648a1e]
09:00:24.204 INFO  [AgentGrpcService.kt:147] - Connecting to proxy at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext... [Agent Unnamed-cf3b11648a1e]
09:00:24.241 INFO  [AgentClientInterceptor.kt:51] - Assigned agentId: 5 to Agent{agentId=5, agentName=Unnamed-cf3b11648a1e, proxyHost=syneto-scrappy-metrics.my.syneto.eu:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump]}, metricsService=MetricsService{port=8083, path=/metrics}} [grpc-default-executor-377]
09:00:24.251 INFO  [AgentGrpcService.kt:149] - Connected to proxy at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext [Agent Unnamed-cf3b11648a1e]
09:00:24.278 INFO  [Agent.kt:181] - Disconnected from proxy at syneto-scrappy-metrics.my.syneto.eu:50051 after invalid response registerAgent() - Invalid agentId: 5 [Agent Unnamed-cf3b11648a1e]

Restarting the agent several times does not fix the issue, and the log is the same.

However, after I restart the proxy, the agent reconnects properly:

09:01:18.204 INFO  [AgentGrpcService.kt:147] - Connecting to proxy at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext... [Agent Unnamed-cf3b11648a1e]
09:01:19.385 INFO  [AgentClientInterceptor.kt:51] - Assigned agentId: 1 to Agent{agentId=1, agentName=Unnamed-cf3b11648a1e, proxyHost=syneto-scrappy-metrics.my.syneto.eu:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump]}, metricsService=MetricsService{port=8083, path=/metrics}} [grpc-default-executor-377]
09:01:19.651 INFO  [AgentGrpcService.kt:149] - Connected to proxy at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext [Agent Unnamed-cf3b11648a1e]
09:01:19.821 INFO  [AgentPathManager.kt:67] - Registered http://172.16.254.2:9272/metrics as /esxi-g6fn04700be2 [Agent Unnamed-cf3b11648a1e]
09:01:19.859 INFO  [AgentPathManager.kt:67] - Registered http://172.16.254.2:9100/metrics as /node-g6fn04700be2 [Agent Unnamed-cf3b11648a1e]
09:01:19.873 INFO  [Agent.kt:221] - Heartbeat scheduled to fire after 5.00s of inactivity [DefaultDispatcher-worker-2]

The configuration is vanilla, and I could not find anything pointing to where the problem might be.

I tried to reproduce the problem by restarting both the agent and the proxy multiple times, but the issue does not show up immediately after a proxy restart. The only way to reproduce it is to wait (I did not check exactly how long, but less than a day). Once the issue shows up, the only way to reconnect the agent is to restart the proxy.

Hmmm. I have not seen that behavior before. The connections are just gRPC calls, which are pretty straightforward. What is the error message when the agent fails to connect?

The gRPC connections/reconnections work fine. It is the proxy's response that the agentId is not valid that makes the agent drop the connection and retry.
In the meantime, I reconfigured the proxy to disable internal.staleAgentCheckEnabled.
The issue happened right after the agent apparently went missing for more than 15s (maxAgentInactivitySecs) and the AgentContextCleanupService called proxy.removeAgentContext.
Once the proxy evicts the agent after 15s of inactivity, every reconnect attempt from the agent gets an Invalid agentId response.
With the cleanup service disabled, 20h have passed without the issue showing up again.
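
To make sure we are talking about the same mechanism, here is a stripped-down model of what I think the eviction does (illustrative Kotlin only, not the actual proxy sources; the names and the 15s threshold are taken from this thread):

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Each connected agent gets a context that tracks when it was last heard from.
class AgentContextModel(val agentId: String) {
    private val lastActivityMillis = AtomicLong(System.currentTimeMillis())
    // Would be called on every request/heartbeat from the agent.
    fun markActivity() = lastActivityMillis.set(System.currentTimeMillis())
    fun inactivitySecs() = (System.currentTimeMillis() - lastActivityMillis.get()) / 1000
}

class ProxyModel(private val maxAgentInactivitySecs: Long = 15) {
    private val contexts = ConcurrentHashMap<String, AgentContextModel>()

    fun addContext(agentId: String) { contexts[agentId] = AgentContextModel(agentId) }

    // One cleanup pass: drop every context that has been quiet for too long.
    fun cleanupPass() =
        contexts.values
            .filter { it.inactivitySecs() > maxAgentInactivitySecs }
            .forEach { contexts.remove(it.agentId) }

    // Registration only succeeds while the context is still in the map;
    // after an eviction, the same agentId can only produce an invalid-id response.
    fun registerAgent(agentId: String): Boolean = contexts.containsKey(agentId)
}
```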

Hi Dan,

That is very helpful information. Let me look into it on my end this weekend.

Thanks,
Paul

I looked at the code and simulated an agent timing out and needing to reconnect, but it worked for me. After reconnecting, the agentId gets incremented and it rolls along. So I am not sure why you were seeing it getting stuck. I published a new release, 1.9.0. Can you give that a try and see whether you get better results? I also bumped the default maxAgentInactivitySecs to 60 seconds, but that will not address the issue you witnessed.

Thanks,
Paul

I will update to 1.9.0 and enable the stale agent check. Then I will try to reproduce it again and let you know how it goes.

Great. Thanks.

Hi Paul,

I updated both the proxy and the agent to 1.9.0 and enabled the inactivity check, but bumped maxAgentInactivitySecs to 600s so it can withstand longer system downtime.
This time around it behaved properly, even though this morning (4 hours ago, in Europe) I had an incident that left the proxy and agent disconnected for 1h30m (our ISP's network technician unplugged the wrong cable, and it took an hour to figure out what was wrong).
After the cable was reconnected, the agent reconnected properly (times are GMT):

07:46:59.262 INFO  [GrpcDsl.kt:46] - Creating connection for gRPC server at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext [Agent Unnamed-e79e28587aa4]
07:46:59.263 INFO  [Agent.kt:134] - Resetting agentId [Agent Unnamed-e79e28587aa4]
07:46:59.263 INFO  [AgentGrpcService.kt:148] - Connecting to proxy at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext... [Agent Unnamed-e79e28587aa4]
07:46:59.303 INFO  [AgentClientInterceptor.kt:51] - Assigned agentId: 1 to Agent{agentId=1, agentName=Unnamed-e79e28587aa4, proxyHost=syneto-scrappy-metrics.my.syneto.eu:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump]}, metricsService=MetricsService{port=8083, path=/metrics}} [grpc-default-executor-107]
07:46:59.305 INFO  [AgentGrpcService.kt:150] - Connected to proxy at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext [Agent Unnamed-e79e28587aa4]
07:46:59.331 INFO  [Agent.kt:182] - Disconnected from proxy at syneto-scrappy-metrics.my.syneto.eu:50051 after invalid response registerAgent() - Invalid agentId: 1 [Agent Unnamed-e79e28587aa4]
07:47:02.263 INFO  [Agent.kt:195] - Waited 3.00s to reconnect [Agent Unnamed-e79e28587aa4]
07:47:02.263 INFO  [AgentGrpcService.kt:122] - Creating gRPC stubs [Agent Unnamed-e79e28587aa4]
07:47:02.263 INFO  [GrpcDsl.kt:46] - Creating connection for gRPC server at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext [Agent Unnamed-e79e28587aa4]
07:47:02.263 INFO  [Agent.kt:134] - Resetting agentId [Agent Unnamed-e79e28587aa4]
07:47:02.263 INFO  [AgentGrpcService.kt:148] - Connecting to proxy at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext... [Agent Unnamed-e79e28587aa4]
07:47:02.351 INFO  [AgentClientInterceptor.kt:51] - Assigned agentId: 2 to Agent{agentId=2, agentName=Unnamed-e79e28587aa4, proxyHost=syneto-scrappy-metrics.my.syneto.eu:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump]}, metricsService=MetricsService{port=8083, path=/metrics}} [grpc-default-executor-107]
07:47:02.374 INFO  [AgentGrpcService.kt:150] - Connected to proxy at syneto-scrappy-metrics.my.syneto.eu:50051 using plaintext [Agent Unnamed-e79e28587aa4]
07:47:02.412 INFO  [AgentPathManager.kt:67] - Registered http://172.16.254.2:9272/metrics as /esxi-g6fn04700be2 [Agent Unnamed-e79e28587aa4]
07:47:02.433 INFO  [AgentPathManager.kt:67] - Registered http://172.16.254.2:9100/metrics as /node-g6fn04700be2 [Agent Unnamed-e79e28587aa4]
07:47:02.433 INFO  [Agent.kt:222] - Heartbeat scheduled to fire after 5.00s of inactivity [DefaultDispatcher-worker-2]

On the proxy side, the agent had indeed been evicted due to inactivity:

06:41:37.938 INFO  [AgentContextCleanupService.kt:49] - Evicting agent after 603s of inactivity AgentContext{agentId=1, launchId=MmGoxRkvMf2T9zY, consolidated=false, valid=true, agentName=Unnamed-e79e28587aa4, hostName=syneto-scrappy-metrics.my.syneto.eu, remoteAddr=Unknown, lastRequestDuration=603s} [AgentContextCleanupService]
06:41:37.941 INFO  [ProxyPathManager.kt:119] - Removing paths for agentId: 1 [AgentContextCleanupService]
06:41:37.944 INFO  [ProxyPathManager.kt:127] - Removed path /node-g6fn04700be2 for AgentContextInfo(consolidated=false, agentContexts=[AgentContext{agentId=1, launchId=MmGoxRkvMf2T9zY, consolidated=false, valid=true, agentName=Unnamed-e79e28587aa4, hostName=syneto-scrappy-metrics.my.syneto.eu, remoteAddr=Unknown, lastRequestDuration=603s}]) [AgentContextCleanupService]
06:41:37.944 INFO  [ProxyPathManager.kt:127] - Removed path /esxi-g6fn04700be2 for AgentContextInfo(consolidated=false, agentContexts=[AgentContext{agentId=1, launchId=MmGoxRkvMf2T9zY, consolidated=false, valid=true, agentName=Unnamed-e79e28587aa4, hostName=syneto-scrappy-metrics.my.syneto.eu, remoteAddr=Unknown, lastRequestDuration=603s}]) [AgentContextCleanupService]
06:41:38.055 INFO  [CallLogging.kt:165] - 503 Service Unavailable: GET - /esxi-g6fn04700be2 - localhost [DefaultDispatcher-worker-3]

There is one thing I noticed about this reconnection: after "Resetting agentId", the agent ID changed from 1 to 2 on the next connection. Previously, the agent kept trying to use the same agent ID, even after logging the "Resetting agentId" message.

Maybe the info above can help explain what happened, or at least why increasing maxAgentInactivitySecs does not trigger this behavior?
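
For my own understanding, this is roughly the reconnect flow I pieced together from the log lines above (a simplified, illustrative sketch, not the actual agent code; the lambda parameters stand in for the real gRPC calls):

```kotlin
// Illustrative reconnect loop: the cached id is thrown away on every attempt and a
// fresh one is taken from whatever the proxy assigns, which is why the id should
// move from 1 to 2 after an eviction instead of the agent re-presenting a stale id.
fun reconnectLoop(
    connect: () -> String,               // "Connecting to proxy..." -> "Assigned agentId: N"
    registerAgent: (String) -> Boolean,  // registerAgent(); false == "Invalid agentId: N"
    registerPaths: (String) -> Unit,     // "Registered http://... as /..."
    pauseSecs: Long = 3,                 // "Waited 3.00s to reconnect"
) {
    while (true) {
        // "Resetting agentId": any previously assigned id is discarded before connecting
        val assigned = connect()         // the proxy hands out the id during connection setup
        if (registerAgent(assigned)) {
            registerPaths(assigned)      // path registration, then heartbeats take over
            return
        }
        // "Disconnected from proxy ... after invalid response registerAgent()"
        Thread.sleep(pauseSecs * 1000)
    }
}
```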

Could this be some race in AgentContextCleanupService running as a GenericExecutionThreadService?
I am not really familiar with Java threading or how this wrapper works, but maybe this helps you in some way.
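
I am not sure this is how the wrapper actually behaves, but assuming GenericExecutionThreadService is a thin layer over Guava's AbstractExecutionThreadService (a guess based on the name; evictStaleContexts is a placeholder, not a real proxy method), my mental model of it is:

```kotlin
import com.google.common.util.concurrent.AbstractExecutionThreadService
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// A single background thread wakes up periodically and runs the eviction pass,
// concurrently with the gRPC threads that assign ids and register agents; that
// concurrency is where a race would have to live, if there is one.
class CleanupServiceModel(
    private val pauseSecs: Long,
    private val evictStaleContexts: () -> Unit,
) : AbstractExecutionThreadService() {
    private val stopLatch = CountDownLatch(1)

    override fun run() {
        while (isRunning) {
            evictStaleContexts()
            // Sleep between passes, but wake up immediately on shutdown.
            stopLatch.await(pauseSecs, TimeUnit.SECONDS)
        }
    }

    override fun triggerShutdown() = stopLatch.countDown()
}
```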

Thanks for looking into this!

Hi Dan,

Sorry for the delayed response. I am glad to hear it is working for you.

Yes, it should be retrying with a new agentId after an eviction. I am not sure why the agent ID was not getting incremented before. I will watch for it happening again.

Thanks for your patience.

Cheers,
Paul