Kubernetes healthcheck gives access denied
pbwur opened this issue · 12 comments
Hi,
I'm using version 2.0.0 of VerneMQ with the Helm chart. Unfortunately the pod in Kubernetes remains unhealthy. The error message is:
Readiness probe failed: Get "http://10.244.76.200:8888/health": dial tcp 10.244.76.200:8888: connect: connection refused
From within the pod, a curl against http://localhost:8888/health gives the expected response: {"status":"OK"}
It seems the IP address used is the problem.
Version 2.0.0-rc1 works fine, so I'm looking for the difference here.
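For reference, the failing probe corresponds to something like the following in the pod spec (a minimal sketch; the timing values are illustrative, not the chart's actual defaults). The kubelet dials the pod IP, never the loopback interface, which is why curl from inside the pod can succeed while the probe fails:

```yaml
# Sketch of a readiness probe against the VerneMQ health endpoint.
# The kubelet connects to the pod IP (10.244.76.200 above), so the
# HTTP listener must bind a non-loopback address for this to pass.
readinessProbe:
  httpGet:
    path: /health
    port: 8888
  initialDelaySeconds: 10   # illustrative values, not the chart defaults
  periodSeconds: 5
```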
@pbwur Thanks. The change must be in PR #380, #382, #384, or #385 then. What does the VerneMQ log tell you?
@ashtonian does this ring a bell to you, from the changes to add optional listeners?
I see nothing in the logging that points to a problem with the healthcheck. When the first pod (of 3) starts, there are a lot of log statements like:
vmq_swc_store:handle_info/2:555: Replica meta4: Can't initialize AE exchange due to no peer available
After a while VerneMQ exits. But before that I'm able to execute the healthcheck via http://localhost:8888/health successfully.
2024-05-02T08:53:35.711676+00:00 [debug] <0.292.0> vmq_swc_store:handle_info/2:555: Replica meta9: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:36.920696+00:00 [debug] <0.247.0> vmq_swc_store:handle_info/2:555: Replica meta4: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:37.434670+00:00 [debug] <0.238.0> vmq_swc_store:handle_info/2:555: Replica meta3: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:37.790656+00:00 [debug] <0.283.0> vmq_swc_store:handle_info/2:555: Replica meta8: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:38.419727+00:00 [debug] <0.301.0> vmq_swc_store:handle_info/2:555: Replica meta10: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:38.744695+00:00 [debug] <0.229.0> vmq_swc_store:handle_info/2:555: Replica meta2: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:40.392832+00:00 [debug] <0.265.0> vmq_swc_store:handle_info/2:555: Replica meta6: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:41.044680+00:00 [debug] <0.256.0> vmq_swc_store:handle_info/2:555: Replica meta5: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:41.835692+00:00 [debug] <0.220.0> vmq_swc_store:handle_info/2:555: Replica meta1: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.212673+00:00 [debug] <0.292.0> vmq_swc_store:handle_info/2:555: Replica meta9: Can't initialize AE exchange due to no peer available
I'm the only pod remaining. Not performing leave and/or state purge.
2024-05-02T08:53:42.465663+00:00 [debug] <0.274.0> vmq_swc_store:handle_info/2:555: Replica meta7: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.839671+00:00 [debug] <0.283.0> vmq_swc_store:handle_info/2:555: Replica meta8: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.944858+00:00 [notice] <0.44.0> application_controller:info_exited/3:2129: Application: vmq_server. Exited: stopped. Type: permanent.
2024-05-02T08:53:42.945013+00:00 [notice] <0.44.0> application_controller:info_exited/3:2129: Application: stdout_formatter. Exited: stopped. Type: permanent.
Those "Replica" logs are normal when you have debug log level on.
I guess Kubernetes terminates the pods here, since it cannot reach the health endpoint.
Probably need to add this back:
https://github.com/vernemq/docker-vernemq/pull/382/files#diff-95359b2d5d846bb085015977b06cde6a1facdc4ac553c06adb7d12e47aa39373L224-L226
May need to add the cluster port back as well.
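For illustration, the kind of listener setting involved would look roughly like this when set through the Docker env-var mechanism (a sketch, assuming the `listener.http.default` config key; the exact lines removed in #382 are in the diff linked above):

```yaml
# Sketch: expose the HTTP (health) listener on all interfaces so the
# kubelet can reach /health on the pod IP. Assumes the standard
# listener.http.default key of vernemq.conf.
env:
  - name: DOCKER_VERNEMQ_LISTENER__HTTP__DEFAULT
    value: "0.0.0.0:8888"
```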
@ashtonian Thanks, I reverted this here: #387
cc @pbwur let's see whether this resolves the issue. I can build new images tomorrow.
@pbwur I have now uploaded 2.0.0 images with a tentative fix to Docker Hub. Can you test one of those to check whether the Kubernetes health check works now?
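For testing, pinning the rebuilt tag in the Helm values should be enough (a sketch, assuming the chart's usual image block):

```yaml
# Helm values sketch; field names assume the standard image block.
image:
  repository: vernemq/vernemq
  tag: 2.0.0
  pullPolicy: Always   # re-pull, since the 2.0.0 tag was re-uploaded
```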
@ioolkos, it seems to work now: all 3 nodes of the cluster are starting. Thanks for the great response!
Although probably not related, I do get an error on the second node after the first node starts successfully. After I delete the PersistentVolumeClaim and start the cluster again, everything is OK.
This is part of the logging:
2024-05-03T09:00:36.793105+00:00 [info] <0.686.0> vmq_diversity_app:start/2:85: enable auth script for postgres "./share/lua/auth/postgres.lua"
Error! Failed to eval: vmq_server_cmd:node_join('VerneMQ@vernemq-0.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local')
Runtime terminating during boot ({{badkey,{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>}},[{erlang,map_get,[{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>},#{}],[{error_info,#{module=>erl_erts_errors}}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,history,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,230}]},{vmq_swc_peer_service,attempt_join,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_peer_service.erl"},{line,57}]},{vmq_server_cli,'-vmq_cluster_join_cmd/0-fun-1-',3,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,516}]},{clique_command,run,1,[{file,"/opt/vernemq/_build/default/
2024-05-03T09:00:37.798996+00:00 [error] <0.9.0>: Error in process <0.9.0> on node 'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local' with exit value:, {{badkey,{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>}},[{erlang,map_get,[{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>},#{}],[{error_info,#{module => erl_erts_errors}}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,history,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,230}]},{vmq_swc_peer_service,attempt_join,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_peer_service.erl"},{line,57}]},{vmq_server_cli,'-vmq_cluster_join_cmd/0-fun-1-',3,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,516}]},{clique_command,run,1,[{file,"/opt/vernemq/_build/default/lib/clique/src/clique_command.erl"},{line,87}]},{vmq_server_cli,command,2,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,45}]}]}
Crash dump is being written to: /erl_crash.dump...[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Stream closed EOF for mdtis-poc-mqtt/vernemq-1 (vernemq)
@pbwur I have the same issue as the one you describe in your last comment above: when restarting a pod of the VerneMQ StatefulSet, I get the exact same error; only after deleting the PVC (and underlying PV) and restarting the pod does it come up again. This issue started with 2.0.0; I did not have it with 1.13.
Did you, by any chance, resolve that issue on your side? If yes, I would be thankful to hear how :)
@pbwur @hsudbrock I'm currently looking into the PVC-related start error; it looks like some sort of regression.
The following setting in vernemq.conf should prevent it (by switching to the previous join logic):
vmq_swc.prevent_nonempty_join = off
Hi @hsudbrock and @ioolkos, apologies for the late response. That issue still happens here as well.
It would be great if that setting fixes it. What would be the correct environment variable to set it? DOCKER_VERNEMQ_VMQ_SWC__PREVENT__NONEMPTY__JOIN?
@pbwur DOCKER_VERNEMQ_VMQ_SWC__PREVENT_NONEMPTY_JOIN
(translate `.` to `__`, keep `_` as `_`)
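Applied to the StatefulSet manifest, that translation comes out as follows (a sketch; the surrounding pod spec is omitted):

```yaml
# vmq_swc.prevent_nonempty_join = off, expressed as a Docker env var:
# each "." in the config key becomes "__", "_" stays "_".
env:
  - name: DOCKER_VERNEMQ_VMQ_SWC__PREVENT_NONEMPTY_JOIN
    value: "off"
```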
Thanks for the hint and the PR fixing the issue! So far it looks good: disabling the nonempty-join check has resulted in no errors when restarting my VerneMQ cluster.