Issue in Getting Started with Active-monitor
Mahima-ai opened this issue · 6 comments
Hi Team,
I am new to Go and Kubernetes and am exploring Kubernetes monitoring tools. I came across the active-monitor tool and am facing a few issues while getting started with it. Any help in this regard will be highly appreciated. The details are as follows:
Versions:
OS: Linux 5.11.0-25-generic, 20.04.1-Ubuntu
Go: go1.13.8 linux/amd64
Kubectl client: v1.22.0
Kubectl Server: v1.21.2
minikube: v1.22.0
argo: v3.0.10
active-monitor: 0.6.0
I also tried with kubectl client version v1.19.0 and server version v1.20.0, but got the same warnings and errors.
Issue:
While following step 2 for both types of installation, a warning is raised regarding the CRD versions. The screenshot is attached below:
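(The screenshot is not available here, but warnings like this on a v1.21 server typically mean a manifest still uses a `v1beta1` API version that is deprecated and removed in Kubernetes v1.22. A minimal sketch of the usual fix, assuming the RBAC manifest is the source of the warning:)

```yaml
# Before (deprecated since v1.17, removed in Kubernetes v1.22):
#   apiVersion: rbac.authorization.k8s.io/v1beta1
# After (non-deprecated):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: activemonitor-controller-clusterrole  # name assumed from this thread
```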
While running the main.go file, the healthcheck starts but then errors are produced from some of the Go files. The error screenshot is below:
Please let me know how I can proceed further with running active-monitor.
@Mahima-ai thanks for trying it out and raising the issue about RBAC versioning. It should be addressed. Would you like to try contributing?
For the second part: it may sometimes take a while for the Docker image to download when you run locally, but once the image is available the workflow will run fine from the next iteration onwards. The error itself is valid: if your workflow does not complete in time, that run is marked as failed.
@RaviHari, thanks for the reply. Can you please point me to the resources/documentation for this project? I want to get an idea of its overall flow. As you suggested, I pulled the Docker image docker/whalesay:latest locally and also increased the timeout by specifying activeDeadlineSeconds: 300 in inlineHello.yaml under spec.workflow (i.e. spec.workflow.activeDeadlineSeconds), but it still displays the same error.
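For reference, this is roughly where the field was placed (a sketch of the relevant part of inlineHello.yaml; the field placement follows the description above, and the surrounding fields are assumptions, not the exact manifest):

```yaml
apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  name: inline-hello        # assumed from the log output in this thread
  namespace: health
spec:
  workflow:
    activeDeadlineSeconds: 300  # increased timeout, as tried above
    # ... remaining workflow fields unchanged ...
```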
Also, I am not clear about the error reported by zapLogger.
The controller itself is straightforward, and all the documentation is available here: https://github.com/keikoproj/active-monitor/blob/master/README.md
I also ran into the following error at first (besides the warnings on roles etc., as stated above) after upgrading to minikube 1.22.0. I had to update the RBAC permissions for healthchecks/status in the ClusterRole to fix the error below:
```
2021-08-23T14:26:59.799Z ERROR controllers.HealthCheck Error executing Workflow {"HealthCheck": "health/inline-hello", "error": "healthchecks.activemonitor.keikoproj.io \"inline-hello\" is forbidden: User \"system:serviceaccount:health:activemonitor-controller-sa\" cannot update resource \"healthchecks/status\" in API group \"activemonitor.keikoproj.io\" in the namespace \"health\""}
github.com/go-logr/zapr.(*zapLogger).Error
```
```yaml
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: activemonitor-controller-clusterrole
rules:
  - apiGroups:
      - "*"
      - activemonitor.keikoproj.io
    resources:
      - workflows
      - monitors
      - pods
      - events
      - healthchecks
      - healthchecks/status
      - serviceaccounts
      - clusterroles
      - clusterrolebindings
    verbs:
      - "*"
```
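For completeness, a ClusterRole only takes effect once it is bound to the controller's service account. A minimal sketch of such a binding (the service-account name and namespace are taken from the error message above; the binding name is assumed, and the non-deprecated `rbac.authorization.k8s.io/v1` API is used):

```yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: activemonitor-controller-clusterrolebinding  # assumed name
subjects:
  - kind: ServiceAccount
    name: activemonitor-controller-sa   # from the forbidden-user error above
    namespace: health
roleRef:
  kind: ClusterRole
  name: activemonitor-controller-clusterrole
  apiGroup: rbac.authorization.k8s.io
```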
I ran into this error when the workflow-controller pod was in a CrashLoopBackOff state:
```
4:27:01Z","templateName":"whalesay","templateScope":"local/inline-hello-rvtmd","type":"Pod"}},"phase":"Running","progress":"0/1","startedAt":"2021-08-23T14:27:01Z"}, "ok:": true}
2021-08-23T14:28:00.227Z ERROR controllers.HealthCheck iebackoff err message {"HealthCheck": "health/inline-hello", "error": "no more retries left"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).watchWorkflowReschedule
	/workspace/controllers/healthcheck_controller.go:565
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).processHealthCheck
	/workspace/controllers/healthcheck_controller.go:225
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).processOrRecoverHealthCheck
	/workspace/controllers/healthcheck_controller.go:150
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).Reconcile
	/workspace/controllers/healthcheck_controller.go:140
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:88
2021-08-23T14:28:00.227Z INFO controllers.HealthCheck status of workflow is updated to Failed {"HealthCheck": "health/inline-hello", "status:": {"message":"Failed","phase":"Failed"}}
2021-08-23T14:28:00.227Z INFO controllers.HealthCheck Workflow status {"HealthCheck": "health/inline-hello", "status": "Failed"}
2021-08-23T14:28:00.227Z INFO controllers.HealthCheck Remedy values: {"HealthCheck": "health/inline-hello", "RemedyTotalRuns:": 0}
2021-08-23T14:28:00.227Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"HealthCheck","namespace":"health","name":"inline-hello","uid":"c9f7a495-7d41-48ce-86fa-7679aac63cbc","apiVersion":"activemonitor.keikoproj.io/v1alpha1","resourceVersion":"1006781"}, "reason": "Warning", "message": "Workflow timed out"}
2021-08-23T14:28:00.227Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"HealthCheck","namespace":"health","name":"inline-hello","uid":"c9f7a495-7d41-48ce-86fa-7679aac63cbc","apiVersion":"activemonitor.keikoproj.io/v1alpha1","resourceVersion":"1006781"}, "reason": "Warning", "message": "Workflow status is Failed"}
2021-08-23T14:28:00.240Z INFO controllers.HealthCheck Rescheduled workflow for next run {"HealthCheck": "health/inline-hello", "namespace": "health", "name": "inline-hello-rvtmd"}
2021-08-23T14:28:00.240Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"HealthCheck","namespace":"health","name":"inline-hello","uid":"c9f7a495-7d41-48ce-86fa-7679aac63cbc","apiVersion":"activemonitor.keikoproj.io/v1alpha1","resourceVersion":"1007485"}, "reason": "Normal", "message": "Rescheduled workflow for next run"}
2021-08-23T14:28:00.247Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "healthcheck", "request": "health/inline-hello"}
2021-08-23T14:28:00.247Z INFO controllers.HealthCheck Starting HealthCheck reconcile for ... {"HealthCheck": "health/inline-hello"}
2021-08-23T14:28:00.247Z INFO controllers.HealthCheck Workflow already executed {"HealthCheck": "health/inline-hello", "finishedAtTime": 1629728880}
2021-08-23T14:28:00.252Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "healthcheck", "request": "health/inline-hello"}
```
```
» kh get po
NAME                                        READY   STATUS             RESTARTS   AGE
activemonitor-controller-6ddf9479d5-ht9wg   1/1     Running            0          5m59s
inline-hello-rvtmd                          0/2     Completed          0          71s
workflow-controller-69c95cddc-cn2g5         0/1     CrashLoopBackOff   765        66d
```
If you see this, please describe the pod and paste the info here. Once you delete the workflow-controller pod, it should be fine. I rarely see this problem, but it could be investigated if it happens consistently.
I too ran into the `Error executing Workflow` error for healthchecks/status that you mentioned. I edited the deploy/deploy-active-monitor.yaml file as you suggested, but this produced another error. The screenshot is below:
Regarding the CrashLoopBackOff status, I saw it yesterday but I haven't seen it today.
There seems to be an issue with the RBAC setup in your case, as the activemonitor-sa is not able to create the ClusterRole and RBAC binding for the healthcheck workflow.
@Mahima-ai can you join this Slack channel for further debugging of your issue: https://join.slack.com/t/orkaproj/shared_invite/enQtNzM3MTM1MDA5MjcxLWU4NTc5Nzc5OTVjOWI1NzA5NWNmNGExMDBmNjU2MDE1ZmZiOGU3ZGZkYmY0N2UzMzQ5MDEyMzQwY2UyMjdhOGI
Thanks @RaviHari for resolving the issue.