keikoproj/active-monitor

Issue in Getting Started with Active-monitor

Mahima-ai opened this issue · 6 comments

Hi Team,

I am new to Go and Kubernetes and have been exploring Kubernetes monitoring tools, which is how I came across active-monitor. I am facing a few issues while getting started with it, and any help would be highly appreciated. The details are as follows:

Versions:
OS: Linux 5.11.0-25-generic, 20.04.1-Ubuntu
Go: go1.13.8 linux/amd64
Kubectl client: v1.22.0
Kubectl Server: v1.21.2
minikube: v1.22.0
argo: v3.0.10
active-monitor: 0.6.0

I also tried with kubectl client v1.19.0 and server v1.20.0, but I get the same warnings and errors.

Issue:
While following step 2 for both types of installation, a warning is raised regarding the CRD versions. Screenshots are attached below:
[screenshots: err1, err2]

While running the main.go file, the healthcheck starts but errors are reported from some of the Go files. The error screenshot is below:
[screenshot: Err 3]

Please let me know how I can proceed further to run active-monitor.

@Mahima-ai thanks for trying it out and raising the issue about the RBAC versioning. It should be addressed. Would you like to try contributing?
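
If the warnings you are seeing are the API deprecation notices, the cause is most likely that the install manifests still use the v1beta1 CRD/RBAC APIs, which Kubernetes 1.21 warns about and 1.22 removes. On our side the fix would be roughly a version bump in those manifests; for the RBAC objects it is a one-line change, sketched below (the CRD manifest needs a similar but slightly larger update):

apiVersion: rbac.authorization.k8s.io/v1   # previously rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: activemonitor-controller-clusterrole
# rules unchanged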

For the second part: it may sometimes take more time for the Docker image to download when you run locally, but once the image is available the workflow should run fine from the next iteration onwards. The error is valid as such: if your workflow does not complete in time, that run is marked as failed.

@RaviHari, thanks for the reply. Could you please point me to the resources/documentation for this project? I want to get an idea of its overall flow. As you suggested, I pulled the Docker image docker/whalesay:latest locally and also increased the timeout by specifying activeDeadlineSeconds: 300 in inlineHello.yaml under spec.workflow (i.e. spec.workflow.activeDeadlineSeconds), but it still shows the same error.
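
For reference, here is roughly what I ended up with in inlineHello.yaml (only the fields relevant to the change are shown; the rest of the sample manifest is unchanged, and I am not certain this is the right place for the field):

apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  name: inline-hello
  namespace: health
spec:
  workflow:
    activeDeadlineSeconds: 300   # timeout increased so the image pull has time to finish
    # ... remaining workflow fields left exactly as in the shipped example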

Also, I am not clear about the error reported by the zapLogger.

The controller itself is straightforward, and all the documentation is here: https://github.com/keikoproj/active-monitor/blob/master/README.md

I also ran into the following error at first (apart from the warnings on roles etc., as stated above) after upgrading to minikube 1.22.0. I had to update the RBAC permissions for healthchecks/status in the ClusterRole to fix the error below.

2021-08-23T14:26:59.799Z ERROR controllers.HealthCheck Error executing Workflow {"HealthCheck": "health/inline-hello", "error": "healthchecks.activemonitor.keikoproj.io "inline-hello" is forbidden: User "system:serviceaccount:health:activemonitor-controller-sa" cannot update resource "healthchecks/status" in API group "activemonitor.keikoproj.io" in the namespace "health""}
github.com/go-logr/zapr.(*zapLogger).Error

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: activemonitor-controller-clusterrole
rules:
- apiGroups:
  - "*"
  - activemonitor.keikoproj.io
  resources:
  - workflows
  - monitors
  - pods
  - events
  - healthchecks
  - healthchecks/status
  - serviceaccounts
  - clusterroles
  - clusterrolebindings
  verbs:
  - "*"

I ran into this error when the workflow-controller pod was in a CrashLoopBackOff state:

4:27:01Z","templateName":"whalesay","templateScope":"local/inline-hello-rvtmd","type":"Pod"}},"phase":"Running","progress":"0/1","startedAt":"2021-08-23T14:27:01Z"}, "ok:": true}
2021-08-23T14:28:00.227Z	ERROR	controllers.HealthCheck	iebackoff err message	{"HealthCheck": "health/inline-hello", "error": "no more retries left"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).watchWorkflowReschedule
	/workspace/controllers/healthcheck_controller.go:565
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).processHealthCheck
	/workspace/controllers/healthcheck_controller.go:225
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).processOrRecoverHealthCheck
	/workspace/controllers/healthcheck_controller.go:150
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).Reconcile
	/workspace/controllers/healthcheck_controller.go:140
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:88
2021-08-23T14:28:00.227Z	INFO	controllers.HealthCheck	status of workflow is updated to Failed	{"HealthCheck": "health/inline-hello", "status:": {"message":"Failed","phase":"Failed"}}
2021-08-23T14:28:00.227Z	INFO	controllers.HealthCheck	Workflow status	{"HealthCheck": "health/inline-hello", "status": "Failed"}
2021-08-23T14:28:00.227Z	INFO	controllers.HealthCheck	Remedy values:	{"HealthCheck": "health/inline-hello", "RemedyTotalRuns:": 0}
2021-08-23T14:28:00.227Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"HealthCheck","namespace":"health","name":"inline-hello","uid":"c9f7a495-7d41-48ce-86fa-7679aac63cbc","apiVersion":"activemonitor.keikoproj.io/v1alpha1","resourceVersion":"1006781"}, "reason": "Warning", "message": "Workflow timed out"}
2021-08-23T14:28:00.227Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"HealthCheck","namespace":"health","name":"inline-hello","uid":"c9f7a495-7d41-48ce-86fa-7679aac63cbc","apiVersion":"activemonitor.keikoproj.io/v1alpha1","resourceVersion":"1006781"}, "reason": "Warning", "message": "Workflow status is Failed"}
2021-08-23T14:28:00.240Z	INFO	controllers.HealthCheck	Rescheduled workflow for next run	{"HealthCheck": "health/inline-hello", "namespace": "health", "name": "inline-hello-rvtmd"}
2021-08-23T14:28:00.240Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"HealthCheck","namespace":"health","name":"inline-hello","uid":"c9f7a495-7d41-48ce-86fa-7679aac63cbc","apiVersion":"activemonitor.keikoproj.io/v1alpha1","resourceVersion":"1007485"}, "reason": "Normal", "message": "Rescheduled workflow for next run"}
2021-08-23T14:28:00.247Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "healthcheck", "request": "health/inline-hello"}
2021-08-23T14:28:00.247Z	INFO	controllers.HealthCheck	Starting HealthCheck reconcile for ...	{"HealthCheck": "health/inline-hello"}
2021-08-23T14:28:00.247Z	INFO	controllers.HealthCheck	Workflow already executed	{"HealthCheck": "health/inline-hello", "finishedAtTime": 1629728880}
2021-08-23T14:28:00.252Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "healthcheck", "request": "health/inline-hello"}

» kh get po                                                                                                           
NAME                                        READY   STATUS             RESTARTS   AGE
activemonitor-controller-6ddf9479d5-ht9wg   1/1     Running            0          5m59s
inline-hello-rvtmd                          0/2     Completed          0          71s
workflow-controller-69c95cddc-cn2g5         0/1     CrashLoopBackOff   765        66d

If you see this, please describe the pod and paste the info here. Once you delete the workflow-controller pod, it should be fine. I rarely see this as a problem; however, it could be investigated if it is happening consistently.

I too ran into the 'Error executing Workflow' error for healthchecks/status that you mentioned. I edited the deploy/deploy-active-monitor.yaml file as you suggested, but this produced another error. The screenshot is below:
[screenshot: err1]

Regarding the CrashLoopBackOff status, I saw it yesterday but have not seen it today.

There seems to be an issue with the RBAC setup in your case, as the activemonitor-sa is not able to create the ClusterRole and RBAC resources for the healthcheck workflow.

@Mahima-ai, can you connect on this Slack channel so we can debug your issue further: https://join.slack.com/t/orkaproj/shared_invite/enQtNzM3MTM1MDA5MjcxLWU4NTc5Nzc5OTVjOWI1NzA5NWNmNGExMDBmNjU2MDE1ZmZiOGU3ZGZkYmY0N2UzMzQ5MDEyMzQwY2UyMjdhOGI

Thanks @RaviHari for resolving the issue.