🐛 [bug] - PetBattleMongoDBDiskUsage alert is not reported

Question

Opened this issue 2 years ago · 9 comments

rmarting commented 2 years ago

📝 Description

Following the instructions of the Creating Alerts exercise, the PetBattleMongoDBDiskUsage alert is not reported.

The status of the PVC after running the command `` is:

But the alert is not reported.

Also, when we force the volume to fill up completely, the alert is still not reported:

sh-4.2$ dd if=/dev/urandom of=/var/lib/mongodb/data/rando-calrissian bs=10M count=80
dd: error writing '/var/lib/mongodb/data/rando-calrissian': No space left on device
66+0 records in
65+0 records out
688480256 bytes (688 MB) copied, 6.29632 s, 109 MB/s

🚶 Steps to reproduce

Followed the instructions of this exercise.

🧙‍♀️ Suggested solution

... if applicable

Answer 1 · 2022-07-01T22:33:15.000Z

@rmarting - i tried this out in my cluster and it seems ok. i dropped the alert to 40% to check

metrics being reported ok

and the firing rule

any chance you could debug a little further? see what may be going on in your cluster, starting with whether the metrics are being reported at all?
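One way to check whether the metrics are being reported is to run an instant query against the Prometheus HTTP API (`/api/v1/query`) and compute the usage per PVC from the JSON response. The sketch below only parses such responses; it assumes the standard kubelet volume metrics and the `persistentvolumeclaim` label, which may not be what the exercise's alert rule actually uses.

```python
import json

# Assumed metric names: the standard kubelet volume metrics. The alert rule in
# the exercise may be written against a different expression.
USED = "kubelet_volume_stats_used_bytes"
CAPACITY = "kubelet_volume_stats_capacity_bytes"


def samples_by_pvc(response_text):
    """Map PVC name -> sample value from a Prometheus instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    values = {}
    for series in payload["data"]["result"]:
        pvc = series["metric"].get("persistentvolumeclaim", "<unknown>")
        # Each instant-vector sample is [timestamp, "value-as-string"].
        values[pvc] = float(series["value"][1])
    return values


def usage_percent(used, capacity):
    """Return {pvc: percent used} for PVCs present in both result sets."""
    return {
        pvc: 100.0 * used[pvc] / capacity[pvc]
        for pvc in used
        if pvc in capacity and capacity[pvc] > 0
    }
```

Feed it the response bodies of the `USED` and `CAPACITY` queries. An empty result from `samples_by_pvc` would suggest the metric is not being scraped for that namespace at all, which is a different problem from the rule not firing.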

Answer 2 · 2022-07-01T22:53:17.000Z

also .. i have these rules in my project after running thru all exercises (the one above is part of "pb-api-alerts")

$ oc get prometheusrule -n ateam-test
NAME               AGE
blue-pet-battle    134d
green-pet-battle   134d
keycloak           38h
pb-api-alerts      134d
pet-battle         134d
pet-battle-b       134d

Answer 3 · 2022-07-01T22:56:26.000Z

and i logged in as a student user .. just to check i can still see alert OK (i was checking as cluster admin above). looks ok

Answer 4 · 2022-07-08T09:13:03.000Z

During the enablement in Frankfurt my teammates encountered exactly the same issue, which was also confirmed by @rmarting.

Answer 5 · 2022-07-11T10:47:35.000Z

Reproduced in a new cluster with the following steps:

Used teamsters to deploy technical exercises 1+2 for a new user and set up the environment. Created the CRW workspace and environment variables, then moved on to the Alerting exercise.
Added new rules to the pet-battle-api Helm chart (prometheusrule.yaml).
Bumped the Chart.yaml file to a new version (1.3.2), incremented from the version already deployed in Nexus as part of technical exercise 2.
The pipeline updates the Helm chart 1.3.1 tgz file (overwriting it) instead of creating the new version. IMHO the pipeline is using the version from the pom.xml file (maven task) instead of the Helm chart version to create the new tgz file in Nexus.
ArgoCD does not synchronize the new version and keeps using the previous one (1.3.1) from tech-exercise/pet-battle/test/values.yaml.

Workaround: Changing the pom.xml file to the new version 1.3.2 fixed the issue. A new Helm Chart tgz file is uploaded, the pet-battle-api version in tech-exercise/pet-battle/test/values.yaml is updated, and everything is deployed in OpenShift. So the alert is shown successfully.

Could you double-check my findings? Maybe we need to extend the instructions so that the Helm chart version and the app version are aligned and ArgoCD deploys successfully, or maybe we need to review which file (pom.xml or Chart.yaml) the Tekton pipeline should take the version from.
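To illustrate the mismatch described above, here is a minimal sketch of a check that compares the project version in pom.xml with the chart version in Chart.yaml. It is a hypothetical helper, not part of the existing pipeline, and it reads Chart.yaml with a simple regular expression rather than a YAML parser.

```python
import re
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def pom_version(pom_text):
    """Return the project's own <version>, not a parent's or a dependency's."""
    root = ET.fromstring(pom_text)
    # find() only looks at direct children of <project>.
    node = root.find("m:version", POM_NS)
    if node is None:
        node = root.find("version")  # POM written without the Maven namespace
    if node is None or not node.text:
        raise ValueError("no project-level <version> in pom.xml")
    return node.text.strip()


def chart_version(chart_text):
    """Return the top-level `version:` field of a Chart.yaml document."""
    # Anchoring at column 0 skips `appVersion:` and indented dependency versions.
    match = re.search(r"^version:\s*['\"]?([^'\"\s#]+)", chart_text, re.MULTILINE)
    if match is None:
        raise ValueError("no top-level version in Chart.yaml")
    return match.group(1)


def versions_aligned(pom_text, chart_text):
    """True when pom.xml and Chart.yaml declare the same version."""
    return pom_version(pom_text) == chart_version(chart_text)
```

With pom.xml still at 1.3.1 and Chart.yaml bumped to 1.3.2, `versions_aligned` returns `False`, which is the situation that led to the 1.3.1 tgz being overwritten.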

Answer 6 · 2022-07-11T21:54:12.000Z

Ahh ! that makes sense @rmarting .. i see what is going on now.

OK, the section in 4.2.4 is wrong - i have fixed this now. PTAL at this commit:

6437d57

The history here is this:

at some time in the past we allowed users to update app and chart version separately (helm chart default behaviour)
we changed the pipeline in pet-battle-api to match pet-battle UX where app version was solely controlled by:
VERSION - file for node.js
pom.xml - file for java
this matched what "developers" would do .. i.e. control it from source and not worry too much about yaml files ! and let the pipeline deal with it
the commit for this was here:
255556e
however it seems we did not go back and update all the right bits (monitoring) for this change.

I think there is still a question in my mind as to why argocd does not sync the new chart (same version) .. we may find that this is just the difference in argo between a sync, a refresh and a hard refresh. i.e. hitting the sync button may have solved this as the chart of the same version is timestamped in nexus .. so you will always get the updated chart as we push it there in the pipeline. need to test this and take a look at where it is getting "stale"

Answer 7 · 2022-07-12T08:39:49.000Z

Great @eformat !!! Everything makes sense now!

Reviewing the new content, I am wondering whether 1.3.1 is the best version to use, as technical exercise 2 already defines that version. If we are triggering a new version, then 1.3.2 fits better, or any other version different from the current one in the pom.xml file. If you update the content to that version, then LGTM to go ahead and close this issue.

On the other hand, why does ArgoCD not sync the new chart? It could be something related to the different sync options. However, IMHO if we want to deploy a new chart for an application, it is better to use a new Helm chart version and not to overwrite the existing one in Nexus. I don't like the idea of overwriting artifact versions in Nexus at all (except for SNAPSHOTS), as you can't control who has already downloaded them. As Helm charts have no snapshot versions, the best approach is to trigger a new pipeline run with a new version and then deploy it from ArgoCD. (my two cents).

Answer 8 · 2022-07-12T10:36:23.000Z

If the version of the chart does not change, argocd has it cached. Doing a refresh on argocd clears the cache, hence it updates the k8s resources after a refresh. This is why we always bump the version on main (even if it's just a minor). Perhaps an automated way to ensure this doesn't happen would be to append the git sha to the version (if helm supports that).
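As a rough sketch of the git sha idea: SemVer 2.0.0 allows build metadata after a `+`, so a pipeline step could derive a per-commit version string like this. The function name is hypothetical, and whether Helm, Nexus and argocd would treat two such versions as distinct artifacts has not been verified here.

```python
import re

# Simplified pattern: only a plain MAJOR.MINOR.PATCH base version is accepted.
_BASE = re.compile(r"^\d+\.\d+\.\d+$")
_SHA = re.compile(r"^[0-9a-fA-F]{7,40}$")


def version_with_sha(base_version, git_sha, length=7):
    """Append a shortened git sha to a version as SemVer build metadata."""
    if not _BASE.match(base_version):
        raise ValueError("expected MAJOR.MINOR.PATCH, got %r" % base_version)
    if not _SHA.match(git_sha):
        raise ValueError("expected a 7-40 character hex sha, got %r" % git_sha)
    return "%s+%s" % (base_version, git_sha[:length].lower())
```

For example, `version_with_sha("1.3.2", "0123abc4567")` returns `"1.3.2+0123abc"`. Note that SemVer says build metadata is ignored when comparing precedence, so tools that order versions may still see `1.3.2+0123abc` and `1.3.2` as equal. A pre-release suffix (`1.3.2-0123abc`) does take part in ordering, but sorts below `1.3.2`.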

Answer 9 · 2022-07-12T10:37:46.000Z

@eformat - if you remember, we encountered this issue when writing the book. We were changing values files but not updating the version, and argocd was not seeing the change. I think the way around this that we implemented was changing the labels on the resources to contain a value from the values file or something like that.