[bug] - PetBattleMongoDBDiskUsage alert is not reported
Opened this issue · 9 comments
Description
Following the instructions of the Creating Alerts exercise, the PetBattleMongoDBDiskUsage
alert is not reported.
The status of the PVC, after executing the command ``, is:
But the alert is not reported.
Even when we force the volume to fill up completely, the alert is still not reported:
sh-4.2$ dd if=/dev/urandom of=/var/lib/mongodb/data/rando-calrissian bs=10M count=80
dd: error writing '/var/lib/mongodb/data/rando-calrissian': No space left on device
66+0 records in
65+0 records out
688480256 bytes (688 MB) copied, 6.29632 s, 109 MB/s
Steps to reproduce
Followed the instructions of this exercise.
Suggested solution
... if applicable
@rmarting - i tried this out in my cluster and it seems ok. i dropped the alert to 40% to check
metrics being reported ok
and the firing rule
any chance you could debug a little further ? see what may be going on in your cluster - see if metrics reporting first ?
also .. i have these rules in my project after running thru all exercises (the one above is part of "pb-api-alerts")
$ oc get prometheusrule -n ateam-test
NAME AGE
blue-pet-battle 134d
green-pet-battle 134d
keycloak 38h
pb-api-alerts 134d
pet-battle 134d
pet-battle-b 134d
During the enablement in Frankfurt my teammates encountered exactly the same issue, which was also confirmed by @rmarting.
Reproduced in a new cluster following these steps:
- Used teamsters to deploy technical exercises 1+2 for a new user and set up the environment. Created CRW and the environment variables, then moved to the Alerting exercise.
- Added new rules in the pet-battle-api helm chart (`prometheusrule.yaml`).
- Bumped the `Chart.yaml` file to a new version (1.3.2), incremented from the version already deployed in Nexus as part of the activities in technical exercise 2.
- The pipeline updates the Helm Chart 1.3.1 `tgz` file (overwriting it) instead of creating the new version. IMHO the pipeline is using the version from the `pom.xml` file (maven task) instead of the Helm Chart version to create the new `tgz` file in Nexus.
- ArgoCD is not synchronizing the new version and it is using the previous one (1.3.1) in `tech-exercise/pet-battle/test/values.yaml`.
Workaround: Changing the `pom.xml` file to the new version 1.3.2 fixed the issue. A new Helm Chart `tgz` file is uploaded, the `pet-battle-api` version in `tech-exercise/pet-battle/test/values.yaml` is updated, and everything is deployed in OpenShift. So the alert is shown successfully.
Could you double-check my findings? Maybe we need to extend the instructions to align the helm chart version and the app version so that it deploys successfully from ArgoCD, or maybe we need to review the Tekton pipeline so that it takes the right version from the right file (`pom.xml` or `Chart.yaml`).
Ahh ! that makes sense @rmarting .. i see what is going on now.
OK, the section in 4.2.4 is wrong - i have fixed this now. PTAL at this commit:
The history here is this:
- at some time in the past we allowed users to update app and chart version separately (helm chart default behaviour)
- we changed the pipeline in pet-battle-api to match pet-battle UX where app version was solely controlled by:
  - `VERSION` file for node.js
  - `pom.xml` file for java
- this matched what "developers" would do .. i.e. control it from source and not worry too much about yaml files ! and let the pipeline deal with it
- the commit for this was here: 255556e
- however it seems we did not go back and update all the right bits (monitoring) for this change.
I think there is still a question in my mind as to why argocd does not sync the new chart (same version) .. we may find that this is just the difference in argo between a sync, a refresh and a hard refresh. i.e. hitting the sync button may have solved this as the chart of the same version is timestamped in nexus .. so you will always get the updated chart as we push it there in the pipeline. need to test this and take a look at where it is getting "stale"
Great @eformat !!! Everything makes sense now!
Reviewing the new content, I am wondering if the `1.3.1` version is the best one, as technical exercise 2 defines that version. If we are triggering a new version, then the `1.3.2` version fits better, or any other version different from the current one in the `pom.xml` file. If you update the content to that version, then LGTM to go ahead and close this issue.
On the other hand, why does ArgoCD not sync the new chart? It could be something related to the different sync options. However, IMHO if we want to deploy a new chart for an application, it is better to use a new Helm Chart version and not to overwrite the existing one in Nexus. I don't like the idea of overwriting artifact versions in Nexus at all (except for SNAPSHOTS), as you can't control who has already downloaded them. As Helm Charts don't have snapshot versions, the best approach is to trigger a new pipeline with a new version and then deploy it from ArgoCD. (my two cents).
If the version of the chart does not change, argocd has it cached; doing a refresh on argocd clears the cache, hence it updates the k8s resources after a refresh. This is why we always bump the version on main (even if it's just a minor). Perhaps an automated way to ensure this doesn't happen would be to append the git sha to the version (if helm supports that).
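As a rough illustration of the "append the git sha" idea, the sketch below builds a version string using SemVer 2.0.0 build metadata (`+<sha>`). The helper name is hypothetical. One caveat to keep in mind: SemVer says build metadata is ignored when comparing version precedence, so whether Nexus and ArgoCD would treat such a chart as a distinct version is an open question that would need testing.

```python
import re

# Plain MAJOR.MINOR.PATCH, e.g. "1.3.2".
SEMVER_CORE = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")
# Abbreviated or full git commit hash in lowercase hex.
GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")


def version_with_sha(base_version: str, git_sha: str, length: int = 7) -> str:
    """Return base_version with a short git sha as SemVer build metadata.

    Hypothetical helper, e.g. "1.3.2" plus a commit hash gives "1.3.2+<sha>".
    """
    if not SEMVER_CORE.match(base_version):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {base_version!r}")
    sha = git_sha.strip().lower()
    if not GIT_SHA.match(sha):
        raise ValueError(f"not a git commit hash: {git_sha!r}")
    return f"{base_version}+{sha[:length]}"
```

In a pipeline, the second argument would typically come from something like `git rev-parse HEAD`, so every commit on main would yield a different chart version string without a manual bump.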
@eformat - if you remember, we encountered this issue when writing the book. We were changing values files but not updating the version, and argocd was not seeing the change. I think the way around this that we implemented was changing the labels on the resources to contain a value from the values file or something like that.