Finalise work done for Seldon Core Operator during Observability Workshop
Work items are tracked in https://warthogs.atlassian.net/browse/KF-775
Branch: https://github.com/canonical/seldon-core-operator/tree/kf-775-gh52-feat-alert-rules
Prometheus deployment: https://github.com/canonical/prometheus-k8s-operator
Design
Failure alerts are implemented through integration with the Prometheus charm from the Canonical Observability Stack. Prometheus creates scrape jobs based on the configuration and alert rules defined by the Seldon Core Operator charm. It then scrapes the targets, retrieves the defined metrics, and performs the calculations required by the alert rules.
Testing
- Set up a MicroK8s cluster and a Juju controller:
microk8s enable dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
juju bootstrap microk8s uk8s
juju add-model test
- Deploy Prometheus and Seldon Core Operator and relate them:
juju deploy prometheus-k8s --trust
juju deploy ./seldon-core_ubuntu-20.04-amd64.charm seldon-controller-manager --trust --resource oci-image="docker.io/seldonio/seldon-core-operator:1.14.0"
juju relate prometheus-k8s seldon-controller-manager
- Navigate to the Prometheus dashboard at https://<Prometheus-unit-IP>:9090 and select Status->Targets. There should be a Prometheus scrape job entry that targets the Seldon metrics endpoint (http://<Seldon-Controller-Manager-IP>:8080/metrics) with no errors.
- Deploy a sample Seldon deployment in the same model, then navigate to Alerts to observe whether any failure alert is reported:
microk8s.kubectl -n test apply -f examples/serve-simple-v1.yaml
- To simulate a failure, delete the deployment that was created by Seldon and observe the alerts:
microk8s.kubectl -n test delete deploy/seldon-model-example-0-classifier
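The dashboard checks in the steps above (Status->Targets and Alerts) could also be scripted against the Prometheus HTTP API endpoints `/api/v1/targets` and `/api/v1/alerts`. The following is a minimal sketch, not part of the charm or its tests; the helper names `unhealthy_targets`, `firing_alerts`, `fetch`, and `check` are hypothetical, and the base URL is assumed to be the Prometheus unit address used above.

```python
from __future__ import annotations

import json
import urllib.request


def unhealthy_targets(targets_payload: dict) -> list[tuple[str, str]]:
    """Return (scrapeUrl, lastError) for every active target whose health is not 'up'."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


def firing_alerts(alerts_payload: dict) -> list[str]:
    """Return the sorted alert names of all alerts currently in the 'firing' state."""
    alerts = alerts_payload.get("data", {}).get("alerts", [])
    return sorted(
        a.get("labels", {}).get("alertname", "")
        for a in alerts
        if a.get("state") == "firing"
    )


def fetch(base_url: str, path: str) -> dict:
    """GET a Prometheus HTTP API path and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}{path}", timeout=10) as resp:
        return json.load(resp)


def check(base_url: str) -> None:
    """Print unhealthy scrape targets and firing alerts for one Prometheus instance."""
    # base_url is an assumption: the Prometheus unit address on port 9090.
    for url, err in unhealthy_targets(fetch(base_url, "/api/v1/targets")):
        print(f"target not up: {url} ({err})")
    for name in firing_alerts(fetch(base_url, "/api/v1/alerts")):
        print(f"firing alert: {name}")
```

Calling `check()` with the Prometheus unit address should print nothing while the sample deployment is healthy, and should list the failure alert some time after the deployment is deleted.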
NOTE: The alert window is 10 minutes and scraping is done once per minute. Make sure at least 2 minutes have passed so that the rate can be calculated properly.
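To illustrate why at least two scrapes are needed, here is a simplified sketch of a per-second rate over a time window. It is not Prometheus's actual `rate()` implementation (it ignores extrapolation and counter resets), and the function name `simple_rate` and the sample values are hypothetical. The 600-second default is assumed to correspond to the 10-minute alert window.

```python
from __future__ import annotations


def simple_rate(
    samples: list[tuple[float, float]], now: float, window_s: float = 600.0
) -> float | None:
    """Per-second increase between the first and last sample inside the window.

    samples are (timestamp_seconds, counter_value) pairs. Returns None when
    fewer than two samples fall inside the window, since no rate can be
    derived from a single point.
    """
    in_window = sorted((t, v) for t, v in samples if now - window_s <= t <= now)
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    if t1 == t0:
        return None
    return (v1 - v0) / (t1 - t0)


# One scrape only: no rate can be computed yet.
print(simple_rate([(0.0, 5.0)], now=30.0))
# Two scrapes one minute apart: a rate is available.
print(simple_rate([(0.0, 5.0), (60.0, 8.0)], now=90.0))
```

With a one-minute scrape interval, the second sample only arrives after the second scrape, which is why an alert based on a rate cannot fire immediately after the deployment is deleted.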
Closing this issue.