canonical/seldon-core-operator

Finalise work done for Seldon Core Operator during Observability Workshop


Work items are tracked in https://warthogs.atlassian.net/browse/KF-775
Branch: https://github.com/canonical/seldon-core-operator/tree/kf-775-gh52-feat-alert-rules
Prometheus deployment: https://github.com/canonical/prometheus-k8s-operator

Design

Failure alerts are implemented through integration with the Prometheus charm from the Canonical Observability Stack. Prometheus creates scrape jobs based on the alert rules defined by the Seldon Core Operator charm. It then scrapes the targets, retrieves the defined metrics, and performs the required calculations.

Testing

  • Set up a MicroK8s cluster and Juju controller:
microk8s enable dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
juju bootstrap microk8s uk8s
juju add-model test
  • Deploy Prometheus and Seldon Core Operator, then relate them:
juju deploy prometheus-k8s --trust
juju deploy ./seldon-core_ubuntu-20.04-amd64.charm seldon-controller-manager --trust --resource oci-image="docker.io/seldonio/seldon-core-operator:1.14.0"
juju relate prometheus-k8s seldon-controller-manager
  • Navigate to the Prometheus dashboard at https://<Prometheus-unit-IP>:9090 and select Status->Targets.
    There should be a Prometheus scrape job entry that targets the Seldon metrics endpoint (http://<Seldon-Controller-Manager-IP>:8080/metrics) with no errors:
    Screenshot from 2022-11-15 05-44-31

  • Deploy a sample Seldon deployment in the same model, then navigate to Alerts to check whether any failure alert is reported:

microk8s.kubectl -n test apply -f examples/serve-simple-v1.yaml
  • To simulate a failure, delete the deployment that was created by Seldon and observe the alerts:
microk8s.kubectl -n test delete deploy/seldon-model-example-0-classifier
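
The Targets and Alerts pages used in the steps above can probably also be checked from the command line through the Prometheus HTTP API. The following is a minimal sketch, not part of the original work: it assumes plain HTTP on port 9090, that `curl` and `jq` are installed, and that `PROM_IP` holds the Prometheus unit IP (the default value here is a made-up placeholder). The helper names `prom_api_url`, `show_targets`, and `show_alerts` are hypothetical.

```shell
# Hypothetical placeholder; substitute the Prometheus unit IP shown by `juju status`.
PROM_IP="${PROM_IP:-10.1.0.10}"

# Build a Prometheus HTTP API URL for the given host and API path.
# Assumption: Prometheus is reachable over plain HTTP on port 9090.
prom_api_url() {
  printf 'http://%s:9090/api/v1/%s' "$1" "$2"
}

# List each active scrape target with its health and last error (needs curl and jq).
show_targets() {
  curl -s "$(prom_api_url "$1" targets)" \
    | jq -r '.data.activeTargets[] | [.scrapeUrl, .health, .lastError] | @tsv'
}

# List the name and state (pending or firing) of each active alert (needs curl and jq).
show_alerts() {
  curl -s "$(prom_api_url "$1" alerts)" \
    | jq -r '.data.alerts[] | [.labels.alertname, .state] | @tsv'
}

# Usage, once the relation is established:
#   show_targets "$PROM_IP"
#   show_alerts "$PROM_IP"
```

If the relation works, `show_targets` should list the Seldon metrics endpoint with health `up` and an empty last-error column, and `show_alerts` should list an alert some time after the deployment is deleted.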

NOTE: The alert window is 10 minutes and scraping is done once per minute. Make sure at least 2 minutes have passed so that the rate is calculated properly.
Screenshot from 2022-11-15 14-34-23
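
The timing in the note can be worked through as simple arithmetic. This sketch only restates the figures given in the note (60-second scrape interval, 10-minute window, two intervals of waiting). The variable and function names are illustrative and are not taken from the charm.

```shell
SCRAPE_INTERVAL=60   # seconds between scrapes, per the note
ALERT_WINDOW=600     # alert window in seconds (10 minutes), per the note
MIN_INTERVALS=2      # scrape intervals to wait before checking, per the note

# Seconds to wait after the simulated failure before checking the Alerts page.
min_wait_seconds() {
  echo $(( $1 * $2 ))
}

# Number of scrapes that fall into one full alert window.
samples_per_window() {
  echo $(( $2 / $1 ))
}

min_wait_seconds "$SCRAPE_INTERVAL" "$MIN_INTERVALS"     # prints 120
samples_per_window "$SCRAPE_INTERVAL" "$ALERT_WINDOW"    # prints 10
```

In other words, a rate needs more than one sample to compare, so checking the Alerts page sooner than 120 seconds after the deletion may show nothing even when the rule is working.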

Closing this issue.