canonical/notebook-operators

Finalise Notebooks alert rules work done during Observability Workshop


Finalise the work done for Jupyter Controller during the Observability Workshop.

Work items are tracked in https://warthogs.atlassian.net/browse/KF-827
Branch: https://github.com/canonical/notebook-operators/tree/kf-827-gh81-feat-alert-rules
Prometheus deployment: https://github.com/canonical/prometheus-k8s-operator

Design

Failure alerts are implemented through integration with the Prometheus charm from the Canonical Observability Stack. Prometheus creates scrape jobs based on the alert rules defined by the Jupyter Controller charm. It then scrapes the targets, retrieves the defined metrics, and performs the required calculations.

Testing

  • Set up a MicroK8s cluster and a Juju controller:
microk8s enable dns storage metallb:"10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111"
juju bootstrap microk8s uk8s
juju add-model test
  • Deploy Prometheus and Jupyter Controller, then relate them:
juju deploy prometheus-k8s --trust
juju deploy ./jupyter-controller_ubuntu-20.04-amd64.charm jupyter-controller --series kubernetes --trust --resource oci-image="docker.io/kubeflownotebookswg/notebook-controller:v1.6.1"
juju relate prometheus-k8s jupyter-controller

Final deployment should be:

Model  Controller  Cloud/Region        Version  SLA          Timestamp
test   uk8s        microk8s/localhost  2.9.34   unsupported  09:26:08-05:00

App                 Version                         Status  Scale  Charm               Channel  Rev  Address        Exposed  Message
jupyter-controller  .../notebook-controller:v1.6.1  active      1  jupyter-controller             0                 no       
prometheus-k8s      2.33.5                          active      1  prometheus-k8s      stable    79  10.152.183.15  no       

Unit                   Workload  Agent  Address     Ports  Message
jupyter-controller/0*  active    idle   10.1.59.86         
prometheus-k8s/0*      active    idle   10.1.59.85         

Relation provider                    Requirer                         Interface          Type     Message
jupyter-controller:metrics-endpoint  prometheus-k8s:metrics-endpoint  prometheus_scrape  regular  
prometheus-k8s:prometheus-peers      prometheus-k8s:prometheus-peers  prometheus_peers   peer    
  • Navigate to the Prometheus dashboard at https://<Prometheus-unit-IP>:9090 and select Status->Targets. There should be a Prometheus scrape job entry that targets the Jupyter Controller metrics endpoint (http://:8080/metrics) with no errors:
    Screenshot from 2022-12-09 09-26-48
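
The Targets check above can also be scripted against the Prometheus HTTP API (`/api/v1/targets`) instead of the dashboard. The following is a minimal sketch, not part of the charm or the original test procedure: the function names and the `PROM_URL` argument are hypothetical, and the address must be the same one used to reach the dashboard. It parses the response with `grep` only, so it assumes nothing beyond `curl` and a POSIX shell.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, and the URL passed to
# check_targets is an assumption (use the dashboard address from the step above).

# Reads a /api/v1/targets JSON response on stdin and prints how many
# targets report a health value other than "up".
count_unhealthy_targets() {
    grep -Eo '"health"[[:space:]]*:[[:space:]]*"[a-z]+"' | grep -vc '"up"' || true
}

# Fetches the target list from Prometheus and prints the unhealthy count.
# Not called automatically; run it as: check_targets "$PROM_URL"
check_targets() {
    curl -s "$1/api/v1/targets" | count_unhealthy_targets
}
```

A result of `0` from `check_targets` would correspond to a Targets page with no errors. Note that it also prints `0` if the request itself fails, so confirm that the endpoint is reachable first.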

The received alert rules can also be verified under the Alerts tab:
Screenshot from 2022-12-09 09-28-37
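
As a command-line alternative to the Alerts tab, the loaded rules can be queried through the Prometheus HTTP API (`/api/v1/rules`). This is a rough sketch under the same assumptions as the dashboard step: the function names are hypothetical, and the URL argument stands for the dashboard address. It only counts alerting rules and does not check their names or expressions.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, and the URL passed to
# check_rules is an assumption (use the dashboard address from the step above).

# Reads a /api/v1/rules JSON response on stdin and prints the number of
# rules whose type is "alerting" (recording rules are ignored).
count_alerting_rules() {
    grep -Eo '"type"[[:space:]]*:[[:space:]]*"alerting"' | grep -c . || true
}

# Fetches the rule groups from Prometheus and prints the alerting rule count.
# Not called automatically; run it as: check_rules "$PROM_URL"
check_rules() {
    curl -s "$1/api/v1/rules" | count_alerting_rules
}
```

A non-zero count suggests that Prometheus received alert rules over the relation. The Alerts tab remains the place to confirm which rules they are.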

PR is merged, closing.