/sei-node-monitoring

Monitoring and Alerting for Sei nodes

Primary LanguageShell

Instructions on how to set up monitoring stack for your cosmos validator

Prerequisites

Install exporters on validator node

First of all you will have to install exporters on validator node. For that you can use one-liner below

wget -O install_exporters.sh https://raw.githubusercontent.com/kj89/cosmos_node_monitoring/master/install_exporters.sh && chmod +x install_exporters.sh && ./install_exporters.sh
KEY VALUE
bond_denom Denominated token name, for example, ubld for Agoric. You can find it in genesis file
bench_prefix Prefix for chain addresses, for example, agoric for Agoric. You can find it in public addresses like this agoricvaloper1zyyz4m9ytdf60fn9yaafx7uy7h463n7alv2ete

make sure following ports are open:

  • 9100 (node-exporter)
  • 9300 (cosmos-exporter)

Deployment

Monitoring stack needs to be deployed on seperate machine to be able to notify in case if validator goes down! To run monitoring stack you dont need beastly server with multiple cores. It will be more than enough to run it on smallest available vps

System requirements

Ubuntu 20.04 / 1 VCPU / 2 GB RAM / 20 GB SSD

Install monitoring stack

To install monitirng stack you can use one-liner below

wget -O install_monitoring.sh https://raw.githubusercontent.com/sei-protocol/sei-node-monitoring/master/install_monitoring.sh && chmod +x install_monitoring.sh && ./install_monitoring.sh $NODE_IP

Add validator into prometheus configuration file

To add validator use command with specified VALIDATOR_IP, VALOPER_ADDRESS, WALLET_ADDRESS and PROJECT_NAME

$HOME/sei-node-monitoring/add_validator.sh VALIDATOR_IP VALOPER_ADDRESS WALLET_ADDRESS PROJECT_NAME

example: $HOME/sei-node-monitoring/add_validator.sh 1.2.3.4 cosmosvaloper1s9rtstp8amx9vgsekhf3rk4rdr7qvg8dlxuy8v cosmos1s9rtstp8amx9vgsekhf3rk4rdr7qvg8d6jg3tl cosmos

To add more validators just run command above with validator values

Run docker compose

Deploy the monitoring stack

cd $HOME/sei-node-monitoring && docker compose up -d

ports used:

  • 8080 (alertmanager-bot)
  • 9090 (prometheus)
  • 9093 (alertmanager)
  • 9999 (grafana)

Configuration

Configure Grafana

  1. Open Grafana in your web browser. It should be available on port 9999

image

  1. Login using defaults admin/admin and change password

  2. Import custom dashboard

3.1. Press "+" icon on the left panel and then choose "Import"

image

3.2. Input grafana.com dashboard id 15991 and press "Load"

image

3.3. Select Prometheus data source and press "Import"

image

  1. Change your chain explorer url

4.1. Edit "Top validators missing blocks panel"

image

4.2. Go to "Overrides" and edit "Data links"

image

4.3 Change url to your chain data explorer and hit "Save"

image

4.4. Hit "Save" button on the left top corner to save changes to dashboard

  1. Congratulations you have successfully configured Cosmos Validator Dashboard

Configrure Pagerduty alerting

  1. Follow the steps here to set up Pagerduty configurations (service, schedule, integration key...etc)

  2. Next, configure the integration key in prometheus/alert_manager/alertmanager.yaml

Testing

Test alerts

  1. For simple test you can stop node-exporter service for 5 minutes. It should trigger alert
systemctl stop node_exporter
  1. You will see message from bot firing

image

  1. Now you can start node-exporter service back
systemctl start node_exporter
  1. You will get confirmation from bot that issue is resolved

image

Dashboard contents

Grafana dashboard is devided into 4 sections:

  • Validator health - main stats for validator health. connected peers and missed blocks

image

  • Chain health - summary of chain health stats and list of top validators missing blocks

image

  • Validator stats - information about validator such as rank, bounded tokens, comission, delegations and rewards

image

  • Hardware health - system hardware metrics. cpu, ram, network usage

image

Cleanup all container data

cd $HOME/cosmos_node_monitoring
docker compose down
docker volume prune -f

Reference list

Resources I used in this project: