A simple alerting approach
wutz opened this issue · 0 comments
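Background
The usual alert delivery stack is prometheus -> alertmanager -> slack/wechat. In practice this stack takes considerable time to maintain and needs sustained iteration. For example, some alerts are false positives, some flap, and some need no action (they are really notifications); dealing with each of these costs maintenance time.
If the alerts themselves go unmaintained for a long time, alerting stops being effective, so an approach with a lower maintenance cost is needed. Adopting a simple alerting approach does not mean giving up the monitoring stack itself (prometheus+grafana), which remains useful for troubleshooting.
If you have more capacity, you should still take the proper route (prometheus+alertmanager+karma+grafana) and keep iterating on and maintaining it.
Two simple ways to implement alerting are listed below.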
Push alerts by periodically running a command and checking for changes
The example below runs `ceph health detail` to monitor changes in Ceph storage health.
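The underlying pattern is independent of Ceph: run a command, keep its previous output, and report only the difference. A minimal sketch of that pattern follows. The names `STATE_DIR` and `report_change` are illustrative assumptions and do not appear in the original scripts.

```shell
#!/usr/bin/env bash
# Sketch only: STATE_DIR and report_change are illustrative names,
# not part of the original issue.
STATE_DIR=${STATE_DIR:-/dev/shm}

# report_change NAME CMD [ARGS...]
# Runs CMD, stores its output, and prints a unified diff against the
# previous run. Prints nothing on the first run or when nothing changed.
report_change() {
  local name=$1; shift
  local prev="$STATE_DIR/$name-prev" next="$STATE_DIR/$name-next"

  # Rotate the last result so it becomes the baseline for this run.
  if [ -f "$next" ]; then mv "$next" "$prev"; fi
  "$@" > "$next" 2>&1

  # First run: there is no baseline yet, so there is nothing to compare.
  [ -f "$prev" ] || return 0

  # diff exits 1 when the files differ; that is expected, not an error.
  diff -u "$prev" "$next" || true
}
```

A caller could then write `changed=$(report_change ceph-health ceph health detail)` and send a notification only when `$changed` is non-empty.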
- Create the script that checks for Ceph health changes
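```shell
# "sudo cat > file" would open the file as the calling user, so write it with sudo tee.
$ sudo tee /usr/local/bin/ceph-health-check > /dev/null << 'EOF'
#!/usr/bin/env bash
set -u
#set -x

PREV=/dev/shm/ceph-health-prev
NEXT=/dev/shm/ceph-health-next

# Rotate the previous result; on the first run there is nothing to rotate.
if [[ -f $NEXT ]]; then mv "$NEXT" "$PREV"; fi
#echo "THIS IS A TEST!" >> $PREV   # uncomment to force a diff when testing
ceph health detail > "$NEXT"

# First run: no baseline to compare with yet.
[[ -f $PREV ]] || exit 0

# diff exits non-zero when the files differ; only its output matters here.
changed=$(diff -u "$PREV" "$NEXT")
if [[ -n $changed ]]; then
  # NOTE: $changed is interpolated as-is; quotes and newlines in it are not JSON-escaped.
  curl -s -d "{\"msgtype\":\"text\", \"text\":{\"content\":\"$changed\"}}" \
    "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<replace-your-token>"
fi
EOF
$ sudo chmod +x /usr/local/bin/ceph-health-check
```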
  - Replace `<replace-your-token>` in the script with your actual value.
- Use a systemd timer to run it periodically
```shell
$ sudo tee /etc/systemd/system/ceph-health-check.service > /dev/null << 'EOF'
[Unit]
Description=Ceph Health Check
Requires=network.target
After=network.target

[Service]
ExecStart=/usr/local/bin/ceph-health-check

[Install]
WantedBy=multi-user.target
EOF
$ sudo tee /etc/systemd/system/ceph-health-check.timer > /dev/null << 'EOF'
[Unit]
Description=Run ceph-health-check.service 10sec after boot and every 1min relative to activation time

[Timer]
OnBootSec=10sec
OnUnitActiveSec=1min
Unit=ceph-health-check.service

[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl enable ceph-health-check.timer --now
```
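One caveat with the check script: the diff output spans several lines and may contain double quotes, and a raw newline or quote inside a JSON string makes the payload invalid. A minimal sketch of escaping in pure bash follows. `json_escape` and `build_payload` are hypothetical helper names; only backslash, double quote, newline, carriage return, and tab are handled, and other control characters are left as-is.

```shell
#!/usr/bin/env bash
# Sketch only: json_escape and build_payload are illustrative helpers,
# not part of the original issue.

# Escape a string so it can sit inside a JSON string literal.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}      # backslashes first, so later escapes are not doubled
  s=${s//\"/\\\"}      # double quotes
  s=${s//$'\n'/\\n}    # newlines
  s=${s//$'\r'/\\r}    # carriage returns
  s=${s//$'\t'/\\t}    # tabs
  printf '%s' "$s"
}

# Build the text-message payload used by the webhook call.
build_payload() {
  printf '{"msgtype":"text","text":{"content":"%s"}}' "$(json_escape "$1")"
}

# Possible use inside the check script, replacing the inline JSON string:
#   curl -s -d "$(build_payload "$changed")" \
#     "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<replace-your-token>"
```

If `jq` is available on the host, generating the payload with it would be a sturdier choice than hand-written escaping.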
Push alerts by watching a log for updates and filtering on keywords
The example below watches the RabbitMQ log and filters for lines that indicate a cluster partition.
- Create the script that watches the RabbitMQ log for changes
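```shell
$ sudo tee /usr/local/bin/rabbitmq-health-check > /dev/null << 'EOF'
#!/usr/bin/bash
# -F keeps following the file by name after log rotation.
# -d '\n' (GNU xargs) stops xargs from interpreting quotes and backslashes in log lines.
tail -n0 -F "$LOGGING_FILE" \
  | grep --line-buffered -E "$EGREP_PATTERNS" \
  | xargs -d '\n' -I {} curl -s \
      -d '{"msgtype":"text", "text":{"content":"[node01] {}","mentioned_list":["@all"]}}' \
      "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=$WECHAT_KEY"
EOF
# The script must be executable, otherwise the systemd service cannot start it.
$ sudo chmod +x /usr/local/bin/rabbitmq-health-check
```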
- Create a systemd service that watches the log and pushes notifications
```shell
$ sudo tee /etc/default/rabbitmq-health-check > /dev/null << 'EOF'
LOGGING_FILE="/var/log/kolla/rabbitmq/rabbit@node01.log"
EGREP_PATTERNS="net_tick_timeout|Partial partition detected|Cluster minority/secondary status detected|inconsistent_database"
WECHAT_KEY="<replace-your-token>"
EOF
$ sudo tee /etc/systemd/system/rabbitmq-health-check.service > /dev/null << 'EOF'
[Unit]
Description=RabbitMQ Health Check
Requires=network.target
After=network.target

[Service]
Restart=on-failure
EnvironmentFile=-/etc/default/rabbitmq-health-check
ExecStart=/usr/local/bin/rabbitmq-health-check

[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl enable rabbitmq-health-check --now
```
  - Replace `<replace-your-token>` in `/etc/default/rabbitmq-health-check` with your actual value.
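Log lines are pasted into the JSON body unescaped, so a line containing a double quote or backslash yields an invalid payload. One possible alternative to the `xargs` stage is a `while read` loop that escapes each line before formatting it. This is a sketch under assumptions: `alert_payloads` is a hypothetical name, the node label is passed as a parameter instead of being hard-coded, and only backslashes and double quotes are escaped because each log line is a single line.

```shell
#!/usr/bin/env bash
# Sketch only: alert_payloads is an illustrative name, not from the original.

# alert_payloads PATTERNS NODE
# Reads log lines on stdin, keeps those matching the extended regex PATTERNS,
# and prints one text-message payload per matching line.
alert_payloads() {
  local patterns=$1 node=$2 line
  grep --line-buffered -E "$patterns" | while IFS= read -r line; do
    line=${line//\\/\\\\}   # escape backslashes first
    line=${line//\"/\\\"}   # then double quotes
    printf '{"msgtype":"text","text":{"content":"[%s] %s","mentioned_list":["@all"]}}\n' \
      "$node" "$line"
  done
}

# Possible wiring in the service script (variables come from the EnvironmentFile):
#   tail -n0 -F "$LOGGING_FILE" \
#     | alert_payloads "$EGREP_PATTERNS" node01 \
#     | while IFS= read -r payload; do
#         curl -s -d "$payload" \
#           "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=$WECHAT_KEY"
#       done
```

Keeping payload construction separate from delivery also makes the filter easy to dry-run: pipe a saved log file through `alert_payloads` and inspect the output before pointing it at the webhook.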