wutz/blog

A Simple Alerting Setup

wutz opened this issue · 0 comments

Background

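The usual alert delivery stack is prometheus -> alertmanager -> slack/wechat. In practice this stack takes a fair amount of time to maintain and needs sustained iteration: some alerts are false positives, some flap, and some need no action at all (they are really notifications), and sorting these out costs time.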

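If the alerts themselves go unmaintained for long, alerting stops being useful, so we need an alerting approach that is cheap to maintain. Using a simple alerting setup does not mean giving up the monitoring stack itself (prometheus+grafana), which remains useful for troubleshooting.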

If you have the time and energy, you should still take the proper route (prometheus+alertmanager+karma+grafana) and keep iterating on and maintaining it.

Below are 2 simple ways to implement alerting.

Push alerts by running a command periodically and checking for changes

The example below monitors changes in ceph storage health by running ceph health detail.

  1. Create a script that checks for ceph health changes

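    $ sudo tee /usr/local/bin/ceph-health-check > /dev/null << 'EOF'
    #!/usr/bin/env bash
    # Push a WeChat alert whenever the output of `ceph health detail` changes.
    # Requires jq, which JSON-escapes the quotes and newlines in the diff.
    
    set -u
    #set -x
    
    PREV=/dev/shm/ceph-health-prev
    NEXT=/dev/shm/ceph-health-next
    WEBHOOK='https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<replace-your-token>'
    
    # On the first run there is no earlier snapshot, so nothing is sent
    [[ -f $NEXT ]] && mv "$NEXT" "$PREV"
    #echo "THIS IS A TEST!" >> "$PREV"
    ceph health detail > "$NEXT"
    changed=$(diff -u "$PREV" "$NEXT" 2>/dev/null)
    if [[ -n $changed ]]; then
            payload=$(jq -cn --arg content "$changed" '{msgtype: "text", text: {content: $content}}')
            curl -s -H 'Content-Type: application/json' -d "$payload" "$WEBHOOK"
    fi
    EOF
    $ sudo chmod +x /usr/local/bin/ceph-health-check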
    • Remember to replace <replace-your-token> in the script with your actual value
  2. Use a systemd timer to run it periodically

    $ sudo tee /etc/systemd/system/ceph-health-check.service > /dev/null << 'EOF'
    [Unit]
    Description=Ceph Health Check
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/ceph-health-check
    
    [Install]
    WantedBy=multi-user.target
    EOF
    $ sudo tee /etc/systemd/system/ceph-health-check.timer > /dev/null << 'EOF'
    [Unit]
    Description=Run ceph-health-check.service 10sec after boot and every 1min relative to activation time
    
    [Timer]
    OnBootSec=10sec
    OnUnitActiveSec=1min
    Unit=ceph-health-check.service
    
    [Install]
    WantedBy=timers.target
    EOF
    $ sudo systemctl enable ceph-health-check.timer --now
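
The change detection above can be tried without a ceph cluster or a webhook. The following is a minimal sketch of the same snapshot-and-diff idea, with the command under watch passed in as arguments. The function name snapshot_diff and the prev/next file layout are made up for illustration and are not part of the script above.

```shell
#!/usr/bin/env bash
# Sketch of snapshot-and-diff change detection.
# snapshot_diff and the prev/next file names are hypothetical.

snapshot_diff() {
        local state_dir="$1"
        shift
        local prev="$state_dir/prev"
        local next="$state_dir/next"

        # Rotate the last snapshot, if any, so it becomes the baseline
        if [ -f "$next" ]; then mv "$next" "$prev"; fi
        # Take a fresh snapshot by running the given command
        "$@" > "$next"
        # First run: no baseline yet, so report nothing
        [ -f "$prev" ] || return 0
        # diff exits non-zero when the files differ; that is not an error here
        diff -u "$prev" "$next" || true
}

# Demo with echo standing in for `ceph health detail`
state=$(mktemp -d)
snapshot_diff "$state" echo HEALTH_OK     # first run: prints nothing
snapshot_diff "$state" echo HEALTH_OK     # unchanged: prints nothing
snapshot_diff "$state" echo HEALTH_WARN   # changed: prints a unified diff
rm -rf "$state"
```

Only the third call produces output, which is the point of the approach: a steady state stays silent, so there are no thresholds or silences to tune.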

Push alerts by watching log updates and filtering for keywords

The example below watches the rabbitmq log and filters for cluster partition problems.

  1. Create a script that watches the rabbitmq log for changes

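    $ sudo tee /usr/local/bin/rabbitmq-health-check > /dev/null << 'EOF'
    #!/usr/bin/env bash
    # Requires jq, which JSON-escapes each log line; tail -F survives log rotation.
    
    tail -n0 -F "$LOGGING_FILE" \
            | grep --line-buffered -E "$EGREP_PATTERNS" \
            | while IFS= read -r line; do
                    jq -cn --arg content "[node01] $line" \
                            '{msgtype: "text", text: {content: $content, mentioned_list: ["@all"]}}' \
                            | curl -s -H 'Content-Type: application/json' -d @- "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=$WECHAT_KEY"
            done
    EOF
    $ sudo chmod +x /usr/local/bin/rabbitmq-health-check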
  2. Create a systemd service that watches the log and pushes notifications

    $ sudo tee /etc/default/rabbitmq-health-check > /dev/null << 'EOF'
    LOGGING_FILE="/var/log/kolla/rabbitmq/rabbit@node01.log"
    EGREP_PATTERNS="net_tick_timeout|Partial partition detected|Cluster minority/secondary status detected|inconsistent_database"
    WECHAT_KEY="<replace-your-token>"
    EOF
    $ sudo tee /etc/systemd/system/rabbitmq-health-check.service > /dev/null << 'EOF'
    [Unit]
    Description=RabbitMQ Health Check
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    Restart=on-failure
    EnvironmentFile=-/etc/default/rabbitmq-health-check
    ExecStart=/usr/local/bin/rabbitmq-health-check
    
    [Install]
    WantedBy=multi-user.target
    EOF
    $ sudo systemctl enable rabbitmq-health-check --now
    • Remember to replace <replace-your-token> in /etc/default/rabbitmq-health-check with your actual value
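
Before pointing the watcher at a real log and a real webhook, it can help to dry-run the filter stage on its own. The sketch below separates filtering from notification so that the handler can be swapped for a plain printf. The names watch_lines and print_alert are hypothetical, and the sample log lines are invented for the demo.

```shell
#!/usr/bin/env bash
# Sketch: filter lines from stdin and hand each match to a handler.
# watch_lines and print_alert are hypothetical names.

watch_lines() {
        local pattern="$1"
        local handler="$2"
        # --line-buffered makes grep emit each match immediately,
        # which matters when stdin is a live `tail -F`
        grep --line-buffered -E "$pattern" | while IFS= read -r line; do
                "$handler" "$line"
        done
}

# Stand-in for the curl call: print the alert instead of sending it
print_alert() {
        printf 'ALERT: %s\n' "$1"
}

# Feed three invented log lines through the filter
printf '%s\n' \
        'connection accepted' \
        'node rabbit@node02 down: net_tick_timeout' \
        'Partial partition detected: "rabbit@node03"' \
        | watch_lines 'net_tick_timeout|Partial partition detected' print_alert
```

Reading each line with `while IFS= read -r` passes quotes and backslashes through untouched, so a log line containing quotes reaches the handler intact.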