监控-08-skywalking监控告警


前言

skywalking根据监控规则,通过webhook调后端微服务接口,从而发送告警邮件。


一、准备

根据上几章内容,保证skywalking能监控到linux数据:
在这里插入图片描述

根据上一章内容,启动微服务,保证skywalkingNotifyToQqEmail可用。
在这里插入图片描述

二、配置skywalking

2.1 修改alarm-settings.yml

cd /opt/skywalking/apache-skywalking-apm-bin/config
vim alarm-settings.yml

将hooks注释去掉,url修改为自己微服务地址。
在这里插入图片描述

新增1条监控规则,监控cpu使用率过高。为了测试,修改为超过20%则告警。
在这里插入图片描述

完整的配置文件:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    # A MQE expression, the result type must be `SINGLE_VALUE` and the root operation of the expression must be a Compare Operation
    # which provides `1`(true) or `0`(false) result. When the result is `1`(true), the alarm will be triggered.
    expression: sum(service_resp_time > 1000) >= 3
    period: 10
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
#  service_resp_time_rule:
#    expression: avg(service_resp_time) > 1000
#    period: 10
#    silence-period: 5
#    message: Avg response time of service {name} is more than 1000ms in last 10 minutes.
  service_sla_rule:
    expression: sum(service_sla < 8000) >= 2
    # The length of time to evaluate the metrics
    period: 10
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    expression: sum(service_percentile{p='50,75,90,95,99'} > 1000) >= 3
    period: 10
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    expression: sum(service_instance_resp_time > 1000) >= 2
    period: 10
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    expression: sum(database_access_resp_time > 1000) >= 2
    period: 10
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    expression: sum(endpoint_relation_resp_time > 1000) >= 2
    period: 10
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
  cpu_usage_high_rule:
    # MQE 表达式,返回 1 (true) 或 0 (false) 的结果。这里表示当 CPU 利用率超过 20% 时,触发告警。
    expression: avg(meter_vm_cpu_total_percentage) > 20
    # 每隔 1 分钟评估一次
    period: 1
    # 静默时间,告警触发后 3 分钟内不重复发送
    silence-period: 3
    # 告警消息
    message: CPU usage of {name} is above 200%.
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_resp_time_rule:
#    expression: sum(endpoint_resp_time > 1000) >= 2
#    period: 10
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

hooks:
  webhook:
    default:
      is-default: true
      urls:
        - http://10.211.55.2:8001/api/test/skywalkingNotifyToQqEmail
#        - http://127.0.0.1/go-wechat/

2.2 重启skywalking

ps -ef | grep skywalking
kill 8295 8296
sh startup.sh

三、收到告警邮件

重启skywalking后,告警规则即生效。过一会就回收到告警邮件。

在这里插入图片描述
在这里插入图片描述
微服务日志里也会打印日志:
在这里插入图片描述


总结

如果收不到邮件,则从头检查下链路。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

搬砖天才、

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值