使用Prometheus+Grafana+Alertmanager+Webhook-dingtalk搭建监控平台

一、监控平台介绍

1.监控平台简述

普罗米修斯四件套,分别为Prometheus、Grafana、Alertmanager、Webhook-DingTalk。


Prometheus一套开源的监控&报警&时间序列数据库的组合,由SoundCloud公司开发,广泛用于云原生环境和容器化应用的监控和性能分析。其提供了通用的数据模型和快捷数据采集、存储和查询接口。它的核心组件Prometheus-server会定期从静态配置的监控目标或者基于服务发现自动配置的自标中进行拉取数据,当新拉取到的数据大于配置的内存缓存区时,数据就会持久化到存储设备当中。


Grafana是一款用Go语言开发的开源数据可视化工具,可以做数据监控和数据统计,带有告警功能。它主要是用于可视化展示Prometheus存储的数据。


AlertManager是一个开源的告警管理工具,主要用于处理来自于监控系统Prometheus的告警,能够集中管理告警的路由、去重、分组和通知等操作。


Webhook-DingTalk主要支持通过Webhook接口将监控警报信息发送到钉钉群聊中,以便团队成员及时接收和处理警报。它还支持通过设置签名密钥来验证Webhook请求的来源,保障数据安全,同时也提供了自定义模板,可根据不同场景定制警报内容,适应多样化的业务需求。‌


2.项目功能

整个系统分为三部分:Prometheus指标数据采集存储端、Grafana可视化展示端、Alertmanager+Webhook-DingTalk告警通知端。实现对服务器、数据库、应用服务等资源的指标监控、使用图表可视化展示、使用指标告警等功能,同时通过钉钉机器人告警,把信息实时推送到钉钉群中。


3.项目开源地址

https://prometheus.io

二、平台搭建环境

1. 平台搭建环境


本次实测搭建了两种环境的平台。


A.基于联通云的centos7.6,使用docker-compose搭建,具体搭建内容为Prometheus+Grafana+Alertmanager+Webhook-dingtalk组成监控预警系统,配合组件node_export/nginx_export/mysql_export/redis_export/cadvisor监控业务系统。


B.基于国产云的arm64的KylinOS,使用源码部署并且配置成系统服务管理,具体搭建内容为Prometheus+Grafana+Alertmanager+Webhook-dingtalk/邮件组成监控预警系统,配合组件node_export/nginx_export/dameng_export/redis_export/BES_export监控业务系统。


2. 监控平台实施准备


A方案使用docker下载镜像,创建好平台组件以及需要挂载的路径,通过docker-compose启动容器,启动容器后查看容器启动状态,查看容器的运行日志是否正常。(dingtalk镜像没下载下来,直接安装包启动了)


B方案到prometheus.io下载好对应的普罗米修斯相关组件;达梦使用开源的dmdb_exporter-0.1-alpha,并且基于arm64环境编译好了启动包;宝蓝德内嵌jvm读取宝蓝德这边特有的jar包进行数据采集;全部组件均写入系统服务管理。

三、平台实施过程

A方案.

联通云上的centos7.6配合docker-compose搭建Prometheus+Grafana+Alertmanager+Webhook-dingtalk,监控组件node_export/nginx_export/mysql_export/redis_export/cadvisor


1.创建Prometheus的data和config

mkdir -p ./Prometheus/data ./Prometheus/config

2.创建Prometheus的配置文件

vim ./Prometheus/config/prometheus.yml

-----------------------------------------------------------------------------------------------------------------------

global:

  # 指定Prometheus抓取应用程序数据的间隔为15秒。

  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  #

  # Attach these labels to any time series or alerts when communicating with

  # external systems (federation, remote storage, Alertmanager).

  external_labels:

    monitor: 'line-monitor'

  # 普罗米修斯 规则文件

rule_files:

  - "./rules/*.yml"

  # prometheus自身的Alert功能是根据我们配置的 规则文件 进行触发的,但是它并没有告警发邮件的功能,发送邮件的这件事儿是由 Alertmanager来做的

alerting:

  alertmanagers:

  - static_configs:

      - targets:

        - "192.168.1.41:9093"

scrape_configs:

  # 普罗米修斯节点

  - job_name: 'prometheus'

    static_configs:

    - targets: ['192.168.1.41:9090']

  # tomcat的服务器检测节点

  - job_name: 'node-monitor-142jar'

    static_configs:

    - targets: ['192.168.1.41:9100'] #如部署在生产环境,此ip应为部署node-monitor的服务器ip

  # mysql的服务器检测节点

  - job_name: 'node-monitor-80sql'

    static_configs:

    - targets: ['192.168.1.91:9100'] #如部署在生产环境,此ip应为部署node-monitor的服务器ip

  # mysql的服务器检测节点监控mysql

  - job_name: 'mysql-monitor'

    static_configs:

    - targets: ['192.168.1.91:9104'] #如部署在生产环境,此ip应为部署node-monitor的服务器ip

  # tomcat的服务器检测节点监控redis

  - job_name: 'node-monitor-142redis'

    static_configs:

    - targets: ['192.168.1.41:9121'] #如部署在生产环境,此ip应为部署node-monitor的服务器ip

  # tomcat的服务器检测节点监控nginx

  - job_name: 'node-monitor-142nginx'

    static_configs:

    - targets: ['192.168.1.41:9113'] #如部署在生产环境,此ip应为部署node-monitor的服务器ip

  # tomcat的服务器检测节点监控tomcat

  - job_name: 'node-monitor-142tomcat'

    static_configs:

     - targets: ['192.168.1.41:30018'] #如部署在生产环境,此ip应为部署node-monitor的服务器ip

  - job_name: docker #增加对目标主机的监控

    scrape_interval: 5s

    static_configs:

    - targets: ['192.168.1.41:9200'] #目标主机的地址和端口

      labels:

        instance: docker

  - job_name: 'alertmanager' #增加对预警组件的监控

    scrape_interval: 15s

    static_configs:

    - targets: ['192.168.1.41:9093']

-----------------------------------------------------------------------------------------------------------------------

参数说明:

global:

  # 指定Prometheus抓取应用程序数据的间隔为15秒。

  scrape_interval:     15s

  external_labels:

    monitor: 'line-monitor'

  # 普罗米修斯,预警在alertmanagers的规则文件

rule_files:

  - "./rules/*.yml"

  # prometheus本身是没有预警功能的,触发的Alert是根据我们配置文件的,发送邮件、钉钉、企业微信是由alertmanagers

alerting:

  alertmanagers:

  - static_configs:

      - targets:

        - "192.168.1.41:9093"

scrape_configs:

  # 普罗米修斯节点

  - job_name: 'prometheus' #显示在普罗米修斯里面的节点名称

    static_configs:

    - targets: ['192.168.1.41:9090']  #这里配置的ip加端口就是让普罗米修斯能够按照ip和端口读到对应的采集数据的插件

3.创建Alertmanager的配置文件

vim ./Prometheus/config/alertmanager.yml

-----------------------------------------------------------------------------------------------------------------------    

# 全局配置项

global:

  resolve_timeout: 5m  #恢复时候通知时间

#  smtp_smarthost: 'smtphz.qiye.163.com:25'      # smtp地址

  smtp_smarthost: 'smtp.qq.com:587'

  smtp_from: '123456@qq.com'          # 谁发邮件

  smtp_auth_username: '123456@qq.com' # 邮箱用户

  smtp_auth_password: '123456' # 邮箱客户端授权密码

  smtp_require_tls: false  # 进行tls验证

templates:

  - "/config/email.tmpl" # 定义email预警的模板

# 定义路由树信息

route:

  group_by: ['alertname'] # 报警分组依据

  group_wait: 5s # 最初即第一次等待多久时间发送一组警报的通知

  group_interval: 10s # 在发送新警报前的等待时间

  repeat_interval: 1m # 发送重复警报的周期 对于email配置中,此项不可以设置过低,否则将会由于邮件发送太多频繁,被smtp服务器拒绝

  receiver: email #默认使用邮件

  routes:

      - receiver: 'email' #邮件优先

        continue: true # 是否继续

      - receiver: 'dingding' #钉钉也需要通知

receivers: # 接收器列表,定义各种通知渠道如 email、webhook等

- name: 'email' # 邮件

  email_configs:

   - to: '123456@qq.com'  # 接收警报的email配置

     send_resolved: true  # 是否发送恢复的消息

     html: '{{ template "email.to.html" . }}' #使用email的预警模板

- name: 'dingding' #钉钉

  webhook_configs:

   - url: 'http://192.168.1.41:8060/dingtalk/webhook1/send' #这里需要部署钉钉的插件,把ip、端口、webhook1定义的名字改成对应的真实的参数

     send_resolved: true # 是否发送恢复的消息

inhibit_rules: # inhibit_rules 用于配置告警抑制规则。它可以避免发送无意义的重复告警通知

  - source_match:

      severity: 'critical'

    target_match:

      severity: 'warning'

    equal: ['alertname', 'dev', 'instance']

-----------------------------------------------------------------------------------------------------------------------

4.创建Alertmanager的邮箱预警模板

vim ./Prometheus/config/email.tmpl

-----------------------------------------------------------------------------------------------------------------------       

{{ define "email.from" }}xxx.com{{ end }}

    {{ define "email.to" }}xxx.com{{ end }}

    {{ define "email.to.html" }}

    {{- if gt (len .Alerts.Firing) 0 -}}

    {{ range .Alerts }}

    ========= 监控报警 =========<br>

    告警级别: {{ .Labels.severity }} <br>

    告警类型: {{ .Labels.alertname }} <br>

    告警主机: {{ .Labels.instance }} <br>

    告警主题: {{ .Annotations.summary }}  <br>

    告警详情: {{ .Annotations.description }} <br>

    触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>

    ========= = end =  =========<br>

    {{ end }}{{ end -}}

    {{- if gt (len .Alerts.Resolved) 0 -}}

    {{ range .Alerts }}

    ========= 告警恢复 =========<br>

    告警级别: {{ .Labels.severity }} <br>

    告警类型: {{ .Labels.alertname }} <br>

    告警主机: {{ .Labels.instance }} <br>

    告警主题: {{ .Annotations.summary }} <br>

    告警详情: {{ .Annotations.description }} <br>

    触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>

    恢复时间: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>

    ========= = end =  =========<br>

    {{ end }}{{ end -}}

    {{- end }

-----------------------------------------------------------------------------------------------------------------------   

5.创建Prometheus的预警规则文件

vim ./Prometheus/config/rules/rules.yml

-----------------------------------------------------------------------------------------------------------------------

groups:

- name: 服务器资源监控

  rules:

  - alert: 密码过期

    expr: user_expire_days_root < 3

    for: 5s

    labels:

      severity: 严重告警

    annotations:

      summary: "{{ $labels.instance }} 密码即将过期, 请尽快处理!"

          description: "{{ $labels.instance }} 密码即将过期,距离过期还有{{ $value }}天"  

      - alert: 内存使用率过高

        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80

        for: 3m 

        labels:

          severity: 严重告警

        annotations:

          summary: "{{ $labels.instance }} 内存使用率过高, 请尽快处理!"

          description: "{{ $labels.instance }}内存使用率超过80%,当前使用率{{ $value }}%."

      - alert: 服务器宕机

        expr: up == 0

        for: 1s

        labels:

          severity: 严重告警

        annotations:

          summary: "{{$labels.instance}} 服务器宕机, 请尽快处理!"

          description: "{{$labels.instance}} 服务器延时超过3分钟,当前状态{{ $value }}. "

      - alert: CPU高负荷

        expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90

        for: 1m

        labels:

          severity: 严重告警

        annotations:

          summary: "{{$labels.instance}} CPU使用率过高,请尽快处理!"

          description: "{{$labels.instance}} CPU使用大于90%,当前使用率{{ $value }}%. "

      - alert: 磁盘IO性能

        expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance,job)* 100 > 90

        for: 1m

        labels:

          severity: 严重告警

        annotations:

          summary: "{{$labels.instance}} 流入磁盘IO使用率过高,请尽快处理!"

          description: "{{$labels.instance}} 流入磁盘IO大于90%,当前使用率{{ $value }}%."

      - alert: 网络流入

        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance,job)) / 100) > 102400

        for: 5m

        labels:

          severity: 严重告警

        annotations:

          summary: "{{$labels.instance}} 流入网络带宽过高,请尽快处理!"

          description: "{{$labels.instance}} 流入网络带宽持续5分钟高于100M. RX带宽使用量{{$value}}."

      - alert: 网络流出

        expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance,job)) / 100) > 102400

        for: 5m

        labels:

          severity: 严重告警

        annotations:

          summary: "{{$labels.instance}} 流出网络带宽过高,请尽快处理!"

          description: "{{$labels.instance}} 流出网络带宽持续5分钟高于100M. RX带宽使用量{$value}}."

      - alert: TCP连接数

        expr: node_netstat_Tcp_CurrEstab > 10000

        for: 2m

        labels:

          severity: 严重告警

        annotations:

          summary: " TCP_ESTABLISHED过高!"

          description: "{{$labels.instance}} TCP_ESTABLISHED大于100%,当前使用率{{ $value }}%."

      - alert: 磁盘容量

        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 90

        for: 1m

        labels:

          severity: 严重告警

        annotations:

          summary: "{{$labels.mountpoint}} 磁盘分区使用率过高,请尽快处理!"

          description: "{{$labels.instance}} 磁盘分区使用大于90%,当前使用率{{ $value }}%."

-----------------------------------------------------------------------------------------------------------------------

6.创建Grafana的data和config

mkdir -p ./grafana/data ./grafana/config

7.创建Grafana的配置文件

vim ./grafana/config/grafana.ini

-----------------------------------------------------------------------------------------------------------------------

[server]

# Protocol (http, https, h2, socket)

protocol = https

#cert_file = /etc/grafana/baidu.cn_bundle.crt

#cert_key  = /etc/grafana/baidu.key

;

;# This is the minimum TLS version allowed. By default, this value is empty. Accepted values are: TLS1.2, TLS1.3. If nothing is set TLS1.2 would be taken

;;min_tls_version = ""

;

;# The ip address to bind to, empty will bind to all interfaces

;;http_addr =

;

;# The http port  to use

http_port = 3000

;

;# The public facing domain name used to access grafana from a browser

domain = www.baidu.com

;

;# Redirect to correct domain if host header does not match domain

;# Prevents DNS rebinding attacks

;;enforce_domain = false

;

;# The full public facing url you use in browser, used for redirects and emails

;# If you use reverse proxy and sub path specify full url (with sub path)

root_url = %(protocol)s://%(domain)s:%(http_port)s/grafana/

;

;# Serve Grafana from subpath specified in `root_url` setting. By default it is set to `false` for compatibility reasons.

serve_from_sub_path = true

;

;# Log web requests

;;router_logging = false

;

;# the path relative working path

;;static_root_path = public

;

;

[auth.proxy]

enabled = true

header_name = X-WEBAUTH-USER

header_property = username

auto_sign_up = true

;sync_ttl = 60

whitelist =

;headers = Email:X-User-Email, Name:X-User-Name

# Non-ASCII strings in header values are encoded using quoted-printable encoding

;headers_encoded = false

# Read the auth proxy docs for details on what the setting below enables

enable_login_token = false

[smtp]

enabled = true

host = smtp.qq.com:587

user = 123456@qq.com

password = 123456

cert_file =

key_file =

skip_verify = false

from_address = 123456@qq.com

from_name = Grafana

ehlo_identity = 

startTLS_policy = 

[emails]

welcome_email_on_sign_up = true

templates_pattern = emails/*.html, emails/*.txt

content_types = text/html

-----------------------------------------------------------------------------------------------------------------------

8.因为nginx_export、redis_export、node_export、cadvisor没有什么特殊的设置需要,我就没有挂载出来了,下面提供docker-compose文件内容。使用这个配置文件,请把对应的挂载路径改成自己的路径

-----------------------------------------------------------------------------------------------------------------------

services:

  prometheus:

    image: prom/prometheus:latest

    container_name: prometheus

    hostname: prometheus

    restart: always

    ports:

      - 9090:9090

    volumes:

      - /data/prometheus/config:/config

      - /data/prometheus/data/prometheus:/prometheus/data

      - /etc/localtime:/etc/localtime:ro

      - /etc/timezone:/etc/timezone

    command:

      - --config.file=/config/prometheus.yml

      - --web.enable-lifecycle

  alertmanager:

    image: prom/alertmanager:latest

    container_name: altermanager

    hostname: altermanager

    restart: always

    ports:

      - 9093:9093

    volumes:

      - /data/prometheus/config:/config

      - /data/prometheus/data/alertmanager:/alertmanager/data

      - /etc/timezone:/etc/timezone

      - /etc/localtime:/etc/localtime:ro

    command:

      - --config.file=/config/alertmanager.yml

  grafana:

    image: grafana/grafana:latest

    container_name: grafana

    hostname: grafana

    restart: always

    ports:

      - 3000:3000

    volumes:

      - /data/grafana/config:/etc/grafana

      - /data/grafana/data:/var/lib/grafana

      - /etc/localtime:/etc/localtime:ro

      - /etc/timezone:/etc/timezone

  node-exporter:

    image: quay.io/prometheus/node-exporter

    container_name: node-exporter

    hostname: node-exporter

    restart: always

    ports:

      - 9100:9100

  redis_exproter:

    image: oliver006/redis_exporter

    container_name: redis_exporter

    restart: always

    ports:

      - 9121:9121

    environment:

      - REDIS_ADDR=redis://redis:6379

  nginx_exproter:

    image: nginx/nginx-prometheus-exporter

    container_name: nginx_exporter

    hostname: nginx_exporter

    command:

      - -nginx.scrape-uri=https://www.baidu.com/stub_status

    restart: always

    ports:

      - 9113:9113

  cadvisor:

    image: google/cadvisor

    container_name: cadvisor

    ports:

      - 9200:8080

    volumes:

      - /dev/disk/:/dev/disk:ro

      - /data/docker/:/var/lib/docker:ro

      - /sys:/sys:ro

      - /var/run:/var/run:ro

      - /:/rootfs:ro

      - /etc/timezone:/etc/timezone

-----------------------------------------------------------------------------------------------------------------------

9.启动docker-compose

docker compose up -d

启动后会自行下载镜像,下载失败的请使用加速器,如果有下载不了的可以自行翻qiang后去下载。

ps:如果实在下载不了的,可以在评论区写下邮箱,我把全部的镜像和源码包发过去。

10.docker下载完后,可以查看对应的镜像是否下载成功并运行

docker compose ps

NAME                IMAGE                             COMMAND                  SERVICE             CREATED             STATUS              PORTS

altermanager        prom/alertmanager:latest          "/bin/alertmanager -…"   alertmanager        4 days ago          Up 4 days           0.0.0.0:9093->9093/tcp, :::9093->9093/tcp

cadvisor            google/cadvisor                   "/usr/bin/cadvisor -…"   cadvisor            4 days ago          Up 3 days           0.0.0.0:9200->8080/tcp, :::9200->8080/tcp

grafana             grafana/grafana:latest            "/run.sh"                grafana             4 days ago          Up 4 days           0.0.0.0:3000->3000/tcp, :::3000->3000/tcp

nginx_exporter      nginx/nginx-prometheus-exporter   "/usr/bin/nginx-prom…"   nginx_exproter      4 days ago          Up 4 days           0.0.0.0:9113->9113/tcp, :::9113->9113/tcp

prometheus          prom/prometheus:latest            "/bin/prometheus --c…"   prometheus          4 days ago          Up 4 days           0.0.0.0:9090->9090/tcp, :::9090->9090/tcp

redis_exporter      oliver006/redis_exporter          "/redis_exporter"        redis_exproter      4 days ago          Up 4 days           0.0.0.0:9121->9121/tcp, :::9121->9121/tcp

mysql-exporter   prom/mysqld-exporter               "/bin/mysqld_exporter"   mysql-exporter   2 months ago   Up 2 months   0.0.0.0:9104->9104/tcp, :::9104->9104/tcp

node-exporter    quay.io/prometheus/node-exporter   "/bin/node_exporter"     node-exporter    2 months ago   Up 2 months   0.0.0.0:9100->9100/tcp, :::9100->9100/tcp

11.安装钉钉插件

先把插件解压到对应目录,然后把webhook做成系统服务去管理

vim /usr/lib/systemd/system/webhook.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=webhook

After=network.target

[Service]

Type=notify

ExecStart=/opt/ssjc/webhook/prometheus-webhook-dingtalk --config.file=/opt/ssjc/webhook/config.yml

Restart=on-failure

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

12.配置钉钉插件

vim ./config.yml

-----------------------------------------------------------------------------------------------------------------------    

## Request timeout

# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)

#no_builtin_template: true

## Customizable templates path

templates:

  - /data/webhook/template.tmpl

## You can also override default template using `default_message`

## The following example to use the 'legacy' template from v0.3.0

#default_message:

#  title: '{{ template "legacy.title" . }}'

#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"

targets:

  webhook1:

    url: https://oapi.dingtalk.com/robot/send?access_token=123123123 #去钉钉群,开启自定义机器人,勾选“加签”,点击完成会生成token地址和secret

    # secret for signature

    secret: 123123123

-----------------------------------------------------------------------------------------------------------------------    

13.配置钉钉预警模板

vim ./template.tmpl

-----------------------------------------------------------------------------------------------------------------------   

{{ define "__subject" }}

[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]

{{ end }}

{{ define "__alert_list" }}{{ range . }}

---

{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**告警主题**: {{ .Annotations.summary }}

**告警类型**: {{ .Labels.alertname }}

**告警级别**: {{ .Labels.severity }} 

**告警主机**: {{ .Labels.instance }} 

**告警信息**: {{ index .Annotations "description" }}

**告警时间**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

{{ end }}{{ end }}

{{ define "__resolved_list" }}{{ range . }}

---

{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**告警主题**: {{ .Annotations.summary }}

**告警类型**: {{ .Labels.alertname }} 

**告警级别**: {{ .Labels.severity }}

**告警主机**: {{ .Labels.instance }}

**告警信息**: {{ index .Annotations "description" }}

**告警时间**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**恢复时间**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

{{ end }}{{ end }}

{{ define "default.title" }}

{{ template "__subject" . }}

{{ end }}

{{ define "default.content" }}

{{ if gt (len .Alerts.Firing) 0 }}

**====侦测到{{ .Alerts.Firing | len  }}个故障====**

{{ template "__alert_list" .Alerts.Firing }}

---

{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}

**====恢复{{ .Alerts.Resolved | len  }}个故障====**

{{ template "__resolved_list" .Alerts.Resolved }}

{{ end }}

{{ end }}

{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}

{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}

{{ template "default.title" . }}

{{ template "default.content" . }}

-----------------------------------------------------------------------------------------------------------------------  

14.配置完成后启动webhook并且查看状态

systemctl start webhook && systemctl status webhook 

[root@2 prometheus-compose]# systemctl status webhook

● webhook.service - Prometheus-Server

   Loaded: loaded (/usr/lib/systemd/system/webhook.service; enabled; vendor preset: disabled)

   Active: active (running) since Tue 2024-07-16 08:42:16 CST; 1 weeks 0 days ago

 Main PID: 24585 (prometheus-webh)

    Tasks: 13

   Memory: 11.5M

   CGroup: /system.slice/webhook.service

           └─24585 /pd_data1/procedure/webhook/prometheus-webhook-dingtalk --config.file=/pd_data1/procedure/webhook/config.yml

Jul 18 17:12:18 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-18T09:12:18.301Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 18 17:13:28 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-18T09:13:28.267Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 18 17:14:38 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-18T09:14:38.295Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 18 17:15:48 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-18T09:15:48.256Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 18 17:16:58 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-18T09:16:58.209Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 18 17:17:18 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-18T09:17:18.261Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 18 17:44:18 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-18T09:44:18.291Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 18 17:45:18 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-18T09:45:18.292Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 19 16:36:18 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-19T08:36:18.114Z caller=entry.go:26 level=info component=web http_scheme=http h...Jul 19 16:38:18 2.novalocal prometheus-webhook-dingtalk[24585]: ts=2024-07-19T08:38:18.265Z caller=entry.go:26 level=info component=web http_scheme=http h...Hint: Some lines were ellipsized, use -l to show in full.

15.node节点增加密码过期监控预警设置步骤

a.先到node监控节点下,创建一个获取用户过期天数的脚本

cd /opt/ssjc/node_exporter-1.7.0.linux-arm64 && vim user_expire.sh

-----------------------------------------------------------------------------------------------------------------------  

#!/bin/bash

# 定义一个用户列表,用空格分隔

user_list="root"

# 定义一个输出文件的路径

output_file="/pd_data1/procedure/node_exporter-1.8.2.linuxamd64/user_expire.prom"

# 清空输出文件的内容

echo "" > $output_file

    today=$(date +%s)

# 遍历用户列表

for user in $user_list; do

    # 获取账户名为 user 的过期时间

    user_expire_date=$(chage -l $user | grep "Password expires" | cut -d: -f2)

    # 如果账户永不过期,返回 -99999

    if [ "$user_expire_date" = " never" ]; then

user_expire_day=99999

else

# 否则,将过期日期转换为秒

user_expire_day=$(date -d "$user_expire_date" +%s)

# 返回过期天数

expiration=$(date -d "$user_expire_date" +%s)

user_expire_days=$[(expiration-today)/86400]

    fi

    # 将结果追加到输出文件中,指标名为 user_expire_seconds_user

    echo "user_expire_day_$user $user_expire_day" >> $output_file

echo "user_expire_days_$user $user_expire_days" >> $output_file

done

-----------------------------------------------------------------------------------------------------------------------  

b.给执行权限

chmod +x ./user_expire.sh

c.增加定时任务,到最底下加一行

输入crontab -e

* * * * * /opt/ssjc/node_exporter-1.7.0.linux-arm64/user_expire.sh

d.重启node节点

cd /opt/ssjc/node_exporter-1.7.0.linux-arm64 && nohup ./node_exporter --collector.textfile.directory=/data/node_exporter-1.8.2.linux-amd64/ > nohup.out 2>&1 &

看到以下数据即可

user_expire_day_root 1.7214048e+09

# HELP user_expire_days_root Metric read from /opt/ssjc/node_exporter-

1.7.0.linux-arm64/user_expire.prom

# TYPE user_expire_days_root untyped

user_expire_days_root 1

e.由于在dockercompose上我是没有把node挂载出来,为了方便我直接把node节点down掉,然后重新使用源码包开了一个新的node节点然后使用系统服务去控制它


16.至此A方案上使用的docker-compose搭建Prometheus+Grafana+Alertmanager+Webhook-dingtalk,监控组件node_export/nginx_export/mysql_export/redis_export/cadvisor已经全部部署完成。

B方案.

1.启动Prometheus,修改对应的配置文件(prometheus.yml),配置内容跟上面A方案的基本一致,此处不再重写。

[root@master prometheus]# ll

total 232912

drwxr-xr-x  5 1001 docker       181 Jul 23 15:16 .

drwxrwxrwx  8 root root         145 Jul 19 18:00 ..

-rw-------  1 root root        1137 Jul 22 15:45 backup.rules

drwxr-xr-x  2 1001 docker        38 Nov 16  2023 console_libraries

drwxr-xr-x  2 1001 docker       173 Nov 16  2023 consoles

drwx------ 28 root root        4096 Jul 23 15:00 data

-rw-r--r--  1 1001 docker     11357 Nov 16  2023 LICENSE

-rw-------  1 root root       20965 Feb 21 15:00 nohup.out

-rw-r--r--  1 1001 docker      3773 Nov 16  2023 NOTICE

-rwxr-xr-x  1 1001 docker 122087953 Nov 16  2023 prometheus

-rw-r--r--  1 1001 docker      6711 Jul 22 17:24 prometheus.yml

-rwxr-xr-x  1 1001 docker 116351173 Nov 16  2023 promtool

加入系统服务管理

vim /usr/lib/systemd/system/prometheus.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=Prometheus Monitoring System

Documentation=Prometheus Monitoring System

[Service]

ExecStart=/opt/ssjc/prometheus/prometheus --config.file=/opt/ssjc/prometheus/prometheus.yml --web.listen-address=:8088   --storage.tsdb.path=/opt/ssjc/prometheus/data

Restart=on-failure

ExecReload=/bin/kill -HUP $MAINPID

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

对prometheus进行管理

systemctl start/restart/status prometheus

2.启动Alertmanager,修改对应的配置文件(alertmanager.yml、email.tmpl、rules.yml),配置内容跟上面A方案的基本一致,此处不再重写。

[root@master alertmanager]# ll

total 64032

drwxr-xr-x 3 1001 1002      140 Jul 23 10:05 .

drwxrwxrwx 8 root root      145 Jul 19 18:00 ..

-rwxr-xr-x 1 1001 1002 36231984 Feb 28 19:48 alertmanager

-rw-r--r-- 1 1001 1002     1501 Jul 22 15:11 alertmanager.yml

-rwxr-xr-x 1 1001 1002 29302849 Feb 28 19:49 amtool

drwx------ 2 root root       35 Jul 23 15:26 data

-rw------- 1 root root     1163 Jul 16 11:25 email.tmpl

-rw-r--r-- 1 1001 1002    11357 Feb 28 19:55 LICENSE

-rw-r--r-- 1 1001 1002      457 Feb 28 19:55 NOTICE

-rw------- 1 root root     3639 Jul 19 22:49 rules.yml

加入系统服务管理

vim /usr/lib/systemd/system/alertmanager.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=alertmanager server daemon

After=network.target

[Service]

ExecStart=/opt/ssjc/alertmanager/alertmanager --config.file=/opt/ssjc/alertmanager/alertmanager.yml --storage.path=/opt/ssjc/alertmanager/data

User=root

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

对alertmanager进行管理

systemctl start/restart/status alertmanager

3.启动grafana

[root@master system]# ll /opt/ssjc/grafana/

total 72

drwx------ 11 root root   227 Feb 21 09:59 .

drwxrwxrwx  8 root root   145 Jul 19 18:00 ..

drwxr-xr-x  2 root root    62 Nov 20  2023 bin

drwxr-xr-x  3 root root   107 Feb 21 09:51 conf

drwx------  7 root root    88 Jul 23 15:26 data

-rw-r--r--  1 root root  5436 Nov 20  2023 Dockerfile

drwx------  3 root root    21 Feb 21 09:43 docs

-rw-r--r--  1 root root 34523 Nov 20  2023 LICENSE

-rw-r--r--  1 root root   105 Nov 20  2023 NOTICE.md

drwxr-xr-x  2 root root   254 Nov 20  2023 npm-artifacts

drwx------  6 root root    58 Feb 21 09:43 packaging

drwxr-xr-x  3 root root    78 Nov 20  2023 plugins-bundled

drwxr-xr-x 16 root root   286 Nov 20  2023 public

-rw-r--r--  1 root root  3157 Nov 20  2023 README.md

drwxr-xr-x  7 root root 12288 Nov 20  2023 storybook

-rw-r--r--  1 root root     8 Nov 20  2023 VERSION

加入系统服务管理

vim /usr/lib/systemd/system/grafana-server.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=Grafana

After=network.target

[Service]

Type=notify

ExecStart=/opt/ssjc/grafana/bin/grafana-server -homepath /opt/ssjc/grafana

Restart=on-failure

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

对alertmanager进行管理

systemctl start/restart/status grafana-server

4.启动webhook,修改对应的配置文件(config.yml、template.tmpl),配置内容跟上面A方案的基本一致,此处不再重写。

[root@master system]# ll /opt/ssjc/webhook/

total 18208

drwxr-xr-x 3 3434 3434      136 Jul 16 13:37 .

drwxrwxrwx 8 root root      145 Jul 19 18:00 ..

-rw-r--r-- 1 3434 3434     1299 Apr 21  2022 config.example.yml

-rwxr-xr-x 1 root root     1517 Jul 16 13:37 config.yml

drwxr-xr-x 4 3434 3434       34 Apr 21  2022 contrib

-rw-r--r-- 1 3434 3434    11358 Apr 21  2022 LICENSE

-rwxr-xr-x 1 3434 3434 18618130 Apr 21  2022 prometheus-webhook-dingtalk

-rwxr-xr-x 1 root root     1740 Jul 16 10:39 template.tmpl

加入系统服务管理

vim /usr/lib/systemd/system/webhook.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=webhook

After=network.target

[Service]

Type=notify

ExecStart=/opt/ssjc/webhook/prometheus-webhook-dingtalk --config.file=/opt/ssjc/webhook/config.yml

Restart=on-failure

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

对webhook进行管理

systemctl start/restart/status webhook

5.启动node_export

[root@master system]# ll /opt/ssjc/node_exporter-1.7.0.linux-arm64/

total 27452

drwxr-xr-x 2 1001 1002      119 Jul 18 18:12 .

drwxrwxrwx 8 root root      145 Jul 19 18:00 ..

-rw-r--r-- 1 1001 1002    11357 Nov 13  2023 LICENSE

-rwxr-xr-x 1 1001 1002 19101790 Nov 13  2023 node_exporter

-rw------- 1 root root  8950860 Jul 23 15:38 nohup.out

-rw-r--r-- 1 1001 1002      463 Nov 13  2023 NOTICE

-rwx------ 1 root root       51 Jul 23 15:39 user_expire.prom

-rwx------ 1 root root      987 Jul 18 17:50 user_expire.sh

加入系统服务管理

vim /usr/lib/systemd/system/node_exporter.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=node_exporter

After=network.target

[Service]

Type=notify

ExecStart=/opt/ssjc/node_exporter-1.7.0.linux-arm64/node_exporter --collector.textfile.directory=/opt/ssjc/node_exporter-1.7.0.linux-arm64/

Restart=on-failure

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

对node_exporter进行管理

systemctl start/restart/status node_exporter

6.启动node_export

[root@master system]# ll /opt/ssjc/node_exporter-1.7.0.linux-arm64/

total 27452

drwxr-xr-x 2 1001 1002      119 Jul 18 18:12 .

drwxrwxrwx 8 root root      145 Jul 19 18:00 ..

-rw-r--r-- 1 1001 1002    11357 Nov 13  2023 LICENSE

-rwxr-xr-x 1 1001 1002 19101790 Nov 13  2023 node_exporter

-rw------- 1 root root  8950860 Jul 23 15:38 nohup.out

-rw-r--r-- 1 1001 1002      463 Nov 13  2023 NOTICE

-rwx------ 1 root root       51 Jul 23 15:39 user_expire.prom

-rwx------ 1 root root      987 Jul 18 17:50 user_expire.sh

加入系统服务管理

vim /usr/lib/systemd/system/node_exporter.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=node_exporter

After=network.target

[Service]

Type=notify

ExecStart=/opt/ssjc/node_exporter-1.7.0.linux-arm64/node_exporter --collector.textfile.directory=/opt/ssjc/node_exporter-1.7.0.linux-arm64/

Restart=on-failure

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

对node_exporter进行管理

systemctl start/restart/status node_exporter

7.启动redis_export

[root@host-192-168-2-92 redis_exporter]# ll

total 8356

drwxr-xr-x 2 root root      60 Jul 19 18:04 .

drwxrwxrwx 5 root root    4096 Jul 19 18:01 ..

-rw-r--r-- 1 root root    1063 Oct 20  2021 LICENSE

-rw-r--r-- 1 root root   25031 Oct 20  2021 README.md

-rwxr-xr-x 1 root root 8519680 Oct 20  2021 redis_exporter

加入系统服务管理

vim /usr/lib/systemd/system/redis_exporter.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=redis_exporter

After=network.target

[Service]

Type=notify

ExecStart=/opt/ssjc/redis_exporter/redis_exporter -redis.addr 192.168.2.92:6379 -redis.password 111111

Restart=on-failure

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

对redis_exporter进行管理

systemctl start/restart/status redis_exporter

8.启动dmdb_exporter

[root@host-192-168-2-92 dmdb_exporter-0.1-alpha]# ll

total 23372

drwxrwxr-x 4 root root      251 Jul 17 01:13 .

drwxrwxrwx 5 root root     4096 Jul 19 18:01 ..

-rw-rw-r-- 1 root root     1180 Sep 16  2020 default-metrics.toml

drwxrwxr-x 6 root root     4096 Sep 16  2020 dm

-rwx------ 1 root root 23825411 Jul 17 01:13 dmdb_exporter

-rw-rw-r-- 1 root root      520 Sep 16  2020 Dockerfile

-rw-rw-r-- 1 root root      319 Sep 16  2020 .gitignore

-rw-rw-r-- 1 root root      291 Sep 16  2020 go.mod

-rw-rw-r-- 1 root root    41346 Sep 16  2020 go.sum

drwxrwxr-x 2 root root       99 Sep 16  2020 .idea

-rw-rw-r-- 1 root root    13721 Sep 16  2020 main.go

-rw-rw-r-- 1 root root      338 Sep 16  2020 metric-dual-example.toml

-rw-rw-r-- 1 root root      497 Sep 16  2020 multi-metric-dual-example-labels.toml

-rw-rw-r-- 1 root root     8327 Sep 16  2020 README.md

配置文件default-metrics.toml,写了需要查询的表空间,其他需要加的话自己写sql

-----------------------------------------------------------------------------------------------------------------------

[[metric]]

context = "tablespace"

labels = [ "tablespace_name" ]

metricsdesc = { free_space = "Generic counter metric of tablespaces free space MB in DmService.", total_space = "Generic counter metric of tablespaces total space MB in DmService.", free_percent = "Generic counter metric of tablespaces free percent in DmService." }

request = '''

select

t.name tablespace_name,

sum(d.free_size*SF_GET_PAGE_SIZE()/1024/1024)  free_space,

sum(d.total_size*SF_GET_PAGE_SIZE()/1024/1024)  total_space,

sum(d.free_size*100)/sum(d.total_size)  free_percent

from

v$tablespace t,

    v$datafile d

where

t.id=d.group_id

group by t.name;

'''

-----------------------------------------------------------------------------------------------------------------------

加入系统服务管理

vim /usr/lib/systemd/system/dmdb_exporter.service

-----------------------------------------------------------------------------------------------------------------------

[Unit]

Description=dmdb_exporter

After=network.target

[Service]

# 配置环境变量

Environment=DATA_SOURCE_NAME=dm://sysdba:123456@localhost:5236?autoCommit=true

# 启动脚本

ExecStart=/opt/dmdb_exporter-0.1-alpha/dmdb_exporter --log.level=info --default.metrics=/opt/dmdb_exporter-0.1-alpha/default-metrics.toml --web.listen-address=:9161

Restart=always

RestartSec=5

StartLimitInterval=0

StartLimitBurst=10

[Install]

WantedBy=multi-user.target

-----------------------------------------------------------------------------------------------------------------------

对dmdb_exporter进行管理

systemctl start/restart/status dmdb_exporter

ps:达梦的这个开源插件需要使用go环境编译

# 进入dmdb_exporter目录

go build -o dmdb_exporter


9.至此B方案上基于国产云的arm64的KylinOS,使用源码部署并且配置成系统服务管理,具体搭建内容为Prometheus+Grafana+Alertmanager+Webhook-dingtalk/邮件组成监控预警系统,配合组件node_export/nginx_export/dameng_export/redis_export/BES_export监控业务系统。已经全部部署完成。

四、平台使用

1.Prometheus

使用ip加9090访问


a.点击Alerts,显示的是/config/rules/rules.yml > 服务器资源监控,显示的是你预警的规则

如:

name: 密码过期

expr: user_expire_days_root < 3

for: 5s

labels:

severity: 严重告警

annotations:

description: {{ $labels.instance }} 密码即将过期,距离过期还有{{ $value }}天

    summary: {{ $labels.instance }} 密码即将过期, 请尽快处理!


b.点击status下的targets,会显示当前普罗米修斯配置文件内监控的节点

如:

alertmanager (0/1 up)

alertmanager (1/1 up)


c.点击graph,可以自定义查询key-value,并且可以显示这个值的趋势图表

如:

mysql_global_variables_sql_quote_show_create{instance="192.168.1.23:9104", job="mysql-monitor"}


d.点击status下的rules,会显示当前预警规则的配置文件内容

如:

alert:CPU高负荷

expr:100 - (avg by (instance, job) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90

for: 1m

labels:

    severity: 严重告警

annotations:

    description: {{$labels.instance}} CPU使用大于90%,当前使用率{{ $value }}%.

        summary: {{$labels.instance}} CPU使用率过高,请尽快处理!


2.Grafana

使用ip加3000访问


先到datasource里面去连接prometheus\redis等,然后常用的grafana的模板如下:

mysql使用7236、11323;

redis连接后使用自带的模板;

nginx使用12708;

linux使用8919;

docker监控使用193;

达梦使用19109;


3.Alertmanager

使用ip加9093访问


首页显示的Alerts界面,如果有预警的内容,可以看到具体是哪个服务器发出的预警


4.dingtalk和邮件

在钉钉群的机器人,添加自定义机器人,通过webhook接入自定义服务。

邮件需要先开启运维邮件smtp以及相关授权码

五、监控平台模拟图

<think>嗯,用户想在Ubuntu上配置PrometheusGrafana,并设置钉钉告警。首先,我需要回忆一下PrometheusGrafana的安装步骤。Prometheus通常是通过下载二进制包,解压然后配置服务文件来安装的。Grafana的话,可能得添加APT仓库然后用apt安装,或者下载deb包。 然后,钉钉告警的话,PrometheusAlertmanager需要配置webhook指向钉钉机器人。记得钉钉机器人需要生成一个webhook地址,这个地址要放在Alertmanager配置里。同时,Grafana本身也支持钉钉告警,可能需要配置通知渠道,不过用户可能更倾向于用PrometheusAlertmanager来做,所以得确认一下步骤。 接下来,用户提到的引用中有关于Prometheus配置的内容,比如引用[2]里的scrape_configs,可能需要配置目标。还有引用[4]提到Pushgateway,但用户的问题里暂时没提到这个,可能暂时不需要考虑,但可以留作扩展。 安装Prometheus的步骤应该是:下载、解压、创建用户和目录、配置systemd服务。然后验证是否运行在9090端口。Grafana的话,安装后要启动服务,默认端口3000,然后添加Prometheus数据源。 然后是Alertmanager配置,下载安装,配置钉钉的webhook。需要创建一个alertmanager.yml文件,里面设置接收器和路由。同时,Prometheus配置文件需要指向Alertmanager,并配置告警规则。 钉钉机器人方面,用户需要在钉钉群里添加自定义机器人,获取webhook URL,可能还需要设置关键词或IP安全设置。然后把这个URL放到Alertmanager配置里,用Prometheus的模板生成告警信息。 测试部分,可能需要手动触发一个告警,比如让某个节点的exporter停止,看是否能够收到通知。或者使用一个测试的告警规则,比如up状态为0时触发。 注意事项方面,防火墙要开放相关端口,PrometheusAlertmanagerGrafana的端口分别是9090、9093、3000。权限问题,可能涉及文件权限和SELinux,但Ubuntu默认可能没开SELinux,所以主要检查文件权限和目录所有权。 还有,用户可能需要配置Prometheus的scrape_configs来监控目标,比如节点导出器,或者应用指标。这部分在引用[2]里有提到静态配置,可以指导用户如何添加监控目标。 最后,生成相关问题的时候,可以考虑扩展监控目标的添加、高可用部署、与其他通知集成的可能性,或者性能优化等问题。</think>### 分步指南:Ubuntu 系统搭建 Prometheus + Grafana 并集成钉钉告警 --- #### 一、安装 Prometheus 1. **下载并解压** ```bash wget https://github.com/prometheus/prometheus/releases/download/v2.53.2/prometheus-2.53.2.linux-amd64.tar.gz tar xvfz prometheus-*.tar.gz mv prometheus-* /opt/prometheus ``` 2. **创建系统用户与配置文件** ```bash sudo useradd --no-create-home --shell /bin/false prometheus sudo mkdir /etc/prometheus sudo cp /opt/prometheus/prometheus.yml /etc/prometheus/ ``` 3. **配置 systemd 服务** 创建文件 `/etc/systemd/system/prometheus.service`: ```ini [Unit] Description=Prometheus After=network.target [Service] User=prometheus ExecStart=/opt/prometheus/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/data Restart=always [Install] WantedBy=multi-user.target ``` 启动服务: ```bash sudo systemctl daemon-reload sudo systemctl start prometheus sudo systemctl enable prometheus ``` --- #### 二、安装 Grafana 1. **添加仓库并安装** ```bash sudo apt-get install -y apt-transport-https wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add - echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list sudo apt-get update sudo apt-get install grafana ``` 2. **启动服务** ```bash sudo systemctl start grafana-server sudo systemctl enable grafana-server ``` 3. **访问 Grafana** 浏览器打开 `http://<服务器IP>:3000`,默认账号/密码:`admin/admin`。 --- #### 三、配置钉钉告警 1. **创建钉钉机器人** - 在钉钉群设置中选择「机器人」→「自定义」→ 设置关键词(如 `Alert`)并记录 Webhook URL[^4]。 2. **安装 Alertmanager** ```bash wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz tar xvfz alertmanager-*.tar.gz mv alertmanager-*/alertmanager /usr/local/bin/ ``` 3. **配置 Alertmanager** 创建 `/etc/alertmanager/alertmanager.yml`: ```yaml route: group_by: ['alertname'] receiver: 'dingtalk' receivers: - name: 'dingtalk' webhook_configs: - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN' send_resolved: true ``` 4. **修改 Prometheus 告警规则** 在 `prometheus.yml` 中添加: ```yaml alerting: alertmanagers: - static_configs: - targets: ['localhost:9093'] rule_files: - /etc/prometheus/alert_rules.yml ``` 示例告警规则文件 `/etc/prometheus/alert_rules.yml`: ```yaml groups: - name: example rules: - alert: InstanceDown expr: up == 0 for: 1m labels: severity: critical annotations: summary: "Instance {{ $labels.instance }} down" ``` 5. **重启服务** ```bash sudo systemctl restart prometheus alertmanager ``` --- #### 四、验证告警 1. **触发测试告警** 停止某个被监控的 Exporter(如 Node Exporter),观察钉钉是否收到消息。 2. **Grafana 告警集成(可选)** 在 Grafana 中添加钉钉通知渠道: - 进入「Alerting」→「Notification channels」→ 选择「DingDing」类型并填写 Webhook URL。 --- #### 注意事项 1. **防火墙开放端口**:确保 `9090`(Prometheus)、`3000`(Grafana)、`9093`(Alertmanager)端口开放。 2. **权限问题**:若使用非默认路径,需确保目录权限归属正确[^2]。 3. **高可用扩展**:可通过 Pushgateway 处理短生命周期任务的指标采集[^4]。 --- §§ 1. 如何添加自定义监控目标到 Prometheus? 2. Prometheus 如何实现高可用部署? 3. Grafana 支持哪些其他告警通知渠道? 4. 如何优化 Prometheus 的存储性能? [^1]: [Grafana 作为可视化工具的优势](#) [^2]: [Prometheus 配置文件重载方法](#) : [Pushgateway 的作用与配置](#)
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

墨痕诉清风

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值