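Notes on troubleshooting alertmanager's failure to send email

This article records in detail how a failure of alertmanager to send alert emails was resolved, including verifying the SMTP settings, adjusting the alertmanager configuration, interpreting the error messages, and the fixes applied. It covers SMTP authentication, a port problem, and a mismatch between configuration file names.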

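0 Overview

Environment

The alertmanager and the corresponding prometheus in this article are both deployed as containers on k8s (Huawei Cloud CCE).
To keep things short, basic operations such as restarting a pod or editing a configmap or deployment are omitted below.

To follow the errors and fixes below, the reader needs the following background:

- Container basics
  - Commands for editing a configmap or deployment
  - Other basic commands such as get / logs / delete
  - Understanding and configuring a deployment's volumes
  - Basic docker commands
- Python basics
- Prometheus monitoring basics

How to read this

What follows is the complete troubleshooting process I went through after alertmanager stopped sending alert emails, with the full error messages included. If you have time, read it through; if not, search the page for your own error message to locate the problem quickly.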

1 First verify the SMTP settings

I used the following script to confirm that my SMTP settings were OK and could actually send mail.

#!/usr/bin/python3
import smtplib
from email.mime.text import MIMEText

# Third-party SMTP service
mail_host = "smtp.xxx.com"  # SMTP server
mail_user = "xxx@xxx.com"  # mailbox address
mail_pass = "smtp_password"  # SMTP authorization password

sender = "xxx@xxx.com"  # sender address
receivers = ['xxx.yyy@zzz.com']  # recipient addresses


content = 'Python Send Mail !'
title = 'Python SMTP Mail Test'  # mail subject
message = MIMEText(content, 'plain', 'utf-8')  # body, subtype, charset
message['From'] = sender
message['To'] = ",".join(receivers)
message['Subject'] = title

try:
    smtpObj = smtplib.SMTP_SSL(mail_host, 465)  # send over SSL; the port is usually 465
    smtpObj.login(mail_user, mail_pass)  # authenticate
    smtpObj.sendmail(sender, receivers, message.as_string())  # send
    smtpObj.quit()  # close the connection
    print("mail has been sent successfully.")
except smtplib.SMTPException as e:
    print(e)
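
The script above hard-codes implicit TLS on port 465. As a rough sketch only, here is one way to pick the connection style from the port, assuming the common convention that 465 means implicit TLS and other submission ports use STARTTLS. The helper names `connection_mode` and `open_smtp` are made up for illustration, and your mail provider may behave differently.

```python
import smtplib


def connection_mode(port: int) -> str:
    """Guess the TLS style from the port (a convention, not a guarantee)."""
    return "implicit-tls" if port == 465 else "starttls"


def open_smtp(host: str, port: int, timeout: float = 10.0) -> smtplib.SMTP:
    """Open an SMTP connection using the mode guessed from the port."""
    if connection_mode(port) == "implicit-tls":
        # TLS is negotiated before any SMTP command is sent.
        return smtplib.SMTP_SSL(host, port, timeout=timeout)
    conn = smtplib.SMTP(host, port, timeout=timeout)
    conn.starttls()  # upgrade the plain connection to TLS
    return conn


print(connection_mode(465), connection_mode(587))
```

`open_smtp` is not called here because it needs a reachable server; with real settings it could replace the `smtplib.SMTP_SSL(mail_host, 465)` line in the script.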

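2 Configure alertmanager and trigger an alert

After configuration it looks like this:

[Screenshot]
As the next screenshot shows, alertmanager already has an alert:
[Screenshot]
But no email actually arrived, so I looked at alertmanager's log, shown below. It kept reporting wrong host name, although I checked the configuration file and it really was not misconfigured.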

level=info ts=2022-03-23T02:16:37.525994901Z caller=main.go:155 msg="Starting Alertmanager" version="(version=0.11.0, branch=HEAD, revision=30dd0426c08b6479d9a26259ea5efd63bc1ee273)"
level=info ts=2022-03-23T02:16:37.527762474Z caller=main.go:156 build_context="(go=go1.9.2, user=root@3e103e3fc918, date=20171116-17:43:56)"
level=info ts=2022-03-23T02:16:37.539722557Z caller=main.go:293 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2022-03-23T02:16:37.545130043Z caller=main.go:368 msg=Listening address=:9093
level=error ts=2022-03-23T02:17:30.639020284Z caller=notify.go:302 component=dispatcher msg="Error on notify" err="*smtp.plainAuth failed: wrong host name"
level=error ts=2022-03-23T02:17:30.639083975Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="*smtp.plainAuth failed: wrong host name"
level=error ts=2022-03-23T02:17:40.639237563Z caller=notify.go:302 component=dispatcher msg="Error on notify" err="*smtp.plainAuth failed: wrong host name"
level=error ts=2022-03-23T02:17:40.639277826Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="*smtp.plainAuth failed: wrong host name"
level=error ts=2022-03-23T02:17:50.639425498Z caller=notify.go:302 component=dispatcher msg="Error on notify" err="*smtp.plainAuth failed: wrong host name"
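
Since the same error repeats every ten seconds, it can help to collapse the log into distinct error messages before searching for them. The following is a minimal sketch, assuming the log lines are in the key=value style shown above; `parse_logfmt` and `summarize_errors` are hypothetical helpers, not part of alertmanager.

```python
import re
from collections import Counter

# key=value pairs where the value is either "quoted" or a bare token
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> dict:
    """Turn one log line into a dict, stripping surrounding quotes."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def summarize_errors(lines) -> Counter:
    """Count distinct err= messages among level=error lines."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") == "error" and "err" in fields:
            counts[fields["err"]] += 1
    return counts


# Three lines copied from the log above.
sample = [
    'level=info ts=2022-03-23T02:16:37.545130043Z caller=main.go:368 msg=Listening address=:9093',
    'level=error ts=2022-03-23T02:17:30.639020284Z caller=notify.go:302 component=dispatcher msg="Error on notify" err="*smtp.plainAuth failed: wrong host name"',
    'level=error ts=2022-03-23T02:17:30.639083975Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="*smtp.plainAuth failed: wrong host name"',
]
for message, count in summarize_errors(sample).items():
    print(count, message)
```

In practice the lines would come from `kubectl logs` output saved to a file instead of the inline `sample` list.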

3 Fixing smtp.plainAuth failed: wrong host name

I found a fix in this blog post (https://blog.csdn.net/qq_22543991/article/details/88356928), so I followed it and upgraded my image to v0.16.1.

[Screenshot]
Checking alertmanager again revealed a new problem: it reported that connecting to 127.0.0.1:5001 failed, but my deployment has no port 5001 at all.

level=info ts=2022-03-23T02:34:41.389240548Z caller=main.go:177 msg="Starting Alertmanager" version="(version=0.16.1, branch=HEAD, revision=571caec278be1f0dbadfdf5effd0bbea16562cfc)"
level=info ts=2022-03-23T02:34:41.389681827Z caller=main.go:178 build_context="(go=go1.11.5, user=root@3000aa3a06c5, date=20190131-15:05:40)"
level=info ts=2022-03-23T02:34:41.394313005Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=192.168.0.180 port=9094
level=info ts=2022-03-23T02:34:41.487475426Z caller=cluster.go:632 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2022-03-23T02:34:41.589155367Z caller=main.go:334 msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2022-03-23T02:34:41.592776954Z caller=main.go:428 msg=Listening address=:9093
level=info ts=2022-03-23T02:34:43.487799541Z caller=cluster.go:657 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000116631s
level=info ts=2022-03-23T02:34:51.488389096Z caller=cluster.go:649 component=cluster msg="gossip settled; proceeding" elapsed=10.00071002s
level=error ts=2022-03-23T02:35:30.636722077Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="Post http://127.0.0.1:5001/: dial tcp 127.0.0.1:5001: connect: connection refused"
level=error ts=2022-03-23T02:35:30.636795707Z caller=dispatch.go:177 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post http://127.0.0.1:5001/: dial tcp 127.0.0.1:5001: connect: connection refused"
level=error ts=2022-03-23T02:35:40.636944614Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="Post http://127.0.0.1:5001/: dial tcp 127.0.0.1:5001: connect: connection refused"
level=error ts=2022-03-23T02:35:40.637012969Z caller=dispatch.go:177 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post http://127.0.0.1:5001/: dial tcp 127.0.0.1:5001: connect: connection refused"

The ports listening inside the container are shown below: 9093 and 9094

[root@xxxxx ~]# kubectl -n monitoring exec alertmanager-7d5c68df6f-4dzl7 -it -- sh
/alertmanager $ netstat -anpt | grep LISTEN
tcp        0      0 :::9093                 :::*                    LISTEN      1/alertmanager
tcp        0      0 :::9094                 :::*                    LISTEN      1/alertmanager

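So I suspected that the newly swapped-in image uses a different configuration file, and that this file contains port 5001.

4 Fixing dial tcp 127.0.0.1:5001: connect: connection refused

First, check which configuration file my own deployment sets up. As shown below, it is /etc/alertmanager/config.yml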

        volumeMounts:
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
        - mountPath: /etc/alertmanager/config.yml
          name: alertmanager-conf
          readOnly: true
          subPath: config.yml

Then, on the host running the alertmanager container, inspect that container:

[root@yyyyy ~]# docker ps | grep alertmanager
2d074e0ab797        rancher/prom-alertmanager                             "/bin/alertmanager -…"   3 minutes ago       Up 3 minutes                            k8s_container-0_alertmanager-7d5c68df6f-4dzl7_monitoring_bcc9065c-52d6-4ee0-ab12-9e8bbe2b09a8_0
003d2bba2ebd        cce-pause:3.1                                         "/pause"                 4 minutes ago       Up 4 minutes                            k8s_POD_alertmanager-7d5c68df6f-4dzl7_monitoring_bcc9065c-52d6-4ee0-ab12-9e8bbe2b09a8_0
[root@yyyyy ~]# docker inspect 2d074e0ab797
[
    {
        "Id": "2d074e0ab797ba255fdeedbf68502fc82913ed24bc65704a835da9bd2345ff92",
        "Created": "2022-03-23T02:34:40.824251211Z",
        "Path": "/bin/alertmanager",
        "Args": [
            "--config.file=/etc/alertmanager/alertmanager.yml",
            "--storage.path=/alertmanager"
        ],
        "State": {
            "Status": "running",
            "Running": true,
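...... (rest of the output omitted)

The output above shows that the new image reads /etc/alertmanager/alertmanager.yml. The file names do not match, so the correct configuration was never loaded.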
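
The comparison I did by eye can be sketched as a small check: take the `--config.file` argument from the container's `Args` and see whether any volumeMount targets exactly that path. This is only an illustration. The function names are hypothetical, the inputs are trimmed copies of the output above, and a mount of the whole parent directory would not be detected by this exact-match test.

```python
import json
from typing import Optional


def config_file_from_args(args: list) -> Optional[str]:
    """Return the value of --config.file=... from a container's Args, if any."""
    prefix = "--config.file="
    for arg in args:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None


def is_mounted(path: str, volume_mounts: list) -> bool:
    """True if some volumeMount's mountPath is exactly the given file path."""
    return any(mount.get("mountPath") == path for mount in volume_mounts)


# Trimmed-down copies of the docker inspect output and the volumeMounts above.
inspect_output = json.loads("""
[{"Path": "/bin/alertmanager",
  "Args": ["--config.file=/etc/alertmanager/alertmanager.yml",
           "--storage.path=/alertmanager"]}]
""")
volume_mounts = [
    {"mountPath": "/etc/localtime", "name": "localtime"},
    {"mountPath": "/etc/alertmanager/config.yml", "name": "alertmanager-conf",
     "subPath": "config.yml"},
]

wanted = config_file_from_args(inspect_output[0]["Args"])
print(wanted, "mounted:", is_mounted(wanted, volume_mounts))
```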

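5 Fixing the configuration file mismatch

Since the configuration files do not match, the thing to do is make them match. The approach taken here is to change the configmap that I mount.

Edit the deployment and change the volume mount as follows: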

        volumeMounts:
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
        - mountPath: /etc/alertmanager/alertmanager.yml
          name: alertmanager-conf
          readOnly: true
          subPath: alertmanager.yml

After the change the pod failed to start with the error below, which says: Are you trying to mount a directory onto a file

  Type     Reason                 Age                From                   Message
  ----     ------                 ----               ----                   -------
  Normal   Scheduled              36s                                       Successfully assigned monitoring/alertmanager-5c4664d45f-gq9fn to 172.16.0.216
  Normal   SuccessfulMountVolume  35s (x2 over 36s)  kubelet, 172.16.0.216  Successfully mounted volumes for pod "alertmanager-5c4664d45f-gq9fn_monitoring(b78797dd-7ce8-44fc-b23b-a8bb0c3b5e7e)"
  Warning  FailedStart            35s                kubelet, 172.16.0.216  Error: failed to start container "container-0": Error response from daemon: OCI runtime create failed: container_linux.go:330: starting container process caused "process_linux.go:381: container init caused \"rootfs_linux.go:61: mounting \\\"/mnt/paas/kubernetes/kubelet/pods/b78797dd-7ce8-44fc-b23b-a8bb0c3b5e7e/volume-subpaths/alertmanager-conf/container-0/1\\\" to rootfs \\\"/var/lib/docker/devicemapper/mnt/8f3de32f6add251b9b040c16fc26a86d653c850550f075d05831459b8fb17a83/rootfs\\\" at \\\"/var/lib/docker/devicemapper/mnt/8f3de32f6add251b9b040c16fc26a86d653c850550f075d05831459b8fb17a83/rootfs/etc/alertmanager/alertmanager.yml\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
  Warning  FailedStart            34s                kubelet, 172.16.0.216  Error: failed to start container "container-0": Error response from daemon: OCI runtime create failed: container_linux.go:330: starting container process caused "process_linux.go:381: container init caused \"rootfs_linux.go:61: mounting \\\"/mnt/paas/kubernetes/kubelet/pods/b78797dd-7ce8-44fc-b23b-a8bb0c3b5e7e/volume-subpaths/alertmanager-conf/container-0/1\\\" to rootfs \\\"/var/lib/docker/devicemapper/mnt/c5d9b7e08e9d64bcda975995bf695b2292bf9745e5f5d8644e591dd7ee83fdc1/rootfs\\\" at \\\"/var/lib/docker/devicemapper/mnt/c5d9b7e08e9d64bcda975995bf695b2292bf9745e5f5d8644e591dd7ee83fdc1/rootfs/etc/alertmanager/alertmanager.yml\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
  Normal   Pulled                 20s (x3 over 35s)  kubelet, 172.16.0.216  Container image "rancher/prom-alertmanager:v0.16.1" already present on machine
  Normal   SuccessfulCreate       20s (x3 over 35s)  kubelet, 172.16.0.216  Created container container-0
  Warning  FailedStart            19s                kubelet, 172.16.0.216  Error: failed to start container "container-0": Error response from daemon: OCI runtime create failed: container_linux.go:330: starting container process caused "process_linux.go:381: container init caused \"rootfs_linux.go:61: mounting \\\"/mnt/paas/kubernetes/kubelet/pods/b78797dd-7ce8-44fc-b23b-a8bb0c3b5e7e/volume-subpaths/alertmanager-conf/container-0/1\\\" to rootfs \\\"/var/lib/docker/devicemapper/mnt/04bb9820df0c996854cebfbe4606cc0fc56e9791b83af3183ea5ca70f56374c4/rootfs\\\" at \\\"/var/lib/docker/devicemapper/mnt/04bb9820df0c996854cebfbe4606cc0fc56e9791b83af3183ea5ca70f56374c4/rootfs/etc/alertmanager/alertmanager.yml\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
  Warning  BackOffStart           6s (x2 over 34s)   kubelet, 172.16.0.216  the failed container exited with ExitCode: 127
  Warning  BackOffStart           6s (x2 over 34s)   kubelet, 172.16.0.216  Back-off restarting failed container

A look at alertmanager's configmap showed that the file name there did not match:

apiVersion: v1
data:
  config.yml: |
    global:
      resolve_timeout: 5m
      smtp_from: xxx.yyy@cccc.com
      smtp_smarthost: 'smtp.xxx.com:465'
      smtp_auth_username: xxx.yyy@cccc.com
      smtp_auth_password: smtp_password
      smtp_require_tls: false
    route:
      receiver: email
      group_by:
      - alertname
      - kubernetes_namespace
      - kubernetes_pod
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
    inhibit_rules:

6 Fixing the mismatch between the configmap and the mounted file name

Edit alertmanager's configmap and rename the configuration file to alertmanager.yml

apiVersion: v1
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_from: xxx.yyy@cccc.com
      smtp_smarthost: 'smtp.xxx.com:465'
      smtp_auth_username: aaa.bbb@ccc.com
      smtp_auth_password: smtp_password
      smtp_require_tls: false
    route:
      receiver: email
      group_by:
      - alertname
      - kubernetes_namespace
      - kubernetes_pod
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
    inhibit_rules:

Restart the alertmanager pod

It failed again, with the messages below. It turned out that the volumes section of the yaml had not been updated.

  Warning  FailedMount  25s                  kubelet, 172.16.0.216  Unable to attach or mount volumes: unmounted volumes=[alertmanager-conf], unattached volumes=[localtime alertmanager-conf default-token-z4nz6]: timed out waiting for the condition
  Warning  FailedMount  20s (x9 over 2m28s)  kubelet, 172.16.0.216  MountVolume.SetUp failed for volume "alertmanager-conf" : configmap references non-existent config key: config.yml
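
Both of the last two failures come from names that have to agree in three places: the key under the configmap's `data`, the key referenced in the deployment's `volumes`, and the `subPath` in `volumeMounts`. Below is a rough sketch of a consistency check. The function name is hypothetical, and the `items` entries with `key` and `path` are my assumption about how the volumes section is written, since that part of the yaml is not shown in this article.

```python
def check_configmap_wiring(data_keys, volume_items, sub_paths):
    """Return a list of human-readable problems; empty means consistent.

    data_keys    -- keys under the ConfigMap's `data`
    volume_items -- list of {"key": ..., "path": ...} (assumed volumes layout)
    sub_paths    -- subPath values used by volumeMounts of that volume
    """
    problems = []
    for item in volume_items:
        if item["key"] not in data_keys:
            problems.append(f"volume item references missing key: {item['key']}")
    # With no explicit items, every data key is projected under its own name.
    projected = {item["path"] for item in volume_items} if volume_items else set(data_keys)
    for sub in sub_paths:
        if sub not in projected:
            problems.append(f"subPath not present in volume: {sub}")
    return problems


# State after step 5: only the subPath was renamed.
print(check_configmap_wiring(
    {"config.yml"},
    [{"key": "config.yml", "path": "config.yml"}],
    ["alertmanager.yml"],
))
# State once all three places use alertmanager.yml.
print(check_configmap_wiring(
    {"alertmanager.yml"},
    [{"key": "alertmanager.yml", "path": "alertmanager.yml"}],
    ["alertmanager.yml"],
))
```

The first call reports the subPath mismatch that matches the "not a directory" failure in step 5; the second returns an empty list.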

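After fixing the volumes section, restart alertmanager once more.

Once the change was applied, the pod showed Running
[Screenshot]
The alert on the alertmanager page:
[Screenshot]

The alert email arrived as well
[Screenshot]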