KubeSphere Cluster Monitoring 502 Errors, a Troubleshooting Log: Prometheus Endpoint Failure and Reinstallation with WhizardTelemetry

Article link: https://blog.csdn.net/gs80140/article/details/149742848

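Background

In the KubeSphere platform, WhizardTelemetry is the observability platform extension released by the official team. It provides cloud-native monitoring from a multi-tenant perspective, integrating modules such as Prometheus and the observability center, and it displays real-time and historical data for the core metrics of multiple clusters, nodes, workloads, and the Kubernetes control plane.

In a recent production incident, the monitoring API returned a large number of 502 Bad Gateway errors. The resource monitoring pages could not load any metrics, which hampered operations and failure analysis. This article records the full troubleshooting process, the key script, and the final fix, as a reference for similar problems.

Symptoms

Calling the monitoring API: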

/kapis/monitoring.kubesphere.io/v1beta1/cluster_metrics

returned:

"error": "server_error: server error: 502"

The KubeSphere API Server log showed:

Error while proxying request: dial tcp 10.233.2.51:9090: i/o timeout

This shows that the API Server could not reach Prometheus on port 9090 through the Service proxy.
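To tell whether the address in that log line is the Service's ClusterIP or a Pod IP, the dial target can be pulled out of the message and compared with the Service. The following is a minimal sketch: the helper name `extract_dial_target` is hypothetical, and `deploy/ks-apiserver` is an assumption about how the API Server is deployed in your cluster.

```shell
# Hypothetical helper: print the "IP:PORT" that follows "dial tcp"
# in each input line; lines without a dial target are dropped.
extract_dial_target() {
  sed -n 's/.*dial tcp \([0-9.]*:[0-9]*\).*/\1/p'
}

# Usage against a live cluster (assumed names, adjust as needed):
#   kubectl logs -n kubesphere-system deploy/ks-apiserver --tail=200 \
#     | extract_dial_target | sort -u
#   kubectl get svc prometheus-k8s -n kubesphere-monitoring-system \
#     -o jsonpath='{.spec.clusterIP}'
```

If the two addresses match, the request reached the Service but nothing behind it answered, which points at the Endpoints rather than at the API Server itself.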

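Quick Troubleshooting Script

To troubleshoot efficiently, I wrote a one-shot check script that automatically inspects the following:

Whether the Prometheus Pods exist, and their status
Whether the Service and its Endpoints match
Whether the API Server can connect to Prometheus directly
The status of the CNI network plugin
Resource usage and key errors in the logs

Example script: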

#!/bin/bash
# One-shot, read-only health check for the KubeSphere monitoring path.
NAMESPACE="kubesphere-monitoring-system"
PROM_SVC="prometheus-k8s"
KS_API_NAMESPACE="kubesphere-system"

echo "=== 1. Check Prometheus Pods ==="
kubectl get pods -n "$NAMESPACE" -o wide | grep prometheus

echo "=== 2. Check Service and Endpoints ==="
kubectl get svc -n "$NAMESPACE" | grep "$PROM_SVC"
kubectl get endpoints -n "$NAMESPACE" "$PROM_SVC" -o wide

echo "=== 3. Check ks-apiserver Pod status ==="
kubectl get pods -n "$KS_API_NAMESPACE" -o wide | grep ks-apiserver

# First endpoint IP; stays empty when the Service has no endpoints.
PROM_IP=$(kubectl get endpoints -n "$NAMESPACE" "$PROM_SVC" -o jsonpath='{.subsets[0].addresses[0].ip}' 2>/dev/null)
KS_API_POD=$(kubectl get pods -n "$KS_API_NAMESPACE" -o name | grep ks-apiserver | head -n1)

echo "=== 4. Test connectivity from the API Server to Prometheus ==="
if [ -n "$PROM_IP" ] && [ -n "$KS_API_POD" ]; then
  # Assumes curl is available inside the ks-apiserver container.
  kubectl exec -n "$KS_API_NAMESPACE" "$KS_API_POD" -- curl -s --max-time 5 "http://$PROM_IP:9090/-/ready" || echo "❌ Cannot reach Prometheus"
else
  echo "❌ Skipped: no Prometheus endpoint IP or no ks-apiserver Pod found"
fi

echo "=== 5. Check CNI plugin status ==="
kubectl get pods -n kube-system | grep -E 'calico|flannel|cilium'

echo "=== 6. Check resource usage ==="
# Distinguish "kubectl top failed" from "no Prometheus Pod in the output".
if TOP_OUT=$(kubectl top pods -n "$NAMESPACE" 2>/dev/null); then
  echo "$TOP_OUT" | grep prometheus || echo "❌ No Prometheus Pods in the metrics output"
else
  echo "❌ kubectl top failed (metrics-server may not be installed)"
fi

echo "=== 7. Show key errors in the Prometheus logs ==="
PROM_POD=$(kubectl get pods -n "$NAMESPACE" -o name | grep prometheus | head -n1)
if [ -n "$PROM_POD" ]; then
  kubectl logs -n "$NAMESPACE" "$PROM_POD" --tail=50 | grep -iE "error|fail|timeout"
else
  echo "❌ No Prometheus Pod found"
fi

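Running the script quickly surfaced the core problems.

Troubleshooting Process

1. Prometheus Endpoints are empty

The prometheus-k8s Service had no endpoints at all
This means no Pod matched the Service, so the API Server's proxied requests timed out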
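An empty Endpoints object usually means the Service's label selector matches no ready Pod, so it helps to print the selector and list whatever it matches. A minimal sketch follows; the function names are hypothetical, and the go-template simply walks `.spec.selector` of the Service.

```shell
# Print a Service's selector as "k1=v1,k2=v2" (empty if it has none).
service_selector() {  # usage: service_selector <namespace> <service>
  kubectl get svc "$2" -n "$1" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' \
    | sed 's/,$//'
}

# List the Pods that the selector matches, together with their labels.
pods_matching_service() {  # usage: pods_matching_service <namespace> <service>
  local sel
  sel=$(service_selector "$1" "$2")
  if [ -z "$sel" ]; then
    echo "Service $1/$2 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$1" -l "$sel" --show-labels
}

# Usage:
#   pods_matching_service kubesphere-monitoring-system prometheus-k8s
```

If the second command lists no Pods, the Prometheus workload itself is missing from that namespace, which is what the next two findings explain.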

2. Prometheus Operator logs report a permission error

cannot list resource "nodes" in API group "" at the cluster scope

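This shows that the prometheus-operator ServiceAccount lacked the required permissions and could not sync node information, so the Prometheus instance could not be created properly.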
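The permission gap can be confirmed without reading logs by asking the API server directly through impersonation. This is a small sketch using `kubectl auth can-i`; the wrapper name `sa_can_i` is hypothetical, and the namespace in the usage line is an assumption about where the operator's ServiceAccount lives.

```shell
# Ask the API server whether a ServiceAccount may perform <verb> on <resource>.
# kubectl prints "yes" or "no" and also reports a denial via its exit status.
sa_can_i() {  # usage: sa_can_i <namespace> <serviceaccount> <verb> <resource>
  kubectl auth can-i "$3" "$4" --as="system:serviceaccount:$1:$2"
}

# Usage (assumed namespace, adjust to your installation):
#   sa_can_i kubesphere-monitoring-system prometheus-operator list nodes
```

The caller needs impersonation rights for `--as` to work, so this is typically run with a cluster-admin kubeconfig.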

3. The Prometheus CR is in the wrong namespace

kubectl get prometheus -A

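Result:

A Prometheus CR named k8s existed in the monitoring namespace
But the kubesphere-monitoring-system namespace had none
The Service/Endpoints logic is based on kubesphere-monitoring-system, so the Service could not bind to any Pod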
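The namespace mismatch above can be flagged automatically by filtering the list of Prometheus CRs against the namespace the Service lives in. A minimal sketch, assuming a hypothetical filter named `misplaced_crs` that reads "namespace name" pairs on standard input:

```shell
# Read "<namespace> <name>" pairs on stdin and print the entries that are
# NOT in the expected namespace, formatted as "<namespace>/<name>".
misplaced_crs() {  # usage: ... | misplaced_crs <expected-namespace>
  awk -v want="$1" 'NF >= 2 && $1 != want { print $1 "/" $2 }'
}

# Usage:
#   kubectl get prometheus -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
#     | misplaced_crs kubesphere-monitoring-system
```

With the state described in this incident, the filter would report the CR in the monitoring namespace and nothing else.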

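Solution

In the end I decided to uninstall and reinstall the WhizardTelemetry monitoring component (this can also be done by clicking the uninstall button in the KubeSphere console and then reinstalling), which rebuilds the complete monitoring stack (Prometheus + Operator + Service). The core steps:

1. Uninstall the existing WhizardTelemetry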

kubectl delete ns monitoring
kubectl delete ns kubesphere-monitoring-system
# Or uninstall with Helm
helm uninstall whizard-monitoring-agent -n kubesphere-monitoring-system
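Deleting a namespace removes every namespaced object in it, including custom rules and PersistentVolumeClaims, so it may be worth exporting the monitoring custom resources before running the commands above. The sketch below is an optional precaution, not part of the original procedure; the function name and the list of kinds are assumptions based on the Prometheus Operator CRDs.

```shell
# Export selected monitoring custom resources from all namespaces to YAML files.
# Kinds that do not exist in the cluster are skipped with a message on stderr.
backup_monitoring_crs() {  # usage: backup_monitoring_crs <output-dir>
  local dir="$1" kind
  mkdir -p "$dir" || return 1
  for kind in prometheus alertmanager servicemonitor prometheusrule; do
    if kubectl get "$kind" -A -o yaml > "$dir/$kind.yaml" 2>/dev/null; then
      echo "saved $dir/$kind.yaml"
    else
      rm -f "$dir/$kind.yaml"
      echo "skipped $kind (kind not found or not readable)" >&2
    fi
  done
}

# Usage:
#   backup_monitoring_crs ./monitoring-backup
```

The exported files are a reference for comparing the old and new configuration; reapplying them unchanged could reintroduce the namespace mismatch.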

2. Clean up leftover resources

Check for and delete leftover CRDs and ClusterRoleBindings
Confirm that resources related to prometheus-k8s, alertmanager-k8s, and thanos have been cleaned up
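The cleanup checks above can start with a read-only listing of cluster-scoped objects whose names look monitoring-related. A minimal sketch; the filter name `monitoring_leftovers` is hypothetical, and the pattern is a name-based heuristic, so review the output before deleting anything (removing a CRD also removes every custom resource of that kind in the cluster).

```shell
# Keep only resource names that look like monitoring leftovers.
# monitoring.coreos.com is the API group of the Prometheus Operator CRDs.
monitoring_leftovers() {
  grep -E 'monitoring\.coreos\.com|prometheus|alertmanager|thanos'
}

# Usage (read-only):
#   kubectl get crd -o name | monitoring_leftovers
#   kubectl get clusterrole,clusterrolebinding -o name | monitoring_leftovers
```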

3. Reinstall WhizardTelemetry

Reinstall by following the official KubeSphere documentation or by using the Helm chart:

helm repo add kubesphere https://charts.kubesphere.io/main
helm install whizard-monitoring-agent kubesphere/whizard-monitoring-agent -n kubesphere-monitoring-system --create-namespace

4. Verify that monitoring has recovered

Check the Pods:

kubectl get pods -n kubesphere-monitoring-system

Check the Endpoints:

kubectl get endpoints prometheus-k8s -n kubesphere-monitoring-system

The API call returns normally again:

/kapis/monitoring.kubesphere.io/v1beta1/cluster_metrics
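Right after a reinstall the Pods may still be starting, so the Endpoints check can be wrapped in a polling loop instead of being run by hand. This is a minimal sketch with a hypothetical function name; the default timeout and interval are arbitrary choices.

```shell
# Poll until the Service has at least one ready endpoint address.
# Returns 0 on success, 1 if none appears within <timeout> seconds.
wait_for_endpoints() {  # usage: wait_for_endpoints <namespace> <service> [timeout] [interval]
  local ns="$1" svc="$2" timeout="${3:-120}" interval="${4:-5}" waited=0 ips
  while :; do
    ips=$(kubectl get endpoints "$svc" -n "$ns" \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
    if [ -n "$ips" ]; then
      echo "ready: $ips"
      return 0
    fi
    [ "$waited" -ge "$timeout" ] && break
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "no ready endpoints for $ns/$svc after ${timeout}s" >&2
  return 1
}

# Usage:
#   wait_for_endpoints kubesphere-monitoring-system prometheus-k8s 300 10
```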

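Lessons Learned

Namespace consistency is critical: the Prometheus CR and the Service must be in the same namespace, otherwise the Endpoints stay empty.
Check permissions first: the Prometheus Operator needs cluster-level permissions such as list nodes, otherwise it cannot create its target objects.
The one-shot check script is reusable: an automated check script makes it quick to locate faults along the monitoring path.
Reinstalling can beat manual repair: when the monitoring component's resources are inconsistent and namespaces are mixed up, uninstalling and reinstalling is often the fastest fix.

Conclusion

As a key component of the KubeSphere observability platform, WhizardTelemetry relies on several components working together (Prometheus, Alertmanager, Thanos, and others). When the monitoring API returns 502 errors, first determine whether the cause is a network/Service problem or a resource/permission problem, and if necessary restore consistency by uninstalling and reinstalling.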