Integrating k6 into a Complete Monitoring System
In modern performance-testing practice, deep integration of test data with a monitoring system has become standard. A well-built monitoring system not only displays test metrics in real time but also supports historical trend analysis, anomaly detection, and intelligent alerting, and connects seamlessly with the DevOps toolchain. This chapter explores how to build an enterprise-grade k6 monitoring stack, covering architecture design, technology selection, implementation details, and best practices.
Monitoring System Architecture Design
Overall Architecture Considerations
When designing a k6 monitoring system, consider the following key dimensions:
Data flow architecture:
- Collection layer: performance metrics produced by k6 in real time
- Transport layer: an efficient, reliable data transfer mechanism
- Storage layer: a time-series database for historical data
- Presentation layer: visual dashboards and reports
- Alerting layer: rule-based intelligent alerting
Technology selection considerations:
| Option | Strengths | Weaknesses | Best fit |
|---|---|---|---|
| InfluxDB + Grafana | Mature and stable, active community, easy to learn | Limited horizontal scalability | Small/medium-scale testing |
| Prometheus + Grafana | Cloud-native, pull model, strong K8s integration | Costly long-term storage | Containerized environments |
| Datadog | Fully managed, feature-rich, polished UI | Relatively expensive | Enterprise needs |
| CloudWatch | AWS-native, simple integration | Comparatively basic features | AWS environments |
| k6 Cloud | Official support, distributed testing | Depends on a cloud service | Quick start |
Data Model Design
Understanding k6's data model is essential for building an efficient monitoring system. The core metrics k6 produces include:
HTTP metrics:
- http_reqs: total requests (Counter)
- http_req_duration: request duration (Trend)
- http_req_blocked: time spent blocked waiting for a connection (Trend)
- http_req_connecting: TCP connection establishment time (Trend)
- http_req_tls_handshaking: TLS handshake time (Trend)
- http_req_sending: time spent sending the request (Trend)
- http_req_waiting: time spent waiting for the response (Trend)
- http_req_receiving: time spent receiving the response (Trend)
- http_req_failed: failure rate (Rate)
WebSocket metrics:
- ws_connecting: WebSocket connection time
- ws_sessions: active sessions
- ws_msgs_sent: messages sent
- ws_msgs_received: messages received
Custom metrics:
- Counter: cumulative count
- Gauge: instantaneous value
- Rate: ratio/percentage
- Trend: statistical distribution (mean, median, P95, P99, etc.)
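The aggregation semantics of the four metric types can be illustrated with a minimal Python sketch (this is not k6 source code, just the behavior described above):

```python
import math
from statistics import mean, median

class Counter:
    """Monotonic sum of added values (e.g. http_reqs, iterations)."""
    def __init__(self):
        self.total = 0
    def add(self, v=1):
        self.total += v

class Gauge:
    """Keeps only the latest sample (e.g. vus, vus_max)."""
    def __init__(self):
        self.value = None
    def add(self, v):
        self.value = v

class Rate:
    """Fraction of truthy samples (e.g. http_req_failed, checks)."""
    def __init__(self):
        self.trues = 0
        self.count = 0
    def add(self, ok):
        self.count += 1
        self.trues += 1 if ok else 0
    @property
    def rate(self):
        return self.trues / self.count

class Trend:
    """Keeps samples and reports distribution stats (e.g. http_req_duration)."""
    def __init__(self):
        self.samples = []
    def add(self, v):
        self.samples.append(v)
    def percentile(self, p):
        # nearest-rank percentile over all recorded samples
        s = sorted(self.samples)
        return s[max(0, math.ceil(p / 100 * len(s)) - 1)]
    def stats(self):
        return {"avg": mean(self.samples), "med": median(self.samples),
                "min": min(self.samples), "max": max(self.samples),
                "p95": self.percentile(95), "p99": self.percentile(99)}
```

Counters and Gauges can be aggregated losslessly downstream; Trends require the raw samples (or sketches) to compute percentiles, which is why storage design matters.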
InfluxDB Deep Integration
InfluxDB is a database purpose-built for time-series data; its columnar storage and efficient compression make it a natural fit for performance-test data.
Setting Up InfluxDB
Production-grade deployment (Docker Compose):
version: '3.8'
services:
  influxdb:
    image: influxdb:2.7
    container_name: k6-influxdb
    ports:
      - "8086:8086"
    environment:
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=admin123456
      - DOCKER_INFLUXDB_INIT_ORG=k6-org
      - DOCKER_INFLUXDB_INIT_BUCKET=k6
      - DOCKER_INFLUXDB_INIT_RETENTION=30d
      - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=my-super-secret-auth-token
    volumes:
      - influxdb-data:/var/lib/influxdb2
      - influxdb-config:/etc/influxdb2
    networks:
      - monitoring
    restart: unless-stopped
volumes:
  influxdb-data:
  influxdb-config:
networks:
  monitoring:
    driver: bridge
Start the services:
docker-compose up -d
k6 Integration
Method 1: command-line flags:
# InfluxDB v1.x (built-in output)
k6 run --out influxdb=http://localhost:8086/k6 script.js
# InfluxDB v2.x (requires a k6 binary built with the xk6-output-influxdb
# extension; note that the output reads its settings from process
# environment variables, not from k6's -e flag, which only sets
# script-level __ENV values)
K6_INFLUXDB_ORGANIZATION=k6-org \
K6_INFLUXDB_BUCKET=k6 \
K6_INFLUXDB_TOKEN=my-super-secret-auth-token \
k6 run --out xk6-influxdb=http://localhost:8086 script.js
Method 2: environment variables:
# Create a .env file
cat > .env << EOF
K6_INFLUXDB_ORGANIZATION=k6-org
K6_INFLUXDB_BUCKET=k6
K6_INFLUXDB_TOKEN=my-super-secret-auth-token
K6_INFLUXDB_INSECURE=false
K6_INFLUXDB_PUSH_INTERVAL=1s
K6_INFLUXDB_CONCURRENT_WRITES=10
EOF
# Load the variables and run (the K6_INFLUXDB_* variables are read by the
# xk6-output-influxdb extension)
export $(cat .env | xargs)
k6 run --out xk6-influxdb=http://localhost:8086 script.js
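Under the hood, the output writes each sample to InfluxDB in line protocol. A small Python sketch of the encoding (simplified: it skips the escaping rules for spaces and commas in tag values):

```python
def to_line_protocol(measurement, tags, value, ts_ns):
    """Encode one sample as InfluxDB line protocol:
    measurement,tag1=v1,tag2=v2 value=<float> <timestamp_ns>"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    head = measurement if not tag_str else f"{measurement},{tag_str}"
    return f"{head} value={value} {ts_ns}"
```

Every distinct tag combination becomes its own series, which is why high-cardinality tags (such as unique URLs) inflate the database so quickly.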
Method 3: in-script configuration (recommended):
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Counter, Rate, Trend } from 'k6/metrics';
import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';
// Custom metrics
const apiErrors = new Counter('api_errors');
const apiLatency = new Trend('api_latency', true);
const successRate = new Rate('success_rate');
export const options = {
  // Load profile
  stages: [
    { duration: '1m', target: 50 },
    { duration: '3m', target: 50 },
    { duration: '1m', target: 0 },
  ],
  // k6 Cloud project settings (note: this is cloud configuration, not
  // InfluxDB output configuration; the InfluxDB output is configured via
  // CLI flags and environment variables as shown above)
  ext: {
    loadimpact: {
      projectID: 3481195,
      name: 'API Performance Test'
    }
  },
  // Thresholds
  thresholds: {
    http_req_duration: ['p(95)<500', 'p(99)<1000'],
    http_req_failed: ['rate<0.01'],
    api_latency: ['avg<300', 'p(95)<500'],
    success_rate: ['rate>0.99'],
  },
  // Global tags
  tags: {
    environment: __ENV.ENVIRONMENT || 'staging',
    team: 'platform',
    service: 'user-api',
  },
};
export default function () {
  const startTime = Date.now();
  // Issue the API request
  const response = http.get('https://api.example.com/users', {
    tags: {
      endpoint: '/users',
      method: 'GET',
    },
  });
  const duration = Date.now() - startTime;
  // Record custom metrics
  apiLatency.add(duration);
  // Verify the response
  const success = check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': () => duration < 500,
    'body has data': (r) => r.json('data') !== null,
  });
  successRate.add(success);
  if (!success) {
    apiErrors.add(1);
  }
  sleep(1);
}
// End-of-test summary hook
export function handleSummary(data) {
  return {
    'stdout': textSummary(data, { indent: ' ', enableColors: true }),
    'summary.json': JSON.stringify(data),
  };
}
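The threshold strings above ('p(95)<500', 'rate>0.99', ...) are simple comparison expressions over precomputed aggregates. A hypothetical evaluator sketching that semantics (helper name and stats-dict shape are my own, not k6's internals):

```python
import re

def evaluate_threshold(expr, stats):
    """Evaluate a k6-style threshold string, e.g. 'p(95)<500' or 'rate>0.99',
    against a dict of precomputed aggregates like {'p(95)': 430, 'avg': 210}."""
    m = re.fullmatch(
        r"\s*([a-z]+(?:\(\d+(?:\.\d+)?\))?)\s*(<=|>=|<|>|==)\s*([\d.]+)\s*",
        expr)
    if not m:
        raise ValueError(f"unsupported threshold: {expr!r}")
    agg, op, limit = m.group(1), m.group(2), float(m.group(3))
    value = stats[agg]
    return {
        "<": value < limit, "<=": value <= limit,
        ">": value > limit, ">=": value >= limit,
        "==": value == limit,
    }[op]
```

A failed threshold is what makes `k6 run` exit non-zero, which is how CI pipelines gate on performance.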
InfluxDB Queries and Analysis
Flux query language (InfluxDB 2.x):
// Average response time over the last hour
from(bucket: "k6")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "http_req_duration")
|> filter(fn: (r) => r._field == "value")
|> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
|> yield(name: "mean")
// P95 response time
from(bucket: "k6")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "http_req_duration")
|> filter(fn: (r) => r._field == "value")
|> aggregateWindow(every: 1m, fn: (column, tables=<-) =>
tables |> quantile(q: 0.95, column: column)
)
|> yield(name: "p95")
// Request counts grouped by tag
from(bucket: "k6")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "http_reqs")
|> group(columns: ["endpoint", "method"])
|> count()
|> yield(name: "requests_by_endpoint")
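What aggregateWindow does can be mirrored in a few lines of Python: bucket each timestamped sample into a fixed window and apply an aggregate per bucket (a sketch of the concept, not Flux's implementation):

```python
from collections import defaultdict

def aggregate_window(points, every_s, fn):
    """points: iterable of (unix_ts, value) pairs.
    Buckets timestamps into windows of `every_s` seconds and applies
    `fn` to each window, like Flux's aggregateWindow."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % every_s].append(v)   # window start timestamp
    return {start: fn(vals) for start, vals in sorted(buckets.items())}
```

For example, a 60-second window with a mean aggregate turns raw samples into the one-point-per-minute series the dashboard plots.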
InfluxQL (InfluxDB 1.x):
-- Average response time
SELECT mean("value") AS "avg_duration"
FROM "http_req_duration"
WHERE time > now() - 1h
GROUP BY time(1m) fill(null)
-- Percentiles
SELECT
percentile("value", 50) AS "p50",
percentile("value", 95) AS "p95",
percentile("value", 99) AS "p99"
FROM "http_req_duration"
WHERE time > now() - 1h
GROUP BY time(1m)
-- Error rate
SELECT
sum("value") AS "failed_requests",
count("value") AS "total_requests",
sum("value") / count("value") * 100 AS "error_rate"
FROM "http_req_failed"
WHERE time > now() - 1h
GROUP BY time(1m)
-- Filter by tags
SELECT mean("value")
FROM "http_req_duration"
WHERE "environment" = 'production'
AND "endpoint" = '/api/users'
AND time > now() - 1h
GROUP BY time(1m)
Data Retention and Performance Optimization
Retention policies and continuous queries below are InfluxDB 1.x features; on InfluxDB 2.x, use bucket retention periods and tasks instead.
Create retention policies:
-- Keep raw data for 7 days
CREATE RETENTION POLICY "k6_raw" ON "k6"
DURATION 7d REPLICATION 1 DEFAULT
-- Keep 1-minute aggregates for 30 days
CREATE RETENTION POLICY "k6_1m" ON "k6"
DURATION 30d REPLICATION 1
-- Keep 1-hour aggregates for 1 year
CREATE RETENTION POLICY "k6_1h" ON "k6"
DURATION 365d REPLICATION 1
Create continuous queries for automatic roll-ups:
-- 1-minute aggregation
CREATE CONTINUOUS QUERY "cq_1m" ON "k6"
BEGIN
SELECT
mean("value") AS "avg",
min("value") AS "min",
max("value") AS "max",
percentile("value", 50) AS "p50",
percentile("value", 95) AS "p95",
percentile("value", 99) AS "p99"
INTO "k6"."k6_1m"."http_req_duration_1m"
FROM "http_req_duration"
GROUP BY time(1m), *
END
-- 1-hour aggregation (percentiles cannot be re-aggregated exactly, so
-- take max() of the per-minute values as a conservative upper bound)
CREATE CONTINUOUS QUERY "cq_1h" ON "k6"
BEGIN
SELECT
mean("avg") AS "avg",
min("min") AS "min",
max("max") AS "max",
max("p95") AS "p95",
max("p99") AS "p99"
INTO "k6"."k6_1h"."http_req_duration_1h"
FROM "k6"."k6_1m"."http_req_duration_1m"
GROUP BY time(1h), *
END
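Why the hourly roll-up cannot simply take a percentile of percentiles: the true hourly P95 depends on the raw samples, which are gone after the first aggregation. A small demonstration (nearest-rank percentile, illustrative values):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile."""
    s = sorted(values)
    return s[max(0, math.ceil(p / 100 * len(s)) - 1)]

def rollup_p95(minute_p95s):
    """Hourly roll-up of per-minute p95 values. The true hourly p95 is
    unrecoverable from the aggregates; max() gives a safe upper bound,
    mean() a rough central estimate."""
    return {"p95_upper_bound": max(minute_p95s),
            "p95_estimate": sum(minute_p95s) / len(minute_p95s)}
```

With one calm minute and one minute of 1000 ms responses, the true combined P95 is 1000 ms; averaging the two per-minute P95s would badly understate it, while max() stays on the safe side.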
Performance tuning parameters (InfluxDB 1.x influxdb.conf):
# influxdb.conf
[data]
# Cache sizing
cache-max-memory-size = "1g"
cache-snapshot-memory-size = "25m"
# Compaction
compact-full-write-cold-duration = "4h"
# Write limits
max-series-per-database = 1000000
max-values-per-tag = 100000
[coordinator]
# Write timeout
write-timeout = "10s"
# Concurrent queries (0 = unlimited)
max-concurrent-queries = 0
# Query timeout (0s = unlimited)
query-timeout = "0s"
# Max points per SELECT (0 = unlimited)
max-select-point = 0
Grafana Advanced Visualization
Grafana is the industry-leading visualization platform, offering a rich set of chart types and powerful query capabilities.
Setting Up Grafana
Production-grade deployment (Docker Compose):
version: '3.8'
services:
  grafana:
    image: grafana/grafana:10.2.0
    container_name: k6-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123456
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      - influxdb
volumes:
  grafana-data:
networks:
  monitoring:
    driver: bridge
Automatic Data Source Provisioning
Create the data source file grafana/provisioning/datasources/influxdb.yaml:
apiVersion: 1
datasources:
  - name: InfluxDB-k6
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    database: k6
    isDefault: true
    editable: true
    # Note: this is InfluxQL-style (v1) configuration. To use it against the
    # InfluxDB 2.x instance above, you need a DBRP mapping plus v1-compatible
    # credentials, or provision a Flux datasource instead (version: Flux,
    # organization, defaultBucket, and a token in secureJsonData).
    jsonData:
      httpMode: GET
      timeInterval: 1s
    secureJsonData:
      password: admin123456
Enterprise-Grade Dashboard Design
A professional performance-testing dashboard should include the following key sections:
1. Overview panel:
{
  "panels": [
    {
      "id": 1,
      "title": "Test Overview",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
      "targets": [
        {
          "query": "SELECT last(\"value\") FROM \"vus\" WHERE $timeFilter",
          "alias": "Virtual Users"
        },
        {
          "query": "SELECT sum(\"value\") FROM \"http_reqs\" WHERE $timeFilter",
          "alias": "Total Requests"
        }
      ],
      "options": {
        "graphMode": "area",
        "colorMode": "background",
        "justifyMode": "center"
      }
    }
  ]
}
2. Performance metrics panel:
{
  "panels": [
    {
      "id": 2,
      "title": "Response Time Trend",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
      "targets": [
        {
          "query": "SELECT mean(\"value\") FROM \"http_req_duration\" WHERE $timeFilter GROUP BY time($__interval) fill(null)",
          "alias": "Average Response Time"
        },
        {
          "query": "SELECT percentile(\"value\", 95) FROM \"http_req_duration\" WHERE $timeFilter GROUP BY time($__interval) fill(null)",
          "alias": "P95 Response Time"
        },
        {
          "query": "SELECT percentile(\"value\", 99) FROM \"http_req_duration\" WHERE $timeFilter GROUP BY time($__interval) fill(null)",
          "alias": "P99 Response Time"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "custom": {
            "lineWidth": 2,
            "fillOpacity": 10
          }
        }
      }
    }
  ]
}
3. Throughput panel (k6 writes http_reqs as per-flush counts rather than a cumulative counter, so sum the values per one-second window instead of taking a derivative):
{
  "panels": [
    {
      "id": 3,
      "title": "Request Rate (RPS)",
      "type": "timeseries",
      "targets": [
        {
          "query": "SELECT sum(\"value\") FROM \"http_reqs\" WHERE $timeFilter GROUP BY time(1s) fill(null)",
          "alias": "Requests per Second"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "color": {"mode": "palette-classic"}
        }
      }
    }
  ]
}
4. Error analysis panel:
{
  "panels": [
    {
      "id": 4,
      "title": "Error Rate",
      "type": "gauge",
      "targets": [
        {
          "query": "SELECT mean(\"value\") * 100 FROM \"http_req_failed\" WHERE $timeFilter",
          "alias": "Error Rate"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 1, "color": "yellow"},
              {"value": 5, "color": "red"}
            ]
          }
        }
      }
    }
  ]
}
5. Performance breakdown panel:
{
  "panels": [
    {
      "id": 5,
      "title": "Request Time Breakdown",
      "type": "timeseries",
      "targets": [
        {
          "query": "SELECT mean(\"value\") FROM \"http_req_blocked\" WHERE $timeFilter GROUP BY time($__interval)",
          "alias": "Blocked (DNS + connection queueing)"
        },
        {
          "query": "SELECT mean(\"value\") FROM \"http_req_connecting\" WHERE $timeFilter GROUP BY time($__interval)",
          "alias": "TCP Connect"
        },
        {
          "query": "SELECT mean(\"value\") FROM \"http_req_tls_handshaking\" WHERE $timeFilter GROUP BY time($__interval)",
          "alias": "TLS Handshake"
        },
        {
          "query": "SELECT mean(\"value\") FROM \"http_req_sending\" WHERE $timeFilter GROUP BY time($__interval)",
          "alias": "Request Sending"
        },
        {
          "query": "SELECT mean(\"value\") FROM \"http_req_waiting\" WHERE $timeFilter GROUP BY time($__interval)",
          "alias": "Waiting for Response"
        },
        {
          "query": "SELECT mean(\"value\") FROM \"http_req_receiving\" WHERE $timeFilter GROUP BY time($__interval)",
          "alias": "Response Receiving"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "custom": {
            "stacking": {"mode": "normal"}
          }
        }
      }
    }
  ]
}
Variables and Templating
Variables make dashboards reusable:
{
  "templating": {
    "list": [
      {
        "name": "environment",
        "type": "query",
        "datasource": "InfluxDB-k6",
        "query": "SHOW TAG VALUES WITH KEY = \"environment\"",
        "multi": false,
        "includeAll": false
      },
      {
        "name": "endpoint",
        "type": "query",
        "datasource": "InfluxDB-k6",
        "query": "SHOW TAG VALUES WITH KEY = \"endpoint\" WHERE \"environment\" = '$environment'",
        "multi": true,
        "includeAll": true
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "10s,30s,1m,5m,10m,30m,1h",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s"
      }
    ]
  }
}
Using the variables in a query:
SELECT mean("value")
FROM "http_req_duration"
WHERE "environment" = '$environment'
AND "endpoint" =~ /^$endpoint$/
AND $timeFilter
GROUP BY time($interval)
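At render time, Grafana substitutes each $var reference into the query text; multi-value variables become a regex alternation for the `=~ /^$var$/` form. A simplified sketch of that interpolation (variable names here are illustrative, not tied to any particular dashboard):

```python
import re

def interpolate(query, variables):
    """Substitute Grafana-style $var references in an InfluxQL template.
    List values become a regex alternation, roughly as Grafana does for
    multi-value variables."""
    def repl(m):
        value = variables[m.group(1)]
        if isinstance(value, (list, tuple)):
            return "(" + "|".join(map(re.escape, value)) + ")"
        return str(value)
    return re.sub(r"\$(\w+)", repl, query)
```

This is also why multi-value variables must be used with the `=~` regex operator rather than plain equality.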
Alert Rule Configuration
Create the alert rules in grafana/provisioning/alerting/rules.yaml:
apiVersion: 1
groups:
  - name: k6-performance-alerts
    interval: 30s
    rules:
      - uid: high-response-time
        title: High response time
        condition: A
        data:
          - refId: A
            queryType: influxdb
            model:
              query: SELECT mean("value") FROM "http_req_duration" WHERE $timeFilter
            datasourceUid: influxdb-k6
            relativeTimeRange:
              from: 300
              to: 0
        noDataState: NoData
        execErrState: Alerting
        for: 2m
        annotations:
          description: "Average response time exceeded 1000ms; current value: {{ $values.A.Value }}"
          summary: "API response time alert"
        labels:
          severity: warning
          team: platform
      - uid: high-error-rate
        title: High error rate
        condition: A
        data:
          - refId: A
            queryType: influxdb
            model:
              query: SELECT mean("value") FROM "http_req_failed" WHERE $timeFilter
            datasourceUid: influxdb-k6
        for: 1m
        annotations:
          description: "Error rate exceeded 1%; current value: {{ $values.A.Value | humanizePercentage }}"
        labels:
          severity: critical
Prometheus Cloud-Native Integration
Prometheus uses a pull model, which fits Kubernetes and other cloud-native environments well; the k6 integration below instead pushes samples over the Remote Write protocol.
Prometheus Architecture and Integration
k6 Prometheus Remote Write
k6 can push metrics to Prometheus via the Remote Write protocol (the Prometheus server must be started with --web.enable-remote-write-receiver):
k6 run --out experimental-prometheus-rw script.js
Environment variables:
export K6_PROMETHEUS_RW_SERVER_URL=http://localhost:9090/api/v1/write
export K6_PROMETHEUS_RW_PUSH_INTERVAL=5s
export K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true
Full example:
import http from 'k6/http';
import { Counter, Trend, Rate } from 'k6/metrics';
// Custom metrics (exported to Prometheus with a k6_ prefix)
const apiRequestsTotal = new Counter('api_requests_total');
const apiRequestDuration = new Trend('api_request_duration_seconds');
const apiErrorRate = new Rate('api_error_rate');
export const options = {
  vus: 50,
  duration: '5m',
  // Note: the Remote Write output has no options.ext configuration block;
  // it is configured entirely through the K6_PROMETHEUS_RW_* environment
  // variables shown above (server URL, push interval, TLS, extra headers).
};
export default function () {
  const response = http.get('https://api.example.com/users');
  // Record metrics (tag values must be strings)
  apiRequestsTotal.add(1, {
    method: 'GET',
    endpoint: '/users',
    status: String(response.status),
  });
  apiRequestDuration.add(response.timings.duration / 1000, {
    endpoint: '/users',
  });
  apiErrorRate.add(response.status >= 400, {
    endpoint: '/users',
  });
}
Prometheus Query Language (PromQL)
Basic queries:
# Request rate (QPS)
rate(k6_http_reqs_total[5m])
# P95 response time
histogram_quantile(0.95, rate(k6_http_req_duration_seconds_bucket[5m]))
# Error rate
sum(rate(k6_http_req_failed_total[5m])) / sum(rate(k6_http_reqs_total[5m]))
# Grouped by endpoint
sum by (endpoint) (rate(k6_http_reqs_total[5m]))
Advanced queries:
# Success rate
(1 - (
sum(rate(k6_http_req_failed_total[5m]))
/
sum(rate(k6_http_reqs_total[5m]))
)) * 100
# Response-time anomaly detection (3-sigma rule, using the recording
# rules defined in the next section)
k6_http_req_duration_seconds:p95 >
(
avg_over_time(k6_http_req_duration_seconds:p95[1h]) +
3 * stddev_over_time(k6_http_req_duration_seconds:p95[1h])
)
# Linear trend prediction one hour ahead
predict_linear(k6_http_req_duration_seconds:p95[30m], 3600)
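The 3-sigma rule in that PromQL expression is plain statistics: flag the current value if it exceeds the historical mean by more than three standard deviations. A minimal sketch:

```python
from statistics import mean, stdev

def is_anomalous(history, current, k=3.0):
    """3-sigma anomaly check: True if `current` exceeds
    mean(history) + k * stddev(history)."""
    return current > mean(history) + k * stdev(history)
```

For a stable series this keeps the false-positive rate very low, but it assumes roughly normal, stationary data; load tests with deliberate ramp-ups will trip it unless the history window is chosen carefully.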
Prometheus Alert Rules
Create prometheus-rules.yml:
groups:
  - name: k6_performance
    interval: 30s
    rules:
      # Recording rules: precompute common expressions
      - record: k6_http_req_duration_seconds:p95
        expr: histogram_quantile(0.95, rate(k6_http_req_duration_seconds_bucket[5m]))
      - record: k6_http_req_duration_seconds:p99
        expr: histogram_quantile(0.99, rate(k6_http_req_duration_seconds_bucket[5m]))
      - record: k6_http_error_rate
        expr: |
          sum(rate(k6_http_req_failed_total[5m]))
          /
          sum(rate(k6_http_reqs_total[5m]))
      # Alerting rules
      - alert: HighResponseTime
        expr: k6_http_req_duration_seconds:p95 > 1
        for: 2m
        labels:
          severity: warning
          component: api
        annotations:
          summary: "High API response time"
          description: "P95 response time {{ $value | humanizeDuration }} exceeds 1 second"
          dashboard: "http://grafana:3000/d/k6-performance"
      - alert: HighErrorRate
        expr: k6_http_error_rate > 0.01
        for: 1m
        labels:
          severity: critical
          component: api
        annotations:
          summary: "High API error rate"
          description: "Error rate {{ $value | humanizePercentage }} exceeds 1%"
          runbook: "https://wiki.company.com/runbooks/high-error-rate"
      - alert: ThroughputDrop
        expr: |
          (
            rate(k6_http_reqs_total[5m])
            <
            0.8 * avg_over_time(rate(k6_http_reqs_total[5m])[30m:5m])
          )
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Significant throughput drop"
          description: "Current QPS {{ $value | humanize }} is below 80% of the recent average"
AlertManager Configuration
Create alertmanager.yml:
global:
  resolve_timeout: 5m
# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'
# Routing
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    # Critical alerts fire immediately
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 0s
      repeat_interval: 5m
    # Warning-level alerts
    - match:
        severity: warning
      receiver: 'warning-alerts'
      repeat_interval: 1h
# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alertmanager-webhook:5001/'
  - name: 'critical-alerts'
    # DingTalk webhook
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true
    # PagerDuty
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}'
    # Slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR_WEBHOOK'
        channel: '#alerts-critical'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR_WEBHOOK'
        channel: '#alerts-warning'
# Inhibition: suppress warnings while a matching critical alert is firing
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
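The routing tree above is essentially first-match dispatch on alert labels. A simplified sketch of the idea (ignoring nesting, grouping, and inhibition):

```python
def route_alert(alert, routes, default="default"):
    """Pick the first route whose `match` labels are all present on the
    alert, mirroring AlertManager's first-match routing."""
    for route in routes:
        if all(alert.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default

# Routes mirroring the configuration above
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "critical-alerts"},
    {"match": {"severity": "warning"}, "receiver": "warning-alerts"},
]
```

Order matters: putting the critical route first guarantees a critical alert never falls through to the slower warning channel.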
Enterprise Cloud Platform Integration
Datadog Enterprise Monitoring
Datadog offers full-stack monitoring and suits organizations that want a single, unified platform.
Install the Datadog Agent:
DD_API_KEY=your_api_key DD_SITE="datadoghq.com" \
bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
k6 configuration:
# Use the StatsD output (provided by the xk6-output-statsd extension in
# recent k6 releases)
k6 run --out statsd script.js
# Environment variables
export K6_STATSD_ADDR=localhost:8125
export K6_STATSD_NAMESPACE=k6
export K6_STATSD_PUSH_INTERVAL=1s
export K6_STATSD_BUFFER_SIZE=20
export K6_STATSD_ENABLE_TAGS=true
Advanced script configuration:
import http from 'k6/http';
import { check } from 'k6';
export const options = {
  vus: 100,
  duration: '10m',
  // k6 Cloud options (unrelated to the StatsD/Datadog output)
  ext: {
    loadimpact: {
      projectID: 3481195,
      name: 'Production Load Test',
      distribution: {
        'amazon:us:ashburn': { loadZone: 'amazon:us:ashburn', percent: 50 },
        'amazon:ie:dublin': { loadZone: 'amazon:ie:dublin', percent: 50 },
      },
    },
  },
  thresholds: {
    http_req_duration: [
      { threshold: 'p(95)<500', abortOnFail: false },
      { threshold: 'p(99)<1000', abortOnFail: true },
    ],
    http_req_failed: ['rate<0.01'],
  },
};
export default function () {
  const response = http.get('https://api.example.com/health', {
    tags: {
      name: 'HealthCheck',
      endpoint: '/health',
      dd_service: 'user-api',
      dd_env: 'production',
      dd_version: '1.2.3',
    },
  });
  check(response, {
    'is status 200': (r) => r.status === 200,
  }, {
    dd_check: 'api_health',
  });
}
Datadog dashboard provisioning (via the API, Python):
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.dashboards_api import DashboardsApi
from datadog_api_client.v1.model.dashboard import Dashboard

configuration = Configuration()
configuration.api_key['apiKeyAuth'] = 'YOUR_API_KEY'
configuration.api_key['appKeyAuth'] = 'YOUR_APP_KEY'
dashboard = Dashboard(
    title='k6 Performance Dashboard',
    widgets=[
        {
            'definition': {
                'type': 'timeseries',
                'requests': [
                    {
                        'q': 'avg:k6.http_req_duration{*} by {endpoint}',
                        'display_type': 'line',
                    }
                ],
                'title': 'Response Time by Endpoint',
            }
        },
        {
            'definition': {
                'type': 'query_value',
                'requests': [
                    {
                        'q': 'avg:k6.http_req_failed{*}',
                        'aggregator': 'avg',
                    }
                ],
                'title': 'Error Rate',
                'precision': 2,
            }
        },
    ],
    layout_type='ordered',
)
with ApiClient(configuration) as api_client:
    api_instance = DashboardsApi(api_client)
    response = api_instance.create_dashboard(body=dashboard)
    print(response)
AWS CloudWatch Integration
For workloads deployed on AWS, CloudWatch is the natural choice.
k6 CloudWatch output (note: the CloudWatchClient import below assumes a k6 binary built with a community xk6 AWS extension; it is not part of k6 core or the official jslib):
import http from 'k6/http';
import { AWSConfig, CloudWatchClient } from 'k6/x/aws';

const awsConfig = new AWSConfig({
  region: __ENV.AWS_REGION || 'us-east-1',
  accessKeyId: __ENV.AWS_ACCESS_KEY_ID,
  secretAccessKey: __ENV.AWS_SECRET_ACCESS_KEY,
});
const cloudwatch = new CloudWatchClient(awsConfig);
export const options = {
  vus: 50,
  duration: '5m',
};
export default function () {
  const response = http.get('https://api.example.com/users');
  // Push custom metrics to CloudWatch
  cloudwatch.putMetricData({
    namespace: 'K6/PerformanceTests',
    metricData: [
      {
        metricName: 'ResponseTime',
        value: response.timings.duration,
        unit: 'Milliseconds',
        dimensions: [
          { name: 'Endpoint', value: '/users' },
          { name: 'Environment', value: 'production' },
        ],
        timestamp: new Date(),
      },
      {
        metricName: 'RequestSuccess',
        value: response.status === 200 ? 1 : 0,
        unit: 'Count',
        dimensions: [
          { name: 'Endpoint', value: '/users' },
        ],
      },
    ],
  });
}
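One practical wrinkle when pushing metrics yourself: PutMetricData accepts only a limited number of MetricDatum entries per call (historically 20; AWS has since raised the limit, so check current quotas), so high-volume emitters should batch. A small chunking helper:

```python
def chunk_metric_data(metric_data, batch_size=20):
    """Split a list of metric datums into PutMetricData-sized batches.
    batch_size=20 reflects the historical per-call limit; adjust to the
    current AWS quota for your account."""
    return [metric_data[i:i + batch_size]
            for i in range(0, len(metric_data), batch_size)]
```

Each resulting batch would then be passed to one put_metric_data call (e.g. via boto3 on the consumer side).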
CloudWatch Logs Insights queries (these run against log groups, so they assume the same samples are also written to logs):
# Average response time
fields @timestamp, ResponseTime
| filter MetricName = "ResponseTime"
| stats avg(ResponseTime) as AvgResponseTime by bin(5m)
# Error rate
fields @timestamp, RequestSuccess
| filter MetricName = "RequestSuccess"
| stats sum(RequestSuccess) as SuccessCount, count(*) as TotalCount
by bin(5m)
| fields (TotalCount - SuccessCount) / TotalCount * 100 as ErrorRate
# Slowest requests
fields @timestamp, ResponseTime, Endpoint
| filter ResponseTime > 1000
| sort ResponseTime desc
| limit 100
Advanced Monitoring Patterns
Distributed Tracing Integration
Integrating k6 with a distributed tracing system lets you analyze individual request paths end to end.
Jaeger/Zipkin integration:
import http from 'k6/http';
import { randomString } from 'https://jslib.k6.io/k6-utils/1.2.0/index.js';
export default function () {
  // Generate trace/span IDs. Note: randomString's second argument is a
  // character set, so pass the hex characters explicitly (passing 'hex'
  // would draw only from the letters h, e, and x).
  const traceId = randomString(32, '0123456789abcdef');
  const spanId = randomString(16, '0123456789abcdef');
  // Attach B3 propagation headers
  const response = http.get('https://api.example.com/users', {
    headers: {
      'X-B3-TraceId': traceId,
      'X-B3-SpanId': spanId,
      'X-B3-Sampled': '1',
    },
  });
  console.log(`Trace ID: ${traceId}, Duration: ${response.timings.duration}ms`);
}
Real-Time Stream Processing
Use Kafka as an intermediate layer for real-time processing and analysis.
k6 Kafka output (removed from k6 core in v0.34; requires a binary built with the xk6-output-kafka extension):
k6 run --out kafka=brokers=localhost:9092,topic=k6-metrics script.js
Kafka consumer example (Python):
from kafka import KafkaConsumer          # pip install kafka-python
from influxdb import InfluxDBClient      # pip install influxdb (v1 client)
import json

def send_alert(message):
    # Placeholder: wire this up to Slack/DingTalk/PagerDuty as needed
    print(f"ALERT: {message}")

consumer = KafkaConsumer(
    'k6-metrics',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
influx_client = InfluxDBClient(host='localhost', port=8086, database='k6')
for message in consumer:
    metric = message.value
    # Convert to the InfluxDB point format
    point = {
        "measurement": metric['type'],
        "tags": metric.get('tags', {}),
        "time": metric['timestamp'],
        "fields": {"value": metric['value']}
    }
    influx_client.write_points([point])
    # Naive real-time anomaly check
    if metric['type'] == 'http_req_duration' and metric['value'] > 1000:
        send_alert(f"High latency detected: {metric['value']}ms")
Multi-Dimensional Analysis
Build the performance analysis along several dimensions:
1. Time:
- Real-time monitoring (second granularity)
- Short-term trends (minutes/hours)
- Long-term trends (days/weeks/months)
- Year-over-year and period-over-period comparison
2. Location:
- Geographic distribution
- Data-center comparisons
- CDN node performance
- Multi-region load balancing
3. Business:
- Core business flows
- User-scenario simulation
- Peak business hours
- Conversion-funnel analysis
Example implementation:
import http from 'k6/http';
import { check } from 'k6';
export const options = {
  scenarios: {
    // Scenario 1: core business flow
    core_flow: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '5m', target: 100 },
        { duration: '10m', target: 100 },
        { duration: '5m', target: 0 },
      ],
      tags: { scenario: 'core_flow', priority: 'high' },
    },
    // Scenario 2: auxiliary features
    auxiliary_flow: {
      executor: 'constant-vus',
      vus: 50,
      duration: '20m',
      tags: { scenario: 'auxiliary_flow', priority: 'medium' },
    },
  },
  thresholds: {
    // Different thresholds per scenario
    'http_req_duration{scenario:core_flow}': ['p(95)<500'],
    'http_req_duration{scenario:auxiliary_flow}': ['p(95)<1000'],
  },
};
export default function () {
  // Simulate user geography
  const regions = ['us-east', 'eu-west', 'ap-southeast'];
  const region = regions[__VU % regions.length];
  const response = http.get('https://api.example.com/users', {
    tags: {
      region: region,
      user_type: __VU % 2 === 0 ? 'premium' : 'free',
    },
  });
  check(response, {
    'status is 200': (r) => r.status === 200,
  }, {
    region: region,
  });
}
Best Practices and Optimization Strategies
Data Collection Optimization
1. Sampling:
Under high concurrency there is no need to record every data point in detail:
import http from 'k6/http';
export const options = {
  // Choose which trend statistics appear in the end-of-test summary
  summaryTrendStats: ['avg', 'min', 'med', 'max', 'p(95)', 'p(99)', 'count'],
  // Connection-reuse settings (these shape load behavior; they do not
  // reduce metric volume)
  noConnectionReuse: false,
  noVUConnectionReuse: false,
};
// Custom sampling logic
let sampleCounter = 0;
const sampleRate = 10; // log details for one request in every 10
export default function () {
  const response = http.get('https://api.example.com/users');
  sampleCounter++;
  if (sampleCounter % sampleRate === 0) {
    // Record detailed data
    console.log(JSON.stringify({
      timestamp: Date.now(),
      duration: response.timings.duration,
      status: response.status,
      size: response.body.length,
    }));
  }
}
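Every-Nth sampling is simple but biased toward periodic patterns. An alternative with fixed memory and uniform coverage is reservoir sampling (Algorithm R), sketched here for a downstream processor:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of size k from a stream of unknown
    length (Algorithm R). Bounds memory while preserving distribution
    estimates such as percentiles."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility here
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = x    # replace with decreasing probability k/(i+1)
    return sample
```

Percentiles computed over the reservoir approximate those of the full stream, which every-Nth sampling cannot guarantee if latency correlates with request order.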
2. Batched writes:
# InfluxDB batch-write settings
export K6_INFLUXDB_PUSH_INTERVAL=5s
export K6_INFLUXDB_CONCURRENT_WRITES=10
3. Compression:
Prometheus remote write already batches and snappy-compresses samples; tune the queue to reduce network overhead:
# Prometheus remote-write tuning (prometheus.yml)
remote_write:
  - url: http://prometheus:9090/api/v1/write
    queue_config:
      capacity: 10000
      max_shards: 20
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'k6_.*'
        action: keep
Storage Optimization
1. Tiered storage:
- Hot data (7 days): SSD, fast queries
- Warm data (30 days): HDD, moderate performance
- Cold data (1 year): archive storage, occasional queries
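Capacity for each tier can be estimated from the sample rate (the bytes-per-point figure below is an assumption for illustration; compressed time-series points are often only a few bytes):

```python
def storage_estimate(points_per_sec, bytes_per_point, tiers):
    """Rough capacity plan per storage tier, in GiB.
    tiers maps tier name -> retention in days."""
    day = 86400  # seconds per day
    return {name: points_per_sec * day * days * bytes_per_point / 2**30
            for name, days in tiers.items()}
```

At 1000 points/s and an assumed 3 bytes/point, the hot tier needs under 2 GiB; the cold tier dominates, which is why downsampling before archival pays off.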
2. Downsampling:
-- InfluxDB downsampling example
CREATE CONTINUOUS QUERY "cq_downsample_1h" ON "k6"
BEGIN
SELECT
mean("value") AS "mean",
max("value") AS "max",
min("value") AS "min"
INTO "k6"."downsampled_1h"."http_req_duration"
FROM "http_req_duration"
GROUP BY time(1h), *
END
3. Data cleanup:
#!/bin/bash
# cleanup-old-data.sh
# Delete raw data older than 30 days
influx -database k6 -execute "
DELETE FROM http_req_duration WHERE time < now() - 30d
"
# Delete aggregated data older than one year (DROP SERIES does not accept
# time predicates, so use DELETE here as well)
influx -database k6 -execute "
DELETE FROM downsampled_1h WHERE time < now() - 365d
"
Security Considerations
1. Authentication and authorization:
# Grafana configuration (grafana.ini)
[auth]
disable_login_form = false
oauth_auto_login = true
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
[users]
allow_sign_up = false
auto_assign_org = true
auto_assign_org_role = Viewer
2. Encryption in transit and at rest:
# InfluxDB configuration
[http]
https-enabled = true
https-certificate = "/etc/ssl/certs/influxdb.pem"
https-private-key = "/etc/ssl/certs/influxdb-key.pem"
# Note: InfluxDB OSS has no built-in at-rest encryption setting; encrypt at
# the disk/volume layer instead (e.g. LUKS or cloud volume encryption).
3. Network isolation:
# Docker Compose network isolation
version: '3.8'
networks:
  monitoring:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16
  internal:
    driver: bridge
    internal: true  # no external access
services:
  influxdb:
    networks:
      - internal
      - monitoring
  grafana:
    networks:
      - monitoring
    ports:
      - "3000:3000"
High-Availability Deployment
InfluxDB behind a load balancer (note: InfluxDB OSS does not support clustering or replication, so the compose file below only provides failover routing across two independent instances; true clustering requires InfluxDB Enterprise or application-level dual writes):
version: '3.8'
services:
  influxdb-1:
    image: influxdb:2.7
    environment:
      - INFLUXDB_META_DIR=/var/lib/influxdb/meta
      - INFLUXDB_DATA_DIR=/var/lib/influxdb/data
    volumes:
      - influxdb-1-data:/var/lib/influxdb
    networks:
      - influxdb-cluster
  influxdb-2:
    image: influxdb:2.7
    environment:
      - INFLUXDB_META_DIR=/var/lib/influxdb/meta
      - INFLUXDB_DATA_DIR=/var/lib/influxdb/data
    volumes:
      - influxdb-2-data:/var/lib/influxdb
    networks:
      - influxdb-cluster
  haproxy:
    image: haproxy:latest
    ports:
      - "8086:8086"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    depends_on:
      - influxdb-1
      - influxdb-2
    networks:
      - influxdb-cluster
networks:
  influxdb-cluster:
    driver: bridge
HAProxy configuration haproxy.cfg:
global
    maxconn 4096
defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
frontend influxdb_frontend
    bind *:8086
    default_backend influxdb_backend
backend influxdb_backend
    balance roundrobin
    option httpchk GET /health
    server influxdb-1 influxdb-1:8086 check
    server influxdb-2 influxdb-2:8086 check backup
Troubleshooting Guide
Diagnosing common issues:
#!/bin/bash
# diagnose-k6-monitoring.sh
echo "=== Checking InfluxDB connectivity ==="
curl -I http://localhost:8086/health
echo "=== Checking recent data writes ==="
influx -database k6 -execute "
SELECT count(*) FROM http_req_duration
WHERE time > now() - 5m
GROUP BY time(1m)
"
echo "=== Checking Grafana data sources ==="
curl -u admin:admin http://localhost:3000/api/datasources
echo "=== Checking disk usage ==="
du -sh /var/lib/influxdb/*
echo "=== Checking network connections ==="
netstat -an | grep :8086
echo "=== Running k6 with debug logging ==="
k6 run --out influxdb=http://localhost:8086/k6 \
--log-output=file=k6-debug.log \
--verbose \
script.js