Prometheus alerting configuration and Alertmanager usage:
Alerting rule configuration:
```yaml
groups:
  - name: example_alerts
    rules:
      - alert: HighCPUUsage
        # Non-idle CPU percentage per instance, averaged over all cores for the last 5 minutes
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
```
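Key fields:
- expr: the PromQL expression that defines the alert condition
- for: how long the condition must hold continuously before the alert fires; until then the alert is pending
- labels: labels attached to the alert, used for routing, grouping, and inhibition
- annotations: descriptive text for the alert, such as a summary and a description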
Alertmanager configuration:
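```yaml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'alert@example.com'
        from: 'prometheus@example.com'
        # A smarthost is required here unless global.smtp_smarthost is set.
        # Placeholder value, not from the original configuration.
        smarthost: 'smtp.example.com:587'
    webhook_configs:
      - url: 'http://webhook.example.com/alert'
```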
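Alert grouping:
- group_by: the labels used to group alerts; alerts with the same values for these labels share one notification
- group_wait: how long to wait before sending the first notification for a new group, so that alerts arriving close together are merged
- group_interval: how long to wait before notifying about new alerts added to a group that has already been notified
- repeat_interval: how long to wait before re-sending a notification for alerts that are still firing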
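To show how these timing settings interact with routing, here is a minimal sketch of a route tree in which critical alerts are notified sooner and repeated more often than everything else. It assumes Alertmanager 0.22 or later (for the `matchers` field); the `oncall` receiver, its webhook URL, and the specific durations are hypothetical values chosen for illustration.

```yaml
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 30s        # illustrative value
  group_interval: 5m     # illustrative value
  repeat_interval: 12h
  routes:
    # Child route: settings not overridden here are inherited from the parent route
    - matchers:
        - severity="critical"
      receiver: 'oncall'           # hypothetical receiver name
      group_wait: 10s              # notify about critical alerts sooner
      repeat_interval: 1h          # and repeat them more often

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook.example.com/alert'
  - name: 'oncall'
    webhook_configs:
      - url: 'http://webhook.example.com/oncall'   # hypothetical URL
```

Alerts that match no child route fall back to the top-level receiver and timings.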
Alert inhibition:
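```yaml
inhibit_rules:
  # Mute warning alerts while a critical alert with the same alertname and instance is firing.
  # Newer Alertmanager releases deprecate source_match/target_match in favor of
  # source_matchers/target_matchers.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```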
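Alert silencing:
- Create silences through the Alertmanager API
- Each silence has a time range and a set of matchers
- Well suited to maintenance windows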
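Best practices:
- Set alert thresholds sensibly to avoid alert fatigue
- Use tiered severities (info, warning, critical)
- Review and tune alerting rules regularly
- Pair alerting with Grafana to visualize alerts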
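As one possible way to apply the tiered-severity practice, the sketch below defines the CPU alert at two levels. The group name and the 95% critical threshold are assumptions made for illustration, not recommended values. Both rules deliberately share the alert name `HighCPUUsage`, so an inhibition rule that matches on `alertname` and `instance` (like the one shown earlier) would mute the warning once the critical alert fires for the same instance.

```yaml
groups:
  - name: example_tiered_alerts   # hypothetical group name
    rules:
      # Warning tier: sustained CPU usage above 80%
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
      # Critical tier: same expression with a higher, illustrative threshold
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
```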