How do you troubleshoot common Prometheus failures? - Interview question

Troubleshooting Prometheus and resolving common problems:

Prometheus fails to start:

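bash
# Validate the configuration file before starting the server
promtool check config /etc/prometheus/prometheus.yml

bash
# Check whether another process is already listening on port 9090
lsof -i :9090

bash
# Follow the service logs (assumes a systemd unit named "prometheus")
journalctl -u prometheus -f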

Data collection (scrape) failures:

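promql
# A value of 0 means the most recent scrape of that target failed
up{job="your-job"}

bash
# Fetch the metrics endpoint by hand to rule out network or exporter problems
curl http://target:port/metrics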

Query performance problems:

yaml
# Recording rule: precompute an expensive aggregation so queries read the stored result
groups:
  - name: performance
    rules:
      - record: job:requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

promql
# Cumulative time spent evaluating queries, and the number of queries evaluated
prometheus_engine_query_duration_seconds_sum
prometheus_engine_query_duration_seconds_count
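
The `_sum` and `_count` series of a summary are counters, so they are mostly useful together: the increase in `_sum` divided by the increase in `_count` over the same interval gives the average duration per query in that interval. The sketch below works this out for two readings taken some time apart. The function name `avg_duration` and the sample numbers are illustrative assumptions, not real measurements.

```shell
# avg_duration SUM_OLD COUNT_OLD SUM_NEW COUNT_NEW
# Average seconds per query between two readings of a summary's _sum and _count.
avg_duration() {
  awk -v s1="$1" -v c1="$2" -v s2="$3" -v c2="$4" 'BEGIN {
    queries = c2 - c1
    if (queries <= 0) { print "n/a"; exit }   # no new queries, or a counter reset
    printf "%.3f\n", (s2 - s1) / queries
  }'
}

# Hypothetical readings: 100 queries / 10.0 s earlier, 130 queries / 16.0 s now
avg_duration 10.0 100 16.0 130
```

In PromQL the same ratio over a sliding window is usually written as `rate(<name>_sum[5m]) / rate(<name>_count[5m])`.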

High memory usage:

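bash
# Retention is normally set with a command-line flag, not a "retention.time" key in prometheus.yml
prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=7d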

yaml
# Inside a scrape_configs entry: drop matching series after the scrape, before ingestion
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'expensive_.*'
    action: drop

promql
# Resident memory of the process, and the number of series currently in the TSDB head
process_resident_memory_bytes
prometheus_tsdb_head_series
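
Before writing a drop rule it helps to know which metric names a target exposes the most series for. A rough sketch, assuming the target serves the Prometheus text exposition format: the helper below (the name `top_metric_names` is hypothetical) counts sample lines per metric name read from standard input.

```shell
# top_metric_names [N]
# Reads Prometheus text exposition format on stdin and prints the N metric
# names with the most sample lines (default 10), as "<count> <name>".
top_metric_names() {
  awk '
    /^#/ || /^[[:space:]]*$/ { next }   # skip HELP/TYPE comments and blank lines
    {
      name = $0
      sub(/[{ ].*$/, "", name)          # cut at the first "{" or space
      count[name]++
    }
    END { for (n in count) print count[n], n }
  ' | sort -k1,1nr -k2,2 | head -n "${1:-10}"
}
```

It could be fed from a scrape by hand, for example `curl -s http://target:port/metrics | top_metric_names 5`. Histogram and summary components (`_bucket`, `_sum`, `_count`) are counted as separate names, which is usually what matters for series counts.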

Insufficient disk space:

bash
# Show the total size of the data directory
du -sh /var/lib/prometheus/

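bash
# Retention limits are normally set with command-line flags; whichever limit is reached first applies
prometheus --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=10GB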
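
When choosing retention values, a rough capacity estimate helps. The Prometheus storage documentation gives the rule of thumb `needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample`, with roughly 1 to 2 bytes per sample. A minimal sketch of that arithmetic follows. The function name `estimate_tsdb_bytes` and the ingestion rate of 10,000 samples per second are assumptions for illustration.

```shell
# estimate_tsdb_bytes RETENTION_DAYS SAMPLES_PER_SECOND BYTES_PER_SAMPLE
# Rule-of-thumb disk usage: retention seconds * ingestion rate * bytes per sample.
estimate_tsdb_bytes() {
  echo $(( $1 * 86400 * $2 * $3 ))
}

# Hypothetical load: 15 days of retention, 10,000 samples/s, 2 bytes per sample
estimate_tsdb_bytes 15 10000 2   # prints 25920000000 (about 26 GB)
```

Under these assumed numbers a 10GB size limit would be reached after roughly 6 days, well before a 15d time limit. The ingestion rate of a running server can be read from `rate(prometheus_tsdb_head_samples_appended_total[5m])`.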

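bash
# promtool has no delete-blocks subcommand. With --web.enable-admin-api (localhost:9090 assumed), delete samples up to `end`, then reclaim the space:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="your-job"}&end=2024-01-01T00:00:00Z'
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones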

Alerts do not fire:

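bash
# Validate the syntax of the alerting rule files
promtool check rules /etc/prometheus/rules/*.yml

promql
# Lists pending and firing alerts; an empty result means the rule expression currently matches nothing
ALERTS{alertname="your-alert"}

promql
# Notifications waiting to be sent to Alertmanager; a growing queue suggests delivery problems
prometheus_notifications_queue_length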

TSDB corruption:

bash
# Print cardinality and churn statistics for the most recent block
promtool tsdb analyze /var/lib/prometheus/

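bash
# promtool has no "repair" subcommand. List the blocks to find the damaged one, then
# stop Prometheus and move that block directory out of the data directory.
promtool tsdb list /var/lib/prometheus/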

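bash
# Back up: take a snapshot through the admin API (requires --web.enable-admin-api; localhost:9090 assumed).
# The response contains the snapshot name; set SNAPSHOT_NAME to it. Snapshots are stored under snapshots/ in the data directory.
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
cp -r "/var/lib/prometheus/snapshots/$SNAPSHOT_NAME" /backup/

# Restore: with Prometheus stopped, copy the snapshot's blocks into the data directory
cp -r "/backup/$SNAPSHOT_NAME"/* /var/lib/prometheus/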

Common error codes:

Best practices: