How to troubleshoot common Prometheus failures? - 面试题

Prometheus troubleshooting and common issue resolution:

Prometheus Fails to Start:

bash
promtool check config /etc/prometheus/prometheus.yml

bash
lsof -i :9090

bash
journalctl -u prometheus -f

Data Collection Failure:

promql
up{job="your-job"}

bash
curl http://target:port/metrics

Query Performance Issues:

yaml
groups:
  - name: performance
    rules:
      - record: job:requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

promql
prometheus_query_duration_seconds_sum
prometheus_query_duration_seconds_count

High Memory Usage:

yaml
storage:
  tsdb:
    retention.time: 7d

yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'expensive_.*'
    action: drop

promql
process_resident_memory_bytes
prometheus_tsdb_memory_series

Disk Space Insufficient:

bash
du -sh /var/lib/prometheus/

yaml
storage:
  tsdb:
    retention.time: 15d
    retention.size: 10GB

bash
promtool tsdb delete-blocks /var/lib/prometheus/ --min-time=2024-01-01T00:00:00Z

Alerts Not Triggering:

bash
promtool check rules /etc/prometheus/rules/*.yml

promql
ALERTS{alertname="your-alert"}

promql
prometheus_notifications_queue_length

TSDB Corruption:

bash
promtool tsdb analyze /var/lib/prometheus/

bash
promtool tsdb repair /var/lib/prometheus/

bash
# Backup
promtool tsdb snapshot /var/lib/prometheus/ /backup/

# Restore
cp -r /backup/* /var/lib/prometheus/

Common Error Codes:

Best Practices: