How to troubleshoot common Prometheus failures?

February 21, 15:37

Prometheus troubleshooting and common issue resolution:

Prometheus Fails to Start:

  1. Check configuration file syntax:
bash
promtool check config /etc/prometheus/prometheus.yml
  2. Check port usage:
bash
lsof -i :9090
  3. View logs:
bash
journalctl -u prometheus -f
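
Where lsof is not available, the port check can be approximated from a script. The following is a minimal sketch, not part of the original steps: the function name port_in_use and the default host 127.0.0.1 are assumptions chosen for illustration, and it only detects a listener that accepts TCP connections.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: a successful connect means another process is
    already listening, which would stop Prometheus from binding the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
```

Calling port_in_use(9090) before starting the service gives the same yes/no answer as the lsof check, without showing which process owns the port.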

Data Collection Failure:

  1. Check target health status (up is 1 when the last scrape succeeded and 0 when it failed):
promql
up{job="your-job"}
  2. Check network connectivity from the Prometheus host:
bash
curl http://target:port/metrics
  3. Check authentication configuration:
  • Basic Auth username and password
  • TLS certificate validity
  • Bearer Token expiration
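
The target health check can also be scripted against the Prometheus HTTP API. This is a rough sketch under stated assumptions: the helper names query_instant and down_targets are made up for illustration, and the base URL passed in is whatever address your server listens on. It relies on the instant-query endpoint /api/v1/query, which returns each sample as a [timestamp, "value"] pair.

```python
import json
import urllib.parse
import urllib.request


def query_instant(base_url: str, promql: str, timeout: float = 5.0) -> dict:
    """Run an instant query via /api/v1/query and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_targets(response: dict) -> list[str]:
    """List the instance label of every series in an up result whose value is 0."""
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # sample values arrive as strings
        if float(value) == 0:
            down.append(series["metric"].get("instance", "unknown"))
    return down
```

For example, down_targets(query_instant("http://localhost:9090", 'up{job="your-job"}')) would return the instances whose last scrape failed, assuming Prometheus is reachable at that address.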

Query Performance Issues:

  1. Optimize query statements:
  • Reduce the time range
  • Filter with label matchers
  • Avoid selectors that match every series of a high-cardinality metric
  2. Use Recording Rules:
yaml
groups:
  - name: performance
    rules:
      - record: job:requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
  3. Monitor query performance (average query duration over the last 5 minutes):
promql
rate(prometheus_engine_query_duration_seconds_sum[5m])
  / rate(prometheus_engine_query_duration_seconds_count[5m])
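
The _sum and _count series of a summary such as prometheus_engine_query_duration_seconds only become useful once they are divided over an interval. The arithmetic can be sketched as follows; the function name average_duration is an assumption for illustration, and the inputs are two readings of the same pair of counters taken some time apart.

```python
from typing import Optional


def average_duration(sum_before: float, count_before: float,
                     sum_after: float, count_after: float) -> Optional[float]:
    """Mean seconds per query between two readings of _sum and _count."""
    queries = count_after - count_before
    if queries <= 0:
        # No queries ran in the interval, or the counters were reset by a restart
        return None
    return (sum_after - sum_before) / queries
```

If _sum grows from 10.0 to 16.0 seconds while _count grows from 100 to 130, the 30 queries in that window averaged 0.2 seconds each, which is the same quantity the rate-over-rate PromQL expression computes.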

High Memory Usage:

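  1. Adjust data retention time (traditionally set with a command-line flag rather than in prometheus.yml):
bash
prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=7d
  2. Filter unnecessary metrics (inside the scrape_config of the affected job):
yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'expensive_.*'
    action: drop
  3. Monitor memory metrics:
promql
process_resident_memory_bytes
prometheus_tsdb_head_series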
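
Before writing a drop rule it helps to know which metric names contribute the most series. The sketch below counts sample lines per metric name in a /metrics payload and checks a name against a drop regex. The helper names series_per_metric and would_drop are hypothetical, and Python's re module is used only as an approximation of the RE2 syntax that Prometheus uses, which should be adequate for simple patterns like the one above.

```python
import re
from collections import Counter


def series_per_metric(exposition: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition format."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or whitespace (value)
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        counts[name] += 1
    return counts


def would_drop(name: str, regex: str) -> bool:
    """Relabel regexes are fully anchored in Prometheus, hence fullmatch."""
    return re.fullmatch(regex, name) is not None
```

Feeding the output of curl http://target:port/metrics into series_per_metric(...).most_common(10) gives a quick view of the heaviest metric names on one target.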

Disk Space Insufficient:

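  1. Check data size:
bash
du -sh /var/lib/prometheus/
  2. Configure data retention policy (command-line flags; whichever limit is reached first applies):
bash
prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=10GB
  3. Clean old data (promtool has no delete-blocks subcommand; use the TSDB admin API, which requires starting Prometheus with --web.enable-admin-api):
bash
# Mark all samples older than the cutoff for deletion (localhost:9090 is assumed)
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}&end=2024-01-01T00:00:00Z'
# Remove the marked data from disk
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones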
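
To choose sensible retention values, the disk requirement can be estimated up front. The sketch below applies the rule of thumb from the Prometheus storage documentation: retention time in seconds, times ingested samples per second, times bytes per sample. The function name and the default of 2 bytes per sample are assumptions (the documentation cites roughly 1 to 2 bytes per sample on average), and the ingestion rate would typically be read from rate(prometheus_tsdb_head_samples_appended_total[5m]).

```python
def estimated_disk_bytes(retention_days: float,
                         samples_per_second: float,
                         bytes_per_sample: float = 2.0) -> float:
    """Rough disk need: retention seconds * ingestion rate * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample
```

With 15 days of retention and 10,000 samples per second, the estimate is about 25.9 GB, so a 10GB retention.size limit would trigger well before the 15d time limit in that scenario.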

Alerts Not Triggering:

  1. Check alert rules:
bash
promtool check rules /etc/prometheus/rules/*.yml
  2. Check alert status (the alertstate label is either pending or firing):
promql
ALERTS{alertname="your-alert"}
  3. Check Alertmanager connection:
promql
prometheus_notifications_alertmanagers_discovered
prometheus_notifications_queue_length
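
The result of an ALERTS query indicates where an alert is stuck. An empty result means the rule expression currently returns nothing. A pending series means the expression is true but the rule's for duration has not elapsed yet. A firing series means Prometheus considers the alert active, so the problem is more likely on the Alertmanager side. A small sketch that groups a query response by state follows; the function name alert_states is hypothetical, and the input is the decoded JSON of an instant query for ALERTS.

```python
def alert_states(response: dict) -> dict[str, list[str]]:
    """Group alert names from an ALERTS query result by their alertstate label."""
    states: dict[str, list[str]] = {"pending": [], "firing": []}
    for series in response["data"]["result"]:
        labels = series["metric"]
        state = labels.get("alertstate", "unknown")
        states.setdefault(state, []).append(labels.get("alertname", "unnamed"))
    return states
```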

TSDB Corruption:

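  1. Check TSDB health:
bash
promtool tsdb analyze /var/lib/prometheus/
  2. Attempt recovery (promtool has no repair subcommand; Prometheus tries to repair the WAL itself at startup, and a block that the logs report as corrupted can be moved out of the data directory):
bash
# BLOCK_ULID is a placeholder for the block directory named in the error log
systemctl stop prometheus
mv /var/lib/prometheus/BLOCK_ULID /backup/
systemctl start prometheus
  3. Backup and restore (the snapshot API requires --web.enable-admin-api; SNAPSHOT_NAME is a placeholder for the name returned by the API):
bash
# Backup: the snapshot is written to the snapshots/ directory inside the data directory
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
cp -r /var/lib/prometheus/snapshots/SNAPSHOT_NAME /backup/
# Restore: run with Prometheus stopped
cp -r /backup/SNAPSHOT_NAME/* /var/lib/prometheus/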
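
When snapshots are taken through the TSDB admin API (POST /api/v1/admin/tsdb/snapshot), the response carries the snapshot name, and the files land under snapshots/ inside the data directory. A backup script might resolve the directory to copy roughly like this; the function name snapshot_dir is an assumption for illustration, and the response shape follows the HTTP API documentation.

```python
from pathlib import PurePosixPath


def snapshot_dir(data_dir: str, response: dict) -> PurePosixPath:
    """Return the directory a snapshot API response points at."""
    if response.get("status") != "success":
        # Error responses carry a human-readable message in the "error" field
        raise RuntimeError(f"snapshot failed: {response.get('error', 'unknown error')}")
    return PurePosixPath(data_dir) / "snapshots" / response["data"]["name"]
```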

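Common Error Messages:

  • context deadline exceeded: A query or scrape did not finish within its timeout
  • invalid parameter: A request parameter was malformed or missing
  • out of order sample: A sample arrived with a timestamp older than the newest sample already stored for that series
  • duplicate series: Two series ended up with the same label set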
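
As a first triage step, log lines can be matched against the messages listed above. The following is a simple sketch: the KNOWN_ERRORS table mirrors that list, the function name classify is hypothetical, and plain substring matching is an assumption that may miss differently worded log lines.

```python
KNOWN_ERRORS = {
    "context deadline exceeded": "query or scrape timeout",
    "invalid parameter": "malformed or missing request parameter",
    "out of order sample": "sample older than the newest one stored for its series",
    "duplicate series": "two series with the same label set",
}


def classify(log_line: str) -> list[str]:
    """Return the likely causes whose message text appears in the log line."""
    lowered = log_line.lower()
    return [cause for message, cause in KNOWN_ERRORS.items() if message in lowered]
```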

Best Practices:

  1. Regularly backup configuration and data
  2. Monitor Prometheus self-metrics
  3. Set reasonable resource limits
  4. Use log aggregation tools
  5. Establish troubleshooting procedures
Tags: Prometheus