Prometheus troubleshooting and common issue resolution:
Prometheus Fails to Start:
- Check configuration file syntax:
    promtool check config /etc/prometheus/prometheus.yml
- Check whether port 9090 is already in use:
    lsof -i :9090
- View logs:
    journalctl -u prometheus -f
Data Collection Failure:
- Check target health status (a value of 0 means the last scrape failed):
    up{job="your-job"}
- Check network connectivity from the Prometheus host:
    curl http://target:port/metrics
- Check authentication configuration:
  - Basic Auth username and password
  - TLS certificate validity
  - Bearer token expiration
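The connectivity and authentication checks above can be folded into one quick probe. The following is a minimal sketch, assuming `curl` is installed; the function names `classify_status` and `probe_target` are hypothetical, and the status-to-cause mapping is a rough triage heuristic rather than anything defined by Prometheus.

```bash
#!/usr/bin/env bash
# Map an HTTP status code from a scrape endpoint to a likely cause.
# The mapping is a rough triage heuristic, not an official list.
classify_status() {
  case "$1" in
    200) echo "ok" ;;
    000) echo "unreachable: check DNS, firewall, or TLS handshake" ;;
    401) echo "unauthorized: check Basic Auth credentials or bearer token" ;;
    403) echo "forbidden: credentials accepted but access denied" ;;
    404) echo "not found: check metrics_path" ;;
    5??) echo "server error: check the target's own logs" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Fetch only the status code of a target's metrics endpoint.
# curl prints 000 when no HTTP response was received at all.
probe_target() {
  local url="$1" code
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")
  classify_status "$code"
}
```

Running `probe_target http://target:port/metrics` from the Prometheus host prints a one-line diagnosis, which helps separate network problems from credential problems before editing the scrape configuration.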
Query Performance Issues:
- Optimize query statements:
  - Reduce the time range
  - Filter by labels
  - Avoid full scans
- Use recording rules:
    groups:
      - name: performance
        rules:
          - record: job:requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))
- Monitor query performance (the query engine exposes its timings as a summary):
    prometheus_engine_query_duration_seconds_sum
    prometheus_engine_query_duration_seconds_count
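A summary's `_sum` and `_count` series are most useful when divided, which gives the mean duration over a window. The sketch below builds that expression and sends it to the HTTP query API. It assumes the query engine's summary is named `prometheus_engine_query_duration_seconds`; the helper names `avg_duration_expr` and `run_query` and the server address are hypothetical.

```bash
#!/usr/bin/env bash
# Build a PromQL expression for mean latency over a window:
# rate of the summary's _sum divided by rate of its _count.
avg_duration_expr() {
  local metric="$1" window="${2:-5m}"
  printf 'rate(%s_sum[%s]) / rate(%s_count[%s])' \
    "$metric" "$window" "$metric" "$window"
}

# Run an instant query against the Prometheus HTTP API.
# base_url is an assumption; adjust it to your server's address.
run_query() {
  local base_url="$1" expr="$2"
  curl -s -G "${base_url}/api/v1/query" --data-urlencode "query=${expr}"
}
```

For example, `run_query http://localhost:9090 "$(avg_duration_expr prometheus_engine_query_duration_seconds 5m)"` returns the mean evaluation time per query phase as JSON.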
High Memory Usage:
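- Adjust data retention time. Retention is normally set with a command-line flag when starting Prometheus rather than in prometheus.yml:
    prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=7d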
- Filter unnecessary metrics (placed inside the relevant scrape_configs entry):
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'expensive_.*'
        action: drop
- Monitor memory metrics (resident memory of the process and the number of series held in the in-memory head block):
    process_resident_memory_bytes
    prometheus_tsdb_head_series
Disk Space Insufficient:
- Check data size:
    du -sh /var/lib/prometheus/
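- Configure a data retention policy with the retention flags; when both are set, whichever limit is reached first applies:
    prometheus --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=10GB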
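- Clean old data. Series can be deleted through the TSDB admin API, which requires starting Prometheus with --web.enable-admin-api; the second call removes the deleted data from disk:
    curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="your-job"}&end=2024-01-01T00:00:00Z'
    curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones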
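Before choosing retention values it helps to estimate how much disk a given policy needs. The Prometheus storage documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula; the function names are hypothetical, and the ingestion rate is assumed to come from a query such as `rate(prometheus_tsdb_head_samples_appended_total[5m])`.

```bash
#!/usr/bin/env bash
# Estimate on-disk size in bytes:
#   retention_seconds * samples_per_second * bytes_per_sample
# bytes_per_sample defaults to 2, the upper end of the usual 1-2 byte range.
estimate_disk_bytes() {
  local retention_days="$1" samples_per_sec="$2" bytes_per_sample="${3:-2}"
  echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}

# Convert bytes to whole gibibytes (rounded down).
bytes_to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}
```

With 15 days of retention and 10,000 samples per second, `bytes_to_gib "$(estimate_disk_bytes 15 10000)"` gives 24, so a 10GB size limit would start removing data well before the 15-day time limit is reached.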
Alerts Not Triggering:
- Check alert rules:
    promtool check rules /etc/prometheus/rules/*.yml
- Check alert status:
    ALERTS{alertname="your-alert"}
- Check the Alertmanager connection:
    prometheus_notifications_queue_length
TSDB Corruption:
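- Inspect the TSDB (list shows the blocks on disk; analyze reports cardinality and churn statistics for a block):
    promtool tsdb list /var/lib/prometheus/
    promtool tsdb analyze /var/lib/prometheus/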
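- Attempt recovery. Prometheus tries to repair the write-ahead log automatically at startup. If a specific block is corrupted, stop Prometheus, move that block's directory (or the wal directory) out of the data directory, and restart:
    systemctl stop prometheus
    mv /var/lib/prometheus/<block-id> /backup/corrupt/
    systemctl start prometheus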
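- Backup and restore (snapshots use the TSDB admin API, which requires --web.enable-admin-api):
    # Backup: the snapshot is written to /var/lib/prometheus/snapshots/<name>
    curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
    cp -r /var/lib/prometheus/snapshots/<name> /backup/
    # Restore: stop Prometheus, then copy the snapshot into the data directory
    cp -r /backup/<name>/* /var/lib/prometheus/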
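When a snapshot is taken through the TSDB admin API (which requires `--web.enable-admin-api`), the response is a small JSON document naming the snapshot directory. The sketch below extracts that name so a backup script can locate the files. The helper names, the server address, and the data directory argument are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Extract the snapshot name from the admin API's JSON response, e.g.
#   {"status":"success","data":{"name":"20171210T211224Z-2be650b6d019eb54"}}
# sed avoids a dependency on jq; nothing is printed if no name is found.
snapshot_name() {
  sed -n 's/.*"name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Request a snapshot and print the directory it was written to.
# base_url and data_dir are assumptions; adjust them to your setup.
take_snapshot() {
  local base_url="$1" data_dir="$2" name
  name=$(curl -s -X POST "${base_url}/api/v1/admin/tsdb/snapshot" | snapshot_name)
  [ -n "$name" ] || { echo "snapshot failed" >&2; return 1; }
  echo "${data_dir}/snapshots/${name}"
}
```

A backup job could then run `cp -r "$(take_snapshot http://localhost:9090 /var/lib/prometheus)" /backup/`.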
Common Error Codes:
- context deadline exceeded: Query timeout
- invalid parameter: Parameter error
- out of order sample: Out-of-order samples
- duplicate series: Duplicate series
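To see which of these errors dominates, the log can be scanned for each string. This is a minimal sketch: the function name `count_known_errors` is hypothetical, and the exact wording of log messages varies between Prometheus versions, so treat the patterns as starting points.

```bash
#!/usr/bin/env bash
# Count occurrences of well-known error strings in log text read from stdin.
# Prints one "count<TAB>message" line per pattern, in a fixed order.
count_known_errors() {
  local log pattern
  log=$(cat)
  for pattern in \
    "context deadline exceeded" \
    "invalid parameter" \
    "out of order sample" \
    "duplicate series"; do
    # grep -c counts matching lines; -F treats the pattern as a literal string.
    printf '%s\t%s\n' \
      "$(printf '%s\n' "$log" | grep -cF -- "$pattern")" "$pattern"
  done
}
```

On a systemd host this could be fed with `journalctl -u prometheus --since "1 hour ago" | count_known_errors`.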
Best Practices:
- Regularly backup configuration and data
- Monitor Prometheus self-metrics
- Set reasonable resource limits
- Use log aggregation tools
- Establish troubleshooting procedures