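How do you optimize Prometheus performance and scale it horizontally? - Interview Question

Prometheus performance optimization and scaling options:

Scrape optimization:

Tune scrape intervals per job: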

yaml
scrape_configs:
  - job_name: 'critical'
    scrape_interval: 15s
  - job_name: 'normal'
    scrape_interval: 30s
  - job_name: 'low-priority'
    scrape_interval: 60s

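Drop unneeded metrics (metric_relabel_configs belongs under a scrape job and runs after the scrape, before ingestion):

yaml
scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # Drop Go runtime and process metrics; the regex is fully anchored
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'
        action: drop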

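Set scrape limits (a scrape that returns more than sample_limit samples after metric relabeling is treated as failed; if a job has more than target_limit targets, they are marked as failed without being scraped):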

yaml
scrape_configs:
  - job_name: 'api'
    sample_limit: 10000
    target_limit: 100
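
Because exceeding sample_limit rejects the whole scrape, it helps to alert when that happens. The following is a minimal sketch of an alerting rule, assuming the built-in counter prometheus_target_scrapes_exceeded_sample_limit_total is available in your version; the group name, alert name, window, and severity label are illustrative assumptions, not part of the original answer.

```yaml
groups:
  - name: scrape-limits                      # hypothetical group name
    rules:
      - alert: ScrapeSampleLimitExceeded     # hypothetical alert name
        # Fires if any scrape was rejected for exceeding sample_limit
        expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 0
        for: 5m
        labels:
          severity: warning                  # assumed label convention
        annotations:
          summary: "A target exceeded sample_limit and its scrapes are being rejected"
```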

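Storage optimization:

Compression:

TSDB chunk compression is built in and always on. prometheus.yml has no `storage.tsdb.compression` setting, so there is nothing to configure for blocks. Only WAL compression is tunable, through the command-line flags shown further down.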

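Memory mapping and block duration:

Full head chunks have been memory-mapped from disk automatically since Prometheus 2.19, so no flag is needed to enable this (`--storage.tsdb.memory-map-on-write` is not a Prometheus flag). The maximum block duration can be capped with an advanced flag, which is mainly useful when blocks are uploaded by a Thanos Sidecar:

bash
--storage.tsdb.max-block-duration=2h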

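WAL tuning (WAL compression has been enabled by default since Prometheus 2.20; the segment size is an advanced flag):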

bash
--storage.tsdb.wal-compression=true
--storage.tsdb.wal-segment-size=20MB

Query optimization:

Use recording rules to precompute expensive expressions:

yaml
groups:
  - name: precompute
    interval: 30s
    rules:
      - record: job:qps:5m
        expr: sum by (job) (rate(http_requests_total[5m]))
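
A rule group like the one above only takes effect once its file is listed in prometheus.yml. A minimal sketch, assuming the rules are stored under a hypothetical rules/ directory next to the main config:

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 30s        # default interval for groups that do not set their own
rule_files:
  - "rules/*.yml"                 # hypothetical path; globs are allowed
```

Dashboards and alerts can then query job:qps:5m directly instead of re-evaluating the rate() expression on every refresh.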

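Query limits (Prometheus itself does not shard queries, and these limits are command-line flags rather than prometheus.yml settings; max-concurrency caps the number of queries running at once, not the parallelism within one query):

bash
--query.max-samples=50000000
--query.timeout=2m
--query.max-concurrency=16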

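Lookback delta (this is not a cache: it sets how far back an instant selector looks for the latest sample. Prometheus has no built-in query result cache; caching is usually added with a query frontend such as the Thanos Query Frontend):

bash
--query.lookback-delta=5m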

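Horizontal scaling options:

Thanos:

Sidecar: runs alongside each Prometheus, exposes its recent data to queries, and uploads TSDB blocks to object storage
Store Gateway: serves historical blocks from object storage (long-term storage)
Query Frontend: query result caching and query splitting
Querier: distributed queries across Sidecars and Store Gateways, with deduplication of HA replicas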
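
For the components above to tell Prometheus instances apart, each instance needs a unique set of external labels. A minimal sketch of the relevant prometheus.yml fragment; the label names and values are illustrative assumptions:

```yaml
# prometheus.yml (fragment) for one replica of an HA pair
global:
  external_labels:
    cluster: prod-eu-1        # hypothetical cluster name
    replica: prometheus-0     # hypothetical; must differ between HA replicas
```

With this layout, the Querier can be told to treat `replica` as the deduplication label (its `--query.replica-label` flag), so the two replicas appear as one series at query time.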

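Cortex:

Fully distributed, microservice-based architecture
Multi-tenant support
Scales horizontally by adding replicas of each component
Object storage backend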

VictoriaMetrics:

Single-node or cluster mode
High compression ratio
Compatible with the Prometheus query API and remote write
Low resource usage
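
Remote-write backends such as these receive data from Prometheus through a remote_write section. The following is a sketch only, assuming a single-node VictoriaMetrics reachable at a hypothetical host name; the queue values are illustrative and should be tuned against the observed remote-write lag.

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: http://victoriametrics:8428/api/v1/write   # hypothetical host
    queue_config:
      capacity: 10000               # samples buffered per shard (illustrative)
      max_shards: 50                # upper bound on parallel senders (illustrative)
      max_samples_per_send: 2000    # batch size per request (illustrative)
    write_relabel_configs:
      # Keep noisy runtime metrics out of long-term storage
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'
        action: drop
```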

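Federation (a global Prometheus scrapes only aggregated series, here the recording-rule outputs, from lower-level instances):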

yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-1:9090', 'prometheus-2:9090']

Resource limits (Kubernetes container spec):

yaml
resources:
  requests:
    memory: "4Gi"
    cpu: "2"
  limits:
    memory: "8Gi"
    cpu: "4"

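Monitor Prometheus's own performance metrics:

promql
# Ingestion
rate(prometheus_tsdb_head_samples_appended_total[5m])
sum by (job) (scrape_samples_post_metric_relabeling)

# Query latency (average per engine phase)
rate(prometheus_engine_query_duration_seconds_sum[5m])
  / rate(prometheus_engine_query_duration_seconds_count[5m])

# Storage
rate(prometheus_tsdb_compaction_duration_seconds_sum[5m])
  / rate(prometheus_tsdb_compaction_duration_seconds_count[5m])
prometheus_tsdb_retention_limit_bytes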
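
Self-monitoring queries are most useful when they drive alerts. A minimal sketch of two such rules, assuming the built-in metrics prometheus_tsdb_head_series, prometheus_rule_group_last_duration_seconds, and prometheus_rule_group_interval_seconds; the names and the series threshold are assumptions to adapt to your own capacity.

```yaml
groups:
  - name: prometheus-self                    # hypothetical group name
    rules:
      - alert: PrometheusHighSeriesCount     # hypothetical alert name
        # Active series drive memory usage; 2e6 is an assumed threshold
        expr: prometheus_tsdb_head_series > 2e6
        for: 15m
      - alert: PrometheusRuleGroupSlow       # hypothetical alert name
        # A rule group takes longer to evaluate than its own interval
        expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
        for: 10m
```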

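Best practices:

Set scrape intervals according to how critical each job is
Use recording rules to precompute frequently used queries
Regularly drop metrics that are not needed
Monitor Prometheus's own performance metrics
Set resource requests and limits sensibly
Use external storage for long-term retention
Consider Thanos or Cortex for horizontal scaling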