What are CDN performance monitoring metrics? How do you monitor CDN performance? - Interview Questions

Importance of CDN Performance Monitoring

CDN performance monitoring is essential for maintaining CDN service quality and user experience. Monitoring CDN performance metrics in real time lets you detect and resolve issues promptly, tune CDN configuration, and improve overall performance.

Core Monitoring Metrics

1. Latency Metrics

Response Time

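Definition: Time from the user initiating a request to receiving the complete response

Key metrics:

TTFB (Time to First Byte): Time from sending the request until the first byte of the response arrives
TTLB (Time to Last Byte): Time from sending the request until the last byte of the response arrives
Total response time: End-to-end duration of the complete request-response cycle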

Target values:

Static content: <100ms
Dynamic content: <500ms
API requests: <200ms
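The targets above can be checked against measured samples. The sketch below is a minimal illustration, assuming the targets are applied to a high percentile (p95 here) rather than to the average; the source does not say which statistic the targets refer to, and the names `TARGETS_MS`, `percentile`, and `meets_target` are hypothetical.

```python
import math

# Target values from the list above, in milliseconds.
TARGETS_MS = {"static": 100, "dynamic": 500, "api": 200}


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 1..100)."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_target(samples_ms, content_type, pct=95):
    """True if the chosen percentile is below the target for this content type."""
    return percentile(samples_ms, pct) < TARGETS_MS[content_type]


# Ten hypothetical response times for static content, in milliseconds.
static_samples = [42, 55, 61, 48, 95, 70, 58, 66, 51, 180]
print(percentile(static_samples, 50))          # median
print(meets_target(static_samples, "static"))  # p95 is dominated by the 180 ms outlier
```

A single slow request is enough to push p95 over the target in a small sample, which is why percentile-based checks are usually run over larger windows.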

Network Latency

Definition: Time for data to travel across the network between the client and the CDN node

Measurement methods:

bash
# Measure round-trip latency using ping (send 5 probes, then stop)
ping -c 5 cdn.example.com

# Measure per-hop path latency using traceroute
traceroute cdn.example.com

2. Throughput Metrics

Bandwidth Utilization

Definition: Ratio of actual bandwidth used to total bandwidth

Calculation formula:

shell
Bandwidth utilization = (Current bandwidth / Total bandwidth) × 100%

Monitoring dimensions:

Edge node bandwidth
Origin pull bandwidth
Total bandwidth utilization
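As a worked example of the formula above, current bandwidth is often derived from two readings of a byte counter taken a known interval apart. This is a minimal sketch under that assumption; the function name and parameters are hypothetical, and capacity is expressed in bits per second.

```python
def bandwidth_utilization(bytes_start, bytes_end, interval_s, capacity_bps):
    """Utilization in percent, from two byte-counter readings.

    bytes_start / bytes_end: cumulative bytes transferred at the two readings
    interval_s:              seconds between the readings
    capacity_bps:            total available bandwidth in bits per second
    """
    if interval_s <= 0 or capacity_bps <= 0:
        raise ValueError("interval_s and capacity_bps must be positive")
    current_bps = (bytes_end - bytes_start) * 8 / interval_s  # bytes -> bits
    return current_bps / capacity_bps * 100


# 7.5 GB transferred in 60 s on a 10 Gbit/s link is 1 Gbit/s, i.e. about 10%.
print(round(bandwidth_utilization(0, 7_500_000_000, 60, 10_000_000_000), 2))
```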

Request Volume

Key metrics:

QPS (Queries Per Second): Number of queries handled per second
RPS (Requests Per Second): Number of requests handled per second (used interchangeably with QPS here)
Peak QPS: Highest QPS observed during the monitoring period

Monitoring example:

javascript
// Calculate queries per second
let requestCount = 0
setInterval(() => {
  console.log(`QPS: ${requestCount}`)
  requestCount = 0
}, 1000)

// Increment count for each request
function handleRequest(request) {
  requestCount++
  // Process request...
}
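Peak QPS can also be computed offline from request timestamps, for example when they come from access logs. The sketch below assumes timestamps are Unix epoch seconds (floats) and buckets them by whole second; the function names are hypothetical.

```python
from collections import Counter


def qps_per_second(timestamps):
    """Count requests in each whole-second bucket."""
    return Counter(int(ts) for ts in timestamps)


def peak_qps(timestamps):
    """Highest per-second request count, or 0 if there are no requests."""
    buckets = qps_per_second(timestamps)
    return max(buckets.values(), default=0)


# Three requests in second 0, two in second 1, one in second 3.
sample = [0.1, 0.5, 0.9, 1.2, 1.3, 3.0]
print(peak_qps(sample))
```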

3. Availability Metrics

Node Availability

Definition: Ratio of time node provides service normally to total time

Calculation formula:

shell
Node availability = (Normal operation time / Total time) × 100%

Target values:

Single node: >99.9%
Overall CDN: >99.99%
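To make the availability targets concrete, it helps to convert them into a downtime budget. The following sketch assumes a 30-day measurement period by default; the function names are hypothetical.

```python
def node_availability(normal_seconds, total_seconds):
    """Availability in percent: (normal operation time / total time) * 100."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return normal_seconds / total_seconds * 100


def downtime_budget_minutes(availability_pct, period_days=30):
    """Maximum downtime in minutes that still meets the availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


print(round(downtime_budget_minutes(99.9), 2))   # single-node target
print(round(downtime_budget_minutes(99.99), 2))  # overall CDN target
```

Over 30 days, 99.9% allows roughly 43 minutes of downtime, while 99.99% allows only about 4 minutes.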

Failover Time

Definition: Time from node failure to traffic switching to other nodes

Target values:

Failure detection: <5 seconds
Traffic switching: <10 seconds
Total failover: <15 seconds

4. Cache Metrics

Cache Hit Rate

Definition: Ratio of requests returned from CDN cache to total requests

Calculation formula:

shell
Cache hit rate = (Cache hit requests / Total requests) × 100%

Target values:

Static content: >95%
Dynamic content: >70%
Overall: >90%

Optimization strategies:

nginx
# Set a long cache lifetime for static assets.
# "immutable" is only safe when file names are versioned or fingerprinted.
location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}

Origin Pull Rate

Definition: Ratio of requests requiring origin pull to total requests

Calculation formula:

shell
Origin pull rate = (Origin pull requests / Total requests) × 100%

Target value: <10%
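Both ratios can be computed from the cache status recorded for each request. The sketch below makes a simplifying assumption that is not in the source: every request whose status is not `HIT` is counted as an origin pull. Real CDNs report additional states (for example stale or revalidated responses) that may need separate handling.

```python
def cache_rates(statuses):
    """Return (cache_hit_rate, origin_pull_rate) in percent.

    statuses: iterable of cache status strings such as "HIT" or "MISS".
    Assumption: anything other than "HIT" required an origin pull.
    """
    statuses = list(statuses)
    total = len(statuses)
    if total == 0:
        return (0.0, 0.0)
    hits = sum(1 for status in statuses if status == "HIT")
    hit_rate = hits / total * 100
    origin_pull_rate = (total - hits) / total * 100
    return (hit_rate, origin_pull_rate)


hit_rate, pull_rate = cache_rates(["HIT"] * 9 + ["MISS"])
print(round(hit_rate, 1), round(pull_rate, 1))
```

Under this assumption the two rates always add up to 100%, which is why a >90% overall hit rate target and a <10% origin pull target describe the same goal.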

5. Error Metrics

HTTP Error Rate

Definition: Ratio of requests returning 4xx/5xx status codes to total requests

Key error codes:

4xx: Client errors (e.g., 404 Not Found)
5xx: Server errors (e.g., 502 Bad Gateway)

Target value: <1%

Timeout Rate

Definition: Ratio of requests that time out to total requests

Target value: <0.1%
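The error metrics can be derived from per-request records. This sketch assumes you have a list of HTTP status codes and a list of request durations; treating a request as timed out when its duration reaches a configured limit is an assumption for illustration, and the function names are hypothetical.

```python
def http_error_rates(status_codes):
    """Return 4xx, 5xx, and combined error rates in percent."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0, "total": 0.0}
    client_errors = sum(1 for code in status_codes if 400 <= code <= 499)
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return {
        "4xx": client_errors / total * 100,
        "5xx": server_errors / total * 100,
        "total": (client_errors + server_errors) / total * 100,
    }


def timeout_rate(durations_s, timeout_s):
    """Percentage of requests whose duration reached the timeout limit."""
    if not durations_s:
        return 0.0
    timed_out = sum(1 for duration in durations_s if duration >= timeout_s)
    return timed_out / len(durations_s) * 100


rates = http_error_rates([200] * 97 + [404, 404, 502])
print({name: round(value, 1) for name, value in rates.items()})
```

Splitting 4xx from 5xx is useful because a spike in 5xx usually points at the CDN or origin, while 4xx often reflects client behavior or broken links.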

Monitoring Tools and Platforms

1. CDN Built-in Monitoring

Monitoring provided by mainstream CDN service providers:

Cloudflare Analytics

Features:

Real-time traffic monitoring
Request analysis
Threat detection
Performance reports

Usage example:

javascript
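// Get monitoring data via API (run inside an ES module or an async function).
// Replace {zone_id} and {api_token} with your own values.
// Note: Cloudflare has deprecated this REST endpoint in favor of its GraphQL
// Analytics API, so check the current documentation before relying on it.
const response = await fetch('https://api.cloudflare.com/client/v4/zones/{zone_id}/analytics/dashboard', {
  headers: {
    'Authorization': 'Bearer {api_token}'
  }
})
if (!response.ok) {
  throw new Error(`Cloudflare API request failed: ${response.status}`)
}
const data = await response.json()
console.log(data)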

AWS CloudFront Metrics

Features:

Request volume statistics
Byte transfer statistics
Error rate monitoring
Latency monitoring

CloudWatch integration:

bash
# Get CloudFront metrics using AWS CLI
# CloudFront publishes its metrics to CloudWatch in us-east-1 with Region=Global
aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/CloudFront \
  --metric-name Requests \
  --dimensions Name=DistributionId,Value={distribution_id} Name=Region,Value=Global \
  --start-time 2026-02-19T00:00:00Z \
  --end-time 2026-02-19T23:59:59Z \
  --period 3600 \
  --statistics Sum

2. Third-party Monitoring Tools

Pingdom

Features:

Website performance monitoring
Availability monitoring
Page speed testing
Alert notifications

Characteristics:

Global monitoring nodes
Detailed performance reports
Easy to use

New Relic

Features:

Application Performance Monitoring (APM)
Infrastructure monitoring
User experience monitoring
Error tracking

Characteristics:

Full-stack monitoring
Real-time data
Powerful analytics

Datadog

Features:

Infrastructure monitoring
Application performance monitoring
Log management
Security monitoring

Characteristics:

Unified platform
Powerful integration capabilities
Flexible alerting

3. Self-built Monitoring Systems

Prometheus + Grafana

Architecture:

shell
CDN → Exporter → Prometheus → Grafana

Configuration example:

Prometheus configuration (prometheus.yml):

yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cdn'
    static_configs:
      - targets: ['cdn-exporter:9090']
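The scrape target above expects an exporter that serves metrics over HTTP. As a rough sketch of what such an exporter does, the code below renders counters in the Prometheus text exposition format and serves them at `/metrics` using only the standard library. The `COUNTERS` dictionary and its values are placeholders; a real exporter would populate them from CDN logs or the provider's API, and would more commonly use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder counters, named after the metrics used in the dashboard queries.
COUNTERS = {"cdn_requests_total": 0, "cdn_cache_hits": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9090):
    """Serve /metrics until interrupted (port matches the scrape target above)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus then pulls from it at each `scrape_interval`.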

Grafana dashboard:

json
{
  "dashboard": {
    "title": "CDN Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(cdn_requests_total[5m])"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "targets": [
          {
            "expr": "sum(rate(cdn_cache_hits[5m])) / sum(rate(cdn_requests_total[5m])) * 100"
          }
        ]
      }
    ]
  }
}

ELK Stack (Elasticsearch, Logstash, Kibana)

Usage:

Log collection and analysis
Performance monitoring
Error tracking

Configuration example:

Logstash configuration (logstash.conf):

conf
input {
  file {
    path => "/var/log/cdn/access.log"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "cdn-logs-%{+YYYY.MM.dd}"
  }
}

Monitoring Data Collection

1. Log Collection

Access log format:

nginx
log_format cdn '$remote_addr - $remote_user [$time_local] '
                '"$request" $status $body_bytes_sent '
                '"$http_referer" "$http_user_agent" '
                'rt=$request_time uct="$upstream_connect_time" '
                'uht="$upstream_header_time" urt="$upstream_response_time" '
                'cache=$upstream_cache_status';

Key fields:

request_time: Total request processing time, in seconds
upstream_connect_time: Time to establish a connection to the upstream (origin) server
upstream_header_time: Time to receive the upstream response headers
upstream_response_time: Time to receive the full upstream response
upstream_cache_status: Cache status (e.g., HIT, MISS, BYPASS, EXPIRED, STALE)
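A log line written in the `cdn` format above can be parsed with a regular expression. This is a minimal sketch: the sample line is invented for illustration, and the pattern assumes fields contain no embedded double quotes.

```python
import re

# One named group per variable in the log_format definition above.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" '
    r'rt=(?P<request_time>[\d.]+) uct="(?P<upstream_connect_time>[^"]*)" '
    r'uht="(?P<upstream_header_time>[^"]*)" urt="(?P<upstream_response_time>[^"]*)" '
    r'cache=(?P<upstream_cache_status>\S+)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["body_bytes_sent"] = int(fields["body_bytes_sent"])
    fields["request_time"] = float(fields["request_time"])
    return fields


# Invented sample line; upstream timings are "-" because the cache answered.
sample_line = (
    '203.0.113.7 - - [19/Feb/2026:10:15:32 +0000] "GET /img/logo.png HTTP/1.1" '
    '200 5120 "-" "curl/8.5.0" rt=0.004 uct="-" uht="-" urt="-" cache=HIT'
)
parsed = parse_line(sample_line)
print(parsed["status"], parsed["request_time"], parsed["upstream_cache_status"])
```

The upstream timing fields are left as strings because nginx writes `-` when no upstream was contacted.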

2. Metrics Collection

Custom metrics collection:

javascript
// Use Prometheus client library
const client = require('prom-client');

// Create metrics
const httpRequestDuration = new client.Histogram({
  name: 'cdn_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'code']
});

// Record metrics
const end = httpRequestDuration.startTimer();
// Process request...
end({ method: 'GET', route: '/api/data', code: 200 });

3. Real-time Monitoring

WebSocket real-time push:

javascript
// Use WebSocket to push monitoring data in real-time
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
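  // Periodically send monitoring data.
  // getCurrentQPS, getAverageLatency, and getCacheHitRate are placeholders
  // for your own metric accessors; they are not part of the ws library.
  const interval = setInterval(() => {
    const metrics = {
      qps: getCurrentQPS(),
      latency: getAverageLatency(),
      cacheHitRate: getCacheHitRate()
    };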
    ws.send(JSON.stringify(metrics));
  }, 1000);

  ws.on('close', () => {
    clearInterval(interval);
  });
});

Alerting Mechanism

1. Alert Rules

Common alert rules:

High latency alert

yaml
# Prometheus alert rules
groups:
  - name: cdn_alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(cdn_http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"

Low cache hit rate alert

yaml
- alert: LowCacheHitRate
  expr: sum(rate(cdn_cache_hits[5m])) / sum(rate(cdn_requests_total[5m])) * 100 < 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Low cache hit rate"
    description: "Cache hit rate is {{ $value }}%"

High error rate alert

yaml
- alert: HighErrorRate
  expr: sum(rate(cdn_errors_total[5m])) / sum(rate(cdn_requests_total[5m])) * 100 > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value }}%"
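The `for` field in the rules above means the condition must hold continuously for that long before the alert fires; until then it is only pending. The sketch below simulates that behavior on a list of samples. It is a simplified model for illustration, not Prometheus's actual evaluation code, and the function name is hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each sample.

    samples:     list of (timestamp_seconds, value), sorted by time
    threshold:   alert condition is value > threshold
    for_seconds: how long the condition must hold before firing
    """
    states = []
    breach_start = None  # time at which the current breach began
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            held_for = timestamp - breach_start
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            breach_start = None  # any recovery resets the timer
            states.append("inactive")
    return states


# Error rate (%) sampled once a minute; threshold 1%, for: 5m.
error_rate_samples = [(0, 0.5), (60, 2), (120, 2), (180, 2),
                      (240, 2), (300, 2), (360, 2), (420, 0.5)]
print(alert_states(error_rate_samples, threshold=1, for_seconds=300))
```

In this run the breach starts at t=60, stays pending through t=300, fires at t=360, and clears at t=420. A short spike that recovers within five minutes would never fire, which is how `for` filters out transient noise.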

2. Alert Notifications

Notification channels:

Email notification

yaml
# Alertmanager configuration
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'

SMS notification

yaml
receivers:
  - name: 'sms'
    webhook_configs:
      - url: 'https://sms.example.com/send'
        send_resolved: true

Instant messaging tools

yaml
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#cdn-alerts'
        username: 'CDN Alert Bot'

Performance Optimization Recommendations

1. Optimization Based on Monitoring Data

Latency optimization

Analyze request paths with high latency
Optimize caching strategies
Adjust CDN node configuration

Cache optimization

Identify content with low cache hit rate
Adjust TTL settings
Optimize cache key configuration

Bandwidth optimization

Analyze content with high bandwidth consumption
Enable compression
Optimize images and videos

2. A/B Testing

Test different configurations:

javascript
// A/B test different caching strategies
// hashUserId is a placeholder for any stable hash that maps a user ID to a non-negative integer
function getCacheStrategy(userId) {
  const hash = hashUserId(userId);
  if (hash % 2 === 0) {
    return 'strategy-a'; // Long cache
  } else {
    return 'strategy-b'; // Short cache
  }
}
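The same bucketing idea can be written with a concrete hash so that assignments stay stable across processes and restarts. This is a sketch assuming SHA-256 over a salted user ID; the salt value and function name are hypothetical, and changing the salt reshuffles users between groups.

```python
import hashlib


def cache_strategy(user_id, salt="cache-ttl-test"):
    """Deterministically assign a user to 'strategy-a' or 'strategy-b'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 2
    return "strategy-a" if bucket == 0 else "strategy-b"


# The same user always gets the same strategy.
print(cache_strategy("user-42") == cache_strategy("user-42"))
```

Using a cryptographic hash instead of a language-specific built-in hash keeps the split close to 50/50 and avoids per-process randomization.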

3. Capacity Planning

Predict based on historical data:

python
# Use time series forecasting
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

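# Load historical data (assumes a 'requests' column with one row per day, in chronological order)
data = pd.read_csv('cdn_metrics.csv')

# Train model: ARIMA with order (p=5, d=1, q=0)
model = ARIMA(data['requests'], order=(5,1,0))
model_fit = model.fit()

# Forecast the next 7 rows (7 days for daily data)
forecast = model_fit.forecast(steps=7)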
print(forecast)

Interview Points

When answering this question, emphasize:

Understanding of core CDN monitoring metrics and their target values
Mastery of mainstream monitoring tools and platforms
Ability to design monitoring data collection solutions
Understanding of the importance of alerting mechanisms
Experience in performance optimization based on monitoring data