Concept of CDN Load Balancing
CDN load balancing refers to the mechanism of intelligently distributing user requests across multiple CDN edge nodes to optimize performance, improve availability, and ensure stability. It is one of the core components of the CDN system, directly affecting user experience and system reliability.
Load Balancing Strategies
1. Geographic Routing
Select the nearest node based on the user's geographic location:
How it works:
- Determine the user's approximate location from the client's IP or the DNS resolver's IP
- Select the nearest available node to the user
- Factor in network latency and path quality, not just physical distance
Advantages:
- Reduce network latency
- Improve user experience
- Lower cross-region bandwidth costs
Implementation:
```nginx
# Nginx geo module example: map client IP ranges to a region,
# then route to the matching upstream. The CIDR ranges are illustrative.
geo $cdn_region {
    default    us_east;
    1.0.0.0/8  us_east;
    2.0.0.0/8  us_west;
}

upstream cdn_us_east {
    server cdn-us-east-1.example.com;
}

upstream cdn_us_west {
    server cdn-us-west-1.example.com;
}

server {
    location / {
        proxy_pass http://cdn_$cdn_region;
    }
}
```
2. Proximity-based Routing
Select optimal node based on network latency:
Evaluation metrics:
- RTT: round-trip time between the user and the node
- Packet loss rate: an indicator of network path quality
- Bandwidth utilization: an indicator of node load
Algorithms:
- Active probing: Regularly measure latency of each node
- Passive measurement: Based on actual request response time
- Hybrid measurement: Combine active and passive data
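A minimal sketch of latency-based selection in Python, folding probe samples (active or passive) into an exponentially weighted moving average per node; the node names and smoothing factor are illustrative:

```python
# Sketch: pick the lowest-latency node using an exponentially weighted
# moving average (EWMA) of latency samples. Node names are hypothetical.

class LatencyRouter:
    def __init__(self, nodes, alpha=0.3):
        self.alpha = alpha                      # smoothing factor
        self.ewma = {n: None for n in nodes}    # smoothed RTT per node

    def record_sample(self, node, rtt_ms):
        """Fold one active-probe or passive response-time sample in."""
        prev = self.ewma[node]
        self.ewma[node] = rtt_ms if prev is None else (
            self.alpha * rtt_ms + (1 - self.alpha) * prev)

    def pick(self):
        """Return the node with the lowest smoothed latency."""
        measured = {n: v for n, v in self.ewma.items() if v is not None}
        return min(measured, key=measured.get)

router = LatencyRouter(["cdn-1", "cdn-2"])
router.record_sample("cdn-1", 80)
router.record_sample("cdn-2", 30)
router.record_sample("cdn-2", 120)  # one slow sample only nudges the average
print(router.pick())                # -> cdn-2
```

Smoothing is why the hybrid approach works in practice: a single slow probe does not bounce traffic away from an otherwise fast node.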
3. Round Robin
Distribute requests to nodes in sequence:
Features:
- Simple to implement
- Evenly distributed requests
- Doesn't consider node load
Use cases:
- Similar node performance
- Stable request volume
- Not latency-sensitive
Configuration example:
```nginx
upstream cdn_nodes {
    server cdn-1.example.com;
    server cdn-2.example.com;
    server cdn-3.example.com;
}
```
4. Weighted Round Robin
Assign different weights based on node performance:
Weight factors:
- Server performance: CPU, memory, bandwidth
- Geographic location: Priority regions
- Cost considerations: Lower cost nodes have higher weight
Configuration example:
```nginx
upstream cdn_nodes {
    server cdn-1.example.com weight=3;  # high-performance node
    server cdn-2.example.com weight=2;  # medium-performance node
    server cdn-3.example.com weight=1;  # low-performance node
}
```
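Under the hood, nginx uses a smooth weighted round-robin algorithm so a weight-3 node does not receive three requests in a burst. A sketch in Python, with illustrative node names and weights:

```python
# Sketch of smooth weighted round robin: each pick, every node gains its
# configured weight, the highest-scoring node wins and pays back the total.

class SmoothWRR:
    def __init__(self, weights):
        self.weights = dict(weights)            # node -> configured weight
        self.current = {n: 0 for n in weights}  # per-node running score

    def pick(self):
        total = sum(self.weights.values())
        for n, w in self.weights.items():
            self.current[n] += w                # everyone gains its weight
        best = max(self.current, key=self.current.get)
        self.current[best] -= total             # the winner pays the total
        return best

wrr = SmoothWRR({"cdn-1": 3, "cdn-2": 2, "cdn-3": 1})
print([wrr.pick() for _ in range(6)])
# spreads picks 3:2:1 while interleaving nodes instead of bursting
```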
5. Least Connections
Distribute requests to node with fewest current connections:
Advantages:
- Dynamically adapt to node load
- Avoid single node overload
- Improve resource utilization
Use cases:
- Large variation in request processing time
- Uneven node performance
- Need real-time load adjustment
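The policy above reduces to tracking in-flight requests per node. A minimal sketch, with hypothetical node names:

```python
# Sketch of least connections: route each new request to the node with the
# fewest in-flight requests; counts go up on start and down on completion.

class LeastConnections:
    def __init__(self, nodes):
        self.active = {n: 0 for n in nodes}   # in-flight request count

    def acquire(self):
        node = min(self.active, key=self.active.get)
        self.active[node] += 1                # request starts
        return node

    def release(self, node):
        self.active[node] -= 1                # request finished

lb = LeastConnections(["cdn-1", "cdn-2"])
a = lb.acquire()   # both idle -> first node
b = lb.acquire()   # cdn-1 is busy -> cdn-2
lb.release(a)      # cdn-1's request completes
c = lb.acquire()   # cdn-1 is idle again -> cdn-1
```

Because the count drops only when a request finishes, slow requests naturally steer new traffic away from the node handling them.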
6. Hash-based Routing
Distribute requests by hashing request attributes such as the client IP or the URL:
Hash methods:
- Source IP hash: Same user accesses same node
- URL hash: Same content accesses same node
- Consistent hashing: Minimal impact when nodes change
Advantages:
- Improve cache hit rate
- Maintain session consistency
- Reduce cache invalidation
Configuration example:
```nginx
upstream cdn_nodes {
    ip_hash;  # route each client IP to the same node
    server cdn-1.example.com;
    server cdn-2.example.com;
}
```
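Consistent hashing deserves its own sketch, since it is what keeps cache hit rates stable when nodes join or leave. A minimal ring with virtual nodes in Python; the replica count and key names are illustrative:

```python
import hashlib
from bisect import bisect

# Sketch of a consistent-hash ring: each node owns many points on the ring,
# and a key maps to the first point at or after its hash. Removing a node
# only remaps the keys that node's points owned.

class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []                                   # sorted (hash, node)
        for n in nodes:
            self.add(n)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def pick(self, key):
        h = self._hash(key)
        idx = bisect(self.ring, (h,)) % len(self.ring)   # wrap around the ring
        return self.ring[idx][1]

ring = HashRing(["cdn-1", "cdn-2", "cdn-3"])
before = {f"/asset/{i}.js": ring.pick(f"/asset/{i}.js") for i in range(1000)}
ring.remove("cdn-2")
moved = sum(1 for k, n in before.items()
            if n != "cdn-2" and ring.pick(k) != n)
print(moved)   # -> 0: only cdn-2's keys are remapped
```

With naive modulo hashing, removing one of three nodes would remap roughly two thirds of all keys and invalidate their caches; here only cdn-2's share moves.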
Health Check Mechanisms
1. Active Health Checks
Regularly probe node status:
Check methods:
- TCP check: Check if port is open
- HTTP check: Send HTTP request and check response
- Custom check: Execute specific health check script
Check frequency:
- Healthy nodes: Every 10-30 seconds
- Unhealthy nodes: Every 1-5 seconds
- Recovering nodes: Increase check frequency
Configuration example:
```nginx
# Note: max_fails/fail_timeout is nginx's built-in *passive* failure
# marking; true active probing (the health_check directive) requires
# NGINX Plus or an external health checker.
upstream cdn_nodes {
    server cdn-1.example.com max_fails=3 fail_timeout=30s;
    server cdn-2.example.com max_fails=3 fail_timeout=30s;
}
```
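A sketch of a standalone active HTTP checker in Python; the `/health` path, status criteria, and intervals are assumptions to adapt to your nodes:

```python
import http.client

# Sketch of an active health check: probe a node's /health endpoint
# (path and success criteria are assumptions) and vary probe frequency
# by current state, as described above.

HEALTHY_INTERVAL = 30    # seconds between probes for healthy nodes
UNHEALTHY_INTERVAL = 5   # probe failing nodes more aggressively

def probe(host, path="/health", timeout=2):
    """Return True if the node answers 2xx within the timeout."""
    try:
        conn = http.client.HTTPConnection(host, timeout=timeout)
        conn.request("GET", path)
        ok = 200 <= conn.getresponse().status < 300
        conn.close()
        return ok
    except (OSError, http.client.HTTPException):
        return False             # connection refused, timeout, bad response

def next_interval(healthy):
    """Healthy nodes are probed lazily; unhealthy ones aggressively."""
    return HEALTHY_INTERVAL if healthy else UNHEALTHY_INTERVAL
```

In production this loop would run per node in a scheduler, feeding results into the load balancer's node state.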
2. Passive Health Checks
Judge node status based on actual request responses:
Judgment metrics:
- Response time: Considered abnormal if exceeds threshold
- Error rate: Considered abnormal if error rate exceeds threshold
- Timeout rate: Considered abnormal if timeout rate exceeds threshold
Advantages:
- Reflects real user experience
- Adds no extra probe traffic
- Detects problems in near real time
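The error-rate judgment above can be sketched as a sliding window over recent request outcomes; the window size and threshold are illustrative:

```python
from collections import deque

# Sketch of passive health judgment: keep the last N request outcomes per
# node and mark the node unhealthy once the windowed error rate crosses
# a threshold.

class PassiveHealth:
    def __init__(self, window=100, max_error_rate=0.5):
        self.window = deque(maxlen=window)    # 1 = error, 0 = success
        self.max_error_rate = max_error_rate

    def record(self, ok):
        self.window.append(0 if ok else 1)

    def healthy(self):
        if not self.window:
            return True                       # no data yet: assume healthy
        return sum(self.window) / len(self.window) <= self.max_error_rate

node = PassiveHealth(window=10)
for _ in range(6):
    node.record(ok=True)
for _ in range(6):
    node.record(ok=False)        # errors push the windowed rate past 50%
print(node.healthy())            # -> False
```

Response time and timeout rate can be tracked the same way, with one window per metric.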
3. Health Check Response
Health states:
- Healthy: Normally receiving requests
- Unhealthy: Temporarily not receiving requests
- Recovering: Gradually restoring traffic
Failover:
- Automatically remove unhealthy nodes
- Redistribute traffic to healthy nodes
- Gradually add nodes back after recovery
Traffic Scheduling Optimization
1. Dynamic Weight Adjustment
Dynamically adjust node weights based on real-time conditions:
Adjustment factors:
- Current load: CPU, memory, network usage
- Response time: Average response time
- Error rate: Request error ratio
Adjustment strategy:
- Lower weight when load is high
- Lower weight when response is slow
- Lower weight when errors are frequent
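One way to combine these factors is to scale a node's base weight by its current health signals. The formula below is an assumption, not a standard; the factors would be tuned per system:

```python
# Sketch: derive an effective weight from a node's base weight, scaled
# down by load, latency, and error rate. The scaling formula and the
# 50 ms latency target are illustrative assumptions.

def effective_weight(base, cpu_load, avg_rtt_ms, error_rate,
                     rtt_target_ms=50.0):
    w = float(base)
    w *= max(0.0, 1.0 - cpu_load)                        # high load -> lower
    w *= min(1.0, rtt_target_ms / max(avg_rtt_ms, 1.0))  # slow -> lower
    w *= 1.0 - min(error_rate, 1.0)                      # errors -> lower
    return w

# A lightly loaded, fast, error-free node keeps most of its base weight;
# a loaded, slow, erroring node drops to a fraction of it.
print(effective_weight(10, cpu_load=0.2, avg_rtt_ms=40, error_rate=0.0))
print(effective_weight(10, cpu_load=0.8, avg_rtt_ms=200, error_rate=0.1))
```

The resulting weights would then feed a weighted round-robin scheduler, recomputed on each monitoring tick.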
2. Circuit Breaker
Trigger circuit breaker when node continuously fails:
Circuit breaker states:
- Closed: normal operation; requests flow to the node
- Open: tripped; requests are rejected without being forwarded
- Half-open: probing; a few trial requests test whether the node has recovered
Circuit breaker conditions:
- Error rate exceeds threshold (e.g., 50%)
- Response time exceeds threshold (e.g., 5 seconds)
- Consecutive failures exceed threshold
Recovery strategy:
- Wait a cool-down period after the breaker opens
- Let a small number of trial requests through
- Gradually restore full traffic if the trials succeed
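The three states and the recovery path can be sketched as a small state machine; the failure threshold and cool-down period are illustrative, and the injectable clock exists only to make the sketch testable:

```python
import time

# Sketch of a closed/open/half-open circuit breaker keyed on consecutive
# failures. Thresholds and the cool-down period are illustrative.

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after    # cool-down seconds while open
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_after:
                self.state = "half-open"  # let one trial request through
                return True
            return False                  # still cooling down: reject
        return True

    def on_success(self):
        self.failures = 0
        self.state = "closed"             # trial succeeded: fully restore

    def on_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.max_failures:
            self.state = "open"           # trip (or re-trip) the breaker
            self.opened_at = self.clock()

now = [0.0]                               # fake clock for the demo
cb = CircuitBreaker(max_failures=3, reset_after=30, clock=lambda: now[0])
for _ in range(3):
    cb.on_failure()
print(cb.state, cb.allow())   # open, and requests are rejected
now[0] = 31.0
print(cb.allow(), cb.state)   # cool-down elapsed: half-open trial allowed
```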
3. Rate Limiting and Degradation
Protect system from overload:
Rate limiting strategies:
- Global rate limiting: Limit total requests
- Node rate limiting: Limit requests per node
- User rate limiting: Limit requests per user
Degradation strategies:
- Static degradation: serve cached (possibly stale) content
- Dynamic degradation: serve a simplified version of the content
- Load shedding: reject excess requests outright
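The rate-limiting side is commonly implemented as a token bucket; a sketch with illustrative rate and burst values:

```python
# Sketch of a token-bucket rate limiter: each request spends one token,
# tokens refill at a fixed rate, and an empty bucket means the request is
# rejected (at which point a degradation path can take over).

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate                 # tokens added per second
        self.capacity = capacity         # maximum burst size
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # refill for the elapsed time, capped at capacity
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=10, capacity=5)      # 10 req/s, bursts of 5
burst = [bucket.allow(now=0.0) for _ in range(6)]
print(burst)                   # five allowed, the sixth rejected
print(bucket.allow(now=0.1))   # 0.1 s later one token has refilled
```

The same structure works at all three scopes listed above: one global bucket, one per node, or one per user key.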
Load Balancing Monitoring
1. Key Metrics
Performance metrics:
- Response time: P50, P95, P99
- Throughput: Requests per second
- Error rate: Request failure ratio
Load metrics:
- Node load: CPU, memory, network usage
- Connection count: Current number of connections
- Queue length: Number of requests waiting to be processed
Availability metrics:
- Node availability: Node online time ratio
- Failover count: Frequency of failover
- Recovery time: Time required for node recovery
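The latency percentiles above can be computed from raw samples with the nearest-rank method; the sample data here is illustrative:

```python
# Sketch: compute P50/P95/P99 response-time percentiles from a batch of
# latency samples using the nearest-rank method.

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))   # ceil without math import
    return ordered[rank - 1]

latencies_ms = list(range(1, 101))               # 1..100 ms, uniform
print(percentile(latencies_ms, 50),
      percentile(latencies_ms, 95),
      percentile(latencies_ms, 99))              # -> 50 95 99
```

P99 matters more than the average here: a node whose mean latency looks fine can still have a tail that dominates user-perceived slowness.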
2. Alert Mechanism
Alert levels:
- P1 (Critical): Node completely unavailable
- P2 (Important): Node performance severely degraded
- P3 (General): Node performance slightly degraded
Alert methods:
- Email notification
- SMS notification
- Instant messaging tools
- Monitoring dashboard
3. Automated Response
Auto-scaling:
- Automatically add nodes when load is high
- Pre-scale before predicted traffic peaks
Auto scale-in:
- Automatically remove nodes when load is low
- Save costs
Auto-recovery:
- Automatically restart abnormal nodes
- Automatically roll back faulty configuration changes
Common Issues and Solutions
Issue 1: Unbalanced Load
Causes:
- Unreasonable weight configuration
- Inaccurate health checks
- Sudden traffic spikes
Solutions:
- Adjust node weights
- Optimize health check strategy
- Add auto-scaling mechanism
Issue 2: Frequent Failover
Causes:
- Health checks too sensitive
- Network jitter
- Unstable node performance
Solutions:
- Adjust health check thresholds
- Add failover delay
- Optimize node performance
Issue 3: Low Cache Hit Rate
Causes:
- Improper load balancing strategy
- Frequent node switching
- Incorrect cache key configuration
Solutions:
- Use hash-based routing
- Increase node stickiness
- Optimize cache key configuration
Interview Points
When answering this question, emphasize:
- Understanding of different load balancing strategies and their use cases
- Understanding of the importance of health check mechanisms
- Mastery of traffic scheduling optimization methods
- Practical load balancing configuration experience
- Ability to analyze load balancing metrics and propose optimization suggestions