What is CDN origin pull? How to reduce CDN origin pull? - Interview Question

Concept of CDN Origin Pull

Origin pull refers to the process where a CDN edge node requests content from the origin server when it doesn't have the requested content cached. Origin pull is an important part of the CDN mechanism, directly affecting CDN performance and origin server load.

Origin Pull Trigger Conditions

1. Cache Miss

This is the most common reason for origin pull:

First access: Content has never been cached before
Cache expired: Content has exceeded TTL (Time To Live)
Cache cleared: The entry was actively purged or passively evicted
Cache key mismatch: Changed request parameters (for example, a different query string) produce a different cache key
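The four miss conditions above can be sketched with a toy in-memory cache. This is a minimal illustration, not how any particular CDN is implemented; `EdgeCache`, `fetch_from_origin`, and the injectable `clock` are hypothetical names introduced for this example.

```python
import time


class EdgeCache:
    """Toy edge cache; all names here are illustrative, not a real CDN API."""

    def __init__(self, fetch_from_origin, clock=time.monotonic):
        self._fetch = fetch_from_origin  # callable: cache key -> body
        self._clock = clock              # injectable so expiry can be simulated
        self._store = {}                 # cache key -> (body, expires_at)
        self.origin_pulls = 0

    def get(self, key, ttl_seconds):
        entry = self._store.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]              # cache hit: no origin pull
        # Cache miss: first access, expired TTL, purged entry, or a new cache key.
        self.origin_pulls += 1
        body = self._fetch(key)
        self._store[key] = (body, self._clock() + ttl_seconds)
        return body

    def purge(self, key):
        """Active refresh: drop the entry so the next request pulls from origin."""
        self._store.pop(key, None)
```

Because the cache key is the lookup key, `/a` and `/a?x=1` are separate entries here, which is why a changed parameter triggers a fresh origin pull.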

2. Special Request Types

Certain request types force origin pull:

POST requests: Usually not cached, so they go straight to the origin
Requests with specific headers: Such as Authorization or Cookie
Dynamic content: Content not cached based on business rules

3. Cache Strategy Configuration

Configuration rules can also send requests straight to the origin:

Non-cached paths: URL paths configured as non-cached
Specific users: Like logged-in users, VIP users, etc.
Specific time periods: Like real-time data needed during events
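The request-type and configuration rules above can be combined into a single bypass check. The sketch below is a simplified assumption of how such a rule set might look; `must_go_to_origin` and the `/api/` prefix are hypothetical, and real CDNs expose these rules through their own configuration rather than code like this.

```python
def must_go_to_origin(method, path, headers, bypass_prefixes=("/api/",)):
    """Return True if the request should skip the cache (illustrative rules only)."""
    # Only GET and HEAD are treated as cacheable in this sketch.
    if method.upper() not in ("GET", "HEAD"):
        return True
    # Header names are case-insensitive, so compare in lowercase.
    names = {name.lower() for name in headers}
    if "authorization" in names or "cookie" in names:
        return True
    # Paths configured as non-cached always go to the origin.
    return any(path.startswith(prefix) for prefix in bypass_prefixes)
```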

Impact of Origin Pull on Performance

1. Increased Latency

Origin pull requests go through the complete network path:

User → Edge node: Usually <50ms
Edge node → Origin: Possibly 100-500ms
Origin → Edge node → User: Round-trip time accumulates

Total latency: Typically <50ms on a cache hit, 200-1000ms on an origin pull
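To see how the hit ratio shapes what users experience, the average latency can be estimated as a weighted mean of hit and miss latency. This is a rough model under the assumption that every request is either a pure hit or a full origin pull; `average_latency_ms` is a name made up for this example.

```python
def average_latency_ms(hit_ratio, hit_ms, miss_ms):
    """Weighted average latency for a given cache hit ratio (0.0 to 1.0)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms
```

With a 90% hit ratio, 50ms hits, and 500ms origin pulls, the average comes out to about 95ms, so the 10% of requests that miss account for roughly half of the total latency.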

2. Increased Origin Load

Origin pull requests hit the origin server directly:

Bandwidth consumption: All origin pull requests consume origin bandwidth
Server pressure: Increases origin CPU, memory, database pressure
Concurrency limits: May trigger origin server concurrency limits

3. Increased Cost

Bandwidth cost: Origin pull traffic is usually billed
Origin cost: May need to upgrade origin server configuration
Traffic cost: Additional fees for exceeding quotas

Strategies to Reduce Origin Pull

1. Optimize Caching Strategy

Set TTL Reasonably

http
// Static resources: Long TTL
Cache-Control: public, max-age=31536000, immutable

// Dynamic content: Short TTL
Cache-Control: public, max-age=60

// Non-cached content
Cache-Control: no-store
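One way an origin might apply these three policies is to pick the header from the resource type. The sketch below is an assumption about how that could be wired up; `cache_control_for`, `STATIC_EXTENSIONS`, and the `is_private` flag are hypothetical, and the extension list is only an example.

```python
import posixpath

# Example set of extensions treated as long-lived static assets (an assumption).
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".woff2"}


def cache_control_for(path, is_private=False):
    """Pick a Cache-Control value matching the three policies above."""
    if is_private:
        return "no-store"                              # never cached
    ext = posixpath.splitext(path)[1].lower()
    if ext in STATIC_EXTENSIONS:
        return "public, max-age=31536000, immutable"   # long TTL for static files
    return "public, max-age=60"                        # short TTL for everything else
```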

Use Versioning

Avoid origin pull through URL versioning:

shell
# Not recommended: the cache must be purged after every update
style.css

# Recommended: change the URL whenever the content changes
style.v1.css
style.v2.css
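A common variation on manual version numbers is to derive the version from the file content, so the URL changes only when the content does. This swaps the `v1`/`v2` scheme above for a content hash; `versioned_name` and the 8-character digest length are choices made for this sketch, not something the scheme requires.

```python
import hashlib
import posixpath


def versioned_name(path, content, digest_len=8):
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    stem, ext = posixpath.splitext(path)
    return f"{stem}.{digest}{ext}"
```

Identical content always maps to the same name, so an unchanged file keeps its long-lived cache entry across deployments.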

2. Cache Warming

Proactively load content into the CDN before it is released:

Warming timing: 1-2 hours before content release
Warming content: Content expected to be accessed frequently
Warming method: Through CDN API or management console

Example:

bash
# Warm up a specific URL (api.cdn.com is a placeholder; use your CDN provider's prefetch API)
curl -X POST "https://api.cdn.com/prefetch" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/image.jpg"]}'
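When many URLs need warming, it can help to deduplicate them and send them in batches. The sketch below only builds the JSON request bodies; it assumes the same `{"urls": [...]}` body shape as the curl example, which is itself a placeholder, and `prefetch_payloads` and its `batch_size` default are hypothetical.

```python
import json


def prefetch_payloads(urls, batch_size=100):
    """Yield JSON bodies for a prefetch API, deduplicated and split into batches."""
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    for start in range(0, len(unique), batch_size):
        yield json.dumps({"urls": unique[start:start + batch_size]})
```

Each yielded string could then be sent as the request body of one prefetch call, subject to whatever batch and rate limits the provider documents.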

3. Configure Ignore Parameters

Ignore query parameters that don't affect content:

shell
# With the timestamp parameter configured as ignored:
https://example.com/data?timestamp=123456
https://example.com/data?timestamp=789012
# Both requests map to the same cache key and hit the same cached entry
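The effect of ignoring a parameter can be sketched as a cache key normalization step. This is a minimal illustration using Python's standard `urllib.parse`; `cache_key` and `IGNORED_PARAMS` are names invented for the example, and sorting the remaining parameters is an extra assumption that makes parameter order irrelevant as well.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that do not affect the response body (example configuration).
IGNORED_PARAMS = frozenset({"timestamp"})


def cache_key(url, ignored=IGNORED_PARAMS):
    """Build a cache key that drops ignored query parameters and sorts the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    )
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```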

4. Use Edge Computing

Process simple logic at CDN edge nodes:

Request routing: Return different content based on user type
Simple calculations: Like timestamp conversion, formatting, etc.
A/B testing: Assign test groups at edge nodes
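As an illustration of the A/B testing case, a group can be assigned deterministically by hashing the user and experiment identifiers, so no lookup at the origin is needed. This is a sketch under the assumption that the edge runtime can run such logic; `ab_group` and its parameters are hypothetical, and actual edge platforms usually run JavaScript or WebAssembly rather than Python.

```python
import hashlib


def ab_group(user_id, experiment, groups=("A", "B")):
    """Deterministically map a user to a test group without calling the origin."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(groups)
    return groups[index]
```

Because the same inputs always produce the same group, every edge node makes the same assignment without shared state.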

5. Hierarchical Caching

Utilize CDN's multi-level caching architecture:

Edge cache: First level, small capacity but fast response
Regional cache: Second level, medium capacity
Origin cache (often called an origin shield): Last level before the origin, largest capacity

Advantage: Even if the edge cache misses, the regional cache may still hit, which avoids an origin pull
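The multi-level lookup can be sketched as a walk through the tiers from the edge toward the origin. This is a simplified model in which each tier is a plain dictionary; `tiered_get` is a hypothetical name, and real CDNs also apply per-tier TTLs and eviction, which are omitted here.

```python
def tiered_get(key, tiers, fetch_from_origin):
    """Look up key in tiers ordered from edge to origin-side cache.

    Returns (body, depth): depth is the index of the tier that hit,
    or None when every tier missed and the origin was contacted.
    """
    for depth, tier in enumerate(tiers):
        if key in tier:
            body = tier[key]
            for closer in tiers[:depth]:
                closer[key] = body       # backfill the tiers nearer the user
            return body, depth
    body = fetch_from_origin(key)        # every tier missed: real origin pull
    for tier in tiers:
        tier[key] = body
    return body, None
```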

Origin Pull Optimization Techniques

1. Compressed Transmission

Reduce data transfer during origin pull:

http
// Edge node advertises the encodings it accepts
Accept-Encoding: gzip, deflate, br

// Origin responds with compressed content
Content-Encoding: gzip

Effect: Compression typically reduces the transfer volume of text content by 60-80%
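The saving for a given payload can be measured directly. The sketch below uses Python's standard `gzip` module; `compression_saving` is a name made up for this example, and the actual ratio depends heavily on the content, so no specific figure is implied.

```python
import gzip


def compression_saving(payload: bytes) -> float:
    """Return the fraction of bytes saved by gzip (0.0 means no saving)."""
    if not payload:
        raise ValueError("payload must not be empty")
    compressed = gzip.compress(payload)
    return 1.0 - len(compressed) / len(payload)
```

Already-compressed formats such as JPEG or MP4 gain little from this, which is why compression is usually enabled only for text types.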

2. Use HTTP/2 or HTTP/3

Leverage advantages of new protocols:

HTTP/2: Multiplexing reduces the number of connections
HTTP/3: Built on QUIC over UDP, which reduces connection establishment time

3. Optimize Origin Performance

Ensure origin can respond quickly:

Database optimization: Add indexes, optimize queries
Cache layer: Use Redis, Memcached
Load balancing: Multiple origin servers share load

4. Monitor Origin Pull

Monitor origin pull metrics in real time:

Origin pull rate: Ratio of origin pull requests to total requests
Origin pull latency: Average response time of origin pull requests
Origin pull bandwidth: Bandwidth consumed by origin pull traffic

Typical targets: Origin pull rate <10%, origin pull latency <500ms
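The first two metrics can be computed from access log records. This is a minimal sketch that assumes each record is a `(cache_status, origin_ms)` pair with `"HIT"` marking a cache hit; that record shape and the name `origin_pull_metrics` are assumptions, since log formats differ between CDN providers.

```python
def origin_pull_metrics(records):
    """Compute origin pull rate and average origin pull latency.

    records: iterable of (cache_status, origin_ms) pairs, where any
    status other than "HIT" is counted as an origin pull.
    """
    total = 0
    pull_latencies = []
    for status, origin_ms in records:
        total += 1
        if status != "HIT":
            pull_latencies.append(origin_ms)
    if total == 0:
        return {"pull_rate": 0.0, "avg_pull_ms": 0.0}
    pulls = len(pull_latencies)
    avg_ms = sum(pull_latencies) / pulls if pulls else 0.0
    return {"pull_rate": pulls / total, "avg_pull_ms": avg_ms}
```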

Common Origin Pull Issues

Issue 1: High Origin Pull Rate

Cause analysis:

TTL set too short
Improper cache key configuration
Many dynamic requests

Solutions:

Extend TTL for static resources
Optimize cache key configuration
Implement edge computing for dynamic content

Issue 2: High Origin Pull Latency

Cause analysis:

Poor origin performance
Long network distance
High origin load

Solutions:

Optimize origin performance
Route origin pulls to the nearest origin node
Implement origin load balancing

Issue 3: High Origin Pull Bandwidth Cost

Cause analysis:

Many origin pulls for large files
Compression not enabled
High origin pull rate

Solutions:

Implement cache warming for large files
Enable compressed transmission
Reduce origin pull rate

Interview Points

When answering this question, emphasize:

Clear understanding of origin pull concept and trigger conditions
Understanding of origin pull's impact on performance and cost
Mastery of multiple strategies to reduce origin pull
Practical optimization experience and case studies
Ability to analyze origin pull metrics and propose improvement plans