Distributed tracing follows a request's call chain across multiple services, which makes it an important tool for quickly locating problems and analyzing performance in distributed systems.
Core Concepts:
1. Trace
- A complete request call chain
- Covers the entire process, from the client initiating the request to the final response
- Contains multiple Spans
2. Span
- A specific call operation
- Includes start time, end time, operation name, etc.
- Spans form a call tree through parent-child relationships
3. Span ID
- Uniquely identifies a Span
- Used to build the call chain
4. Trace ID
- Uniquely identifies a complete trace
- All related Spans share the same Trace ID
5. Parent Span ID
- Identifies the parent Span of the current Span
- Used to build call hierarchy
6. Annotation
- Records timestamps of key events
- Examples: CS (Client Send), SR (Server Receive), SS (Server Send), CR (Client Receive)
7. Baggage
- Key-value data passed along the call chain
- Used to pass context information between services
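The relationship between Trace ID, Span ID, and Parent Span ID can be illustrated with a minimal sketch. The `SpanRecord` and `CallTree` classes below are hypothetical and do not follow any particular tool's schema; they only show how spans that share a Trace ID can be assembled into a call tree through their parent references.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal span model; field names are illustrative only. */
class SpanRecord {
    final String traceId;      // shared by every span in the trace
    final String spanId;       // unique per span
    final String parentSpanId; // null for the root span
    final String operation;
    final long startMicros;
    final long endMicros;

    SpanRecord(String traceId, String spanId, String parentSpanId,
               String operation, long startMicros, long endMicros) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.operation = operation;
        this.startMicros = startMicros;
        this.endMicros = endMicros;
    }

    long durationMicros() {
        return endMicros - startMicros;
    }
}

class CallTree {
    /** Groups spans by parent span ID; root spans are stored under the key "". */
    static Map<String, List<SpanRecord>> childrenByParent(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> children = new HashMap<>();
        for (SpanRecord s : spans) {
            String parent = s.parentSpanId == null ? "" : s.parentSpanId;
            children.computeIfAbsent(parent, k -> new ArrayList<>()).add(s);
        }
        return children;
    }

    /** Renders the call tree as indented lines, one span per line. */
    static String render(List<SpanRecord> spans) {
        StringBuilder out = new StringBuilder();
        append(childrenByParent(spans), "", 0, out);
        return out.toString();
    }

    private static void append(Map<String, List<SpanRecord>> children,
                               String parentId, int depth, StringBuilder out) {
        List<SpanRecord> kids = children.getOrDefault(parentId, Collections.<SpanRecord>emptyList());
        for (SpanRecord s : kids) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(s.operation).append(" (").append(s.durationMicros()).append("us)\n");
            append(children, s.spanId, depth + 1, out);
        }
    }
}
```

Calling `CallTree.render(spans)` on the spans of one trace yields an indented outline of the request, which is roughly what a tracing UI draws as a waterfall view.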
Main Tracing Tools:
1. Zipkin
- Features: Open-sourced by Twitter, based on Google's Dapper paper
- Advantages:
  - Mature and stable, with an active community
  - Supports multiple languages
  - Friendly visualization interface
- Disadvantages:
  - Average storage performance
  - Relatively simple functionality
- Applicable Scenarios: Small and medium-sized distributed systems
2. Jaeger
- Features: Open-sourced by Uber, compatible with the Zipkin API
- Advantages:
  - Excellent performance, supports high concurrency
  - Supports multiple storage backends
  - More complete functionality
- Disadvantages:
  - Younger project than Zipkin
- Applicable Scenarios: Distributed systems with high performance requirements
3. SkyWalking
- Features: Open source project that originated in China (now an Apache top-level project), focused on APM
- Advantages:
  - Comprehensive features (tracing, performance monitoring, log analysis)
  - Good Java support
  - Complete Chinese documentation
- Disadvantages:
  - Relatively weak support for other languages
- Applicable Scenarios: Microservice architectures built mainly on Java
4. Pinpoint
- Features: Open-sourced by Naver, focused on Java
- Advantages:
  - Non-intrusive: the agent instruments bytecode, so no application code changes are needed
  - Detailed performance analysis
- Disadvantages:
  - Focused on Java; support for other languages is limited
  - High resource usage
- Applicable Scenarios: Java single-language environments
5. OpenTelemetry
- Features: Hosted by CNCF, unified observability standard
- Advantages:
  - Unified API and SDK
  - Multi-language support
  - Compatible with multiple backends
- Disadvantages:
  - Relatively new, ecosystem still developing
- Applicable Scenarios: Projects requiring unified observability standards
Implementation Principles:
1. Context Propagation
- Pass Trace ID and Span ID during service calls
- Pass through HTTP headers, RPC metadata, etc.
- Example:
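```java
// gRPC context propagation: io.grpc.Context is process-local, so the trace ID
// is sent as request metadata (imports: io.grpc.Metadata, io.grpc.stub.MetadataUtils).
// The "trace-id" header name is illustrative, not a standard.
Metadata headers = new Metadata();
headers.put(Metadata.Key.of("trace-id", Metadata.ASCII_STRING_MARSHALLER), traceId);
stub.withInterceptors(MetadataUtils.newAttachHeadersInterceptor(headers))
    .withDeadlineAfter(timeout, TimeUnit.MILLISECONDS)
    .sayHello(request);
```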
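For HTTP, one widely used header format is the W3C Trace Context `traceparent` header, which packs the Trace ID, the calling Span ID, and a sampling flag into a single value. The sketch below is a simplified illustration that handles version `00` only; the `TraceParent` class name is hypothetical, and a real system would normally rely on its tracing library for this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified traceparent handling: version 00 only, no tracestate support. */
class TraceParent {
    // Layout: version "-" 32-hex trace ID "-" 16-hex parent span ID "-" 2-hex flags
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    TraceParent(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /** Serializes to the header value sent with an outgoing request. */
    String toHeader() {
        return "00-" + traceId + "-" + parentSpanId + "-" + (sampled ? "01" : "00");
    }

    /** Parses an incoming header value; returns null if it is missing or malformed. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero IDs are invalid in the specification.
        if (isAllZeros(traceId) || isAllZeros(spanId)) {
            return null;
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 1) == 1;
        return new TraceParent(traceId, spanId, sampled);
    }

    private static boolean isAllZeros(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

A receiving service would call `TraceParent.parse(request.getHeader("traceparent"))`, start a new trace when the result is `null`, and otherwise create a child span that reuses the parsed Trace ID.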
2. Interceptor/Filter
- Intercept at request entry and exit
- Record call start and end times
- Example:
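```java
import java.util.UUID;
import javax.servlet.http.HttpServletRequest;  // jakarta.servlet.http on Spring Boot 3+
import javax.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;

@Component
public class TraceInterceptor implements HandlerInterceptor {
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {
        // Reuse the caller's trace ID when one is present (B3 header, as used by Zipkin,
        // is assumed here); otherwise this request starts a new trace.
        String traceId = request.getHeader("X-B3-TraceId");
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString().replace("-", "");
        }
        MDC.put("traceId", traceId);
        return true;
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response,
                                Object handler, Exception ex) {
        // Clear the MDC so the ID does not leak into the next request on this pooled thread.
        MDC.remove("traceId");
    }
}
```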
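Recording start and end times around a call can be shown without any web framework. The following is a minimal sketch using only the standard library; `TimedSpans` and `Finished` are hypothetical names, and the class is not thread-safe. The point it illustrates is that the end time is captured in a `finally` block, so a span is recorded even when the call throws.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class TimedSpans {
    /** One completed measurement. */
    static final class Finished {
        final String operation;
        final long durationNanos;
        final boolean failed;

        Finished(String operation, long durationNanos, boolean failed) {
            this.operation = operation;
            this.durationNanos = durationNanos;
            this.failed = failed;
        }
    }

    private final List<Finished> finished = new ArrayList<>();

    /** Runs the call, measuring it at entry and exit, and records the result. */
    <T> T trace(String operation, Supplier<T> call) {
        long start = System.nanoTime();
        boolean failed = true;
        try {
            T result = call.get();
            failed = false;
            return result;
        } finally {
            // Runs on both normal return and exception, so no span is lost.
            finished.add(new Finished(operation, System.nanoTime() - start, failed));
        }
    }

    List<Finished> finished() {
        return finished;
    }
}
```

An interceptor or filter plays the role of `trace` here: it wraps the handler, so business code does not need to know it is being measured.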
3. Sampling Strategy
- Fixed Sampling Rate: Sample at a fixed proportion
- Dynamic Sampling: Dynamically adjust based on request characteristics
- Error Priority: Prioritize sampling error requests
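A fixed-rate sampler with error priority might look like the sketch below. It assumes the trace ID is a hexadecimal string and derives the decision from the ID itself, so every service that sees the same trace ID reaches the same decision at the same rate. The `TraceSampler` name and the 10,000-bucket resolution are illustrative choices. Note that prioritizing errors this way only works where the error is already known at decision time; in practice that usually means deciding after the request has finished.

```java
class TraceSampler {
    private static final long BUCKETS = 10_000;
    private final long threshold; // number of buckets (out of 10,000) that are sampled

    TraceSampler(double rate) {
        if (rate < 0.0 || rate > 1.0) {
            throw new IllegalArgumentException("rate must be in [0, 1]");
        }
        this.threshold = Math.round(rate * BUCKETS);
    }

    /**
     * Returns true if the trace should be kept.
     * Throws NumberFormatException if traceIdHex is not valid hexadecimal.
     */
    boolean shouldSample(String traceIdHex, boolean isError) {
        if (isError) {
            return true; // error priority: always keep failed requests
        }
        // Use at most the low 64 bits of the ID so it fits in a long.
        String tail = traceIdHex.length() > 16
            ? traceIdHex.substring(traceIdHex.length() - 16)
            : traceIdHex;
        long bucket = Long.remainderUnsigned(Long.parseUnsignedLong(tail, 16), BUCKETS);
        return bucket < threshold;
    }
}
```

For example, `new TraceSampler(0.1)` keeps roughly one in ten traces when IDs are uniformly random, and always keeps traces flagged as errors.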
4. Data Reporting
- Asynchronous reporting to avoid affecting business performance
- Support batch reporting to reduce network overhead
- Support multiple transport protocols (HTTP, gRPC, Kafka)
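Asynchronous batch reporting can be sketched with a bounded queue: the request path only enqueues, and a background thread drains the queue in batches. In the sketch below, `BatchSpanReporter` is a hypothetical class, spans are assumed to be already encoded as strings, and the `sender` callback stands in for whatever transport is used (HTTP, gRPC, Kafka). Dropping spans when the queue is full is one possible policy, chosen here so that tracing never blocks business code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class BatchSpanReporter {
    private final BlockingQueue<String> queue;
    private final int batchSize;
    private final Consumer<List<String>> sender;

    BatchSpanReporter(int capacity, int batchSize, Consumer<List<String>> sender) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Called on the request path. Never blocks; returns false if the span was dropped. */
    boolean report(String encodedSpan) {
        return queue.offer(encodedSpan);
    }

    /** Sends everything currently queued, in batches; returns the number of spans sent. */
    int flush() {
        int sent = 0;
        List<String> batch = new ArrayList<>(batchSize);
        while (queue.drainTo(batch, batchSize) > 0) {
            sender.accept(new ArrayList<>(batch)); // copy so the sender owns its list
            sent += batch.size();
            batch.clear();
        }
        return sent;
    }

    /** Starts a daemon thread that flushes at a fixed interval. */
    ScheduledExecutorService start(long intervalMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "span-reporter");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::flush, intervalMillis, intervalMillis,
                                         TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

In this sketch an exception thrown by `sender` would stop the scheduled flushes, so a production reporter would also need to catch and log transport failures.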
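Spring Cloud Sleuth Integration Example:
```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
// The package of @EnableZipkinServer depends on the Zipkin server version in use.

// Zipkin server
@SpringBootApplication
@EnableZipkinServer
public class ZipkinServerApplication {
    public static void main(String[] args) {
        SpringApplication.run(ZipkinServerApplication.class, args);
    }
}
```
```yaml
# Client configuration (application.yml)
spring:
  zipkin:
    base-url: http://localhost:9411
  sleuth:
    sampler:
      probability: 0.1 # 10% sampling rate
```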
Use Cases:
1. Performance Analysis
- Identify slow queries and slow services
- Analyze performance bottlenecks in call chains
- Optimize system performance
2. Troubleshooting
- Quickly locate problematic services
- Track error propagation paths
- Analyze root causes of failures
3. Dependency Analysis
- Understand service dependencies
- Identify unreasonable calls
- Optimize service architecture
4. Capacity Planning
- Analyze system load distribution
- Predict resource requirements
- Optimize resource allocation
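As an illustration of dependency analysis, service-to-service edges can be derived from span parent links: whenever a span's parent belongs to a different service, that is one cross-service call. The sketch below assumes each span carries a service name; `ServiceDependencies` and `Hop` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ServiceDependencies {
    /** The part of a span needed for dependency analysis. */
    static final class Hop {
        final String spanId;
        final String parentSpanId; // null for the root span
        final String service;

        Hop(String spanId, String parentSpanId, String service) {
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.service = service;
        }
    }

    /** Counts cross-service calls, keyed as "caller -> callee" and sorted by key. */
    static Map<String, Integer> edges(List<Hop> spans) {
        Map<String, String> serviceBySpan = new HashMap<>();
        for (Hop h : spans) {
            serviceBySpan.put(h.spanId, h.service);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Hop h : spans) {
            String caller = h.parentSpanId == null ? null : serviceBySpan.get(h.parentSpanId);
            // Skip root spans, spans whose parent is missing, and in-service calls.
            if (caller != null && !caller.equals(h.service)) {
                counts.merge(caller + " -> " + h.service, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Aggregating these edge counts over many traces gives a dependency graph, where unusually high counts per request can point to chatty calls worth reviewing.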
Best Practices:
- Set the sampling rate deliberately to balance tracing overhead against observability
- Combine traces with logs and monitoring to form a complete observability system
- Regularly analyze trace data to optimize system performance
- Use one Trace ID format across all systems so traces can be followed end to end
- Protect sensitive information: avoid passing sensitive data in span data or baggage
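One way to keep sensitive data out of propagated context is to forward only baggage keys from an explicit allowlist. The sketch below is a minimal illustration; `BaggageFilter` is a hypothetical helper, and the choice of an allowlist over a blocklist is an assumption made here so that new, unreviewed keys are not forwarded by default.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class BaggageFilter {
    /** Returns a copy of the baggage containing only allowlisted keys, in original order. */
    static Map<String, String> allowed(Map<String, String> baggage, Set<String> allowedKeys) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : baggage.entrySet()) {
            if (allowedKeys.contains(entry.getKey())) {
                out.put(entry.getKey(), entry.getValue());
            }
        }
        return out;
    }
}
```

Such a filter would typically run just before outgoing headers are written, so anything not on the list stays inside the service that created it.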