
What is distributed tracing? What are the mainstream tracing tools? How do they work?

February 22, 14:03

Distributed tracing follows a single request across every service it touches and records the resulting call chain, which makes it an essential tool for locating faults and analyzing performance in distributed systems:

Core Concepts:

1. Trace

  • A complete request call chain
  • The entire process from client initiating request to final response
  • Contains multiple Spans

2. Span

  • A specific call operation
  • Includes start time, end time, operation name, etc.
  • Spans form a call tree through parent-child relationships

3. Span ID

  • Uniquely identifies a Span
  • Used to build the call chain

4. Trace ID

  • Uniquely identifies a complete trace
  • All related Spans share the same Trace ID

5. Parent Span ID

  • Identifies the parent Span of the current Span
  • Used to build call hierarchy
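To make the relationship between Trace ID, Span ID and Parent Span ID concrete, here is a minimal sketch that rebuilds a call tree from a flat list of spans. `SpanRecord` and `TraceTree` are hypothetical names chosen for illustration; real tracers store more fields (tags, status, service name) and use their own data model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified span model for illustration only. */
final class SpanRecord {
    final String traceId;      // shared by every span of the same request
    final String spanId;       // unique within the trace
    final String parentSpanId; // null for the root span
    final String operation;
    final long startMicros;
    final long endMicros;

    SpanRecord(String traceId, String spanId, String parentSpanId,
               String operation, long startMicros, long endMicros) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.operation = operation;
        this.startMicros = startMicros;
        this.endMicros = endMicros;
    }

    long durationMicros() {
        return endMicros - startMicros;
    }
}

final class TraceTree {
    /** Groups spans by their parent ID so the tree can be walked from the root. */
    static Map<String, List<SpanRecord>> childrenByParent(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> children = new LinkedHashMap<>();
        for (SpanRecord span : spans) {
            if (span.parentSpanId != null) {
                children.computeIfAbsent(span.parentSpanId, k -> new ArrayList<>()).add(span);
            }
        }
        return children;
    }

    /** Renders the call tree as indented text, one span per line. */
    static String render(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> children = childrenByParent(spans);
        StringBuilder out = new StringBuilder();
        for (SpanRecord span : spans) {
            if (span.parentSpanId == null) {
                append(out, span, children, 0);
            }
        }
        return out.toString();
    }

    private static void append(StringBuilder out, SpanRecord span,
                               Map<String, List<SpanRecord>> children, int depth) {
        out.append("  ".repeat(depth))
           .append(span.operation)
           .append(" (").append(span.durationMicros()).append("us)\n");
        for (SpanRecord child : children.getOrDefault(span.spanId, List.of())) {
            append(out, child, children, depth + 1);
        }
    }
}
```

Because every span carries only its own ID and its parent's ID, spans can be reported independently by different services and still be assembled into one tree by the backend.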

6. Annotation

  • Records timestamps of key events
  • Such as CS (Client Send), SR (Server Receive), SS (Server Send), CR (Client Receive)
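The four annotations are enough to split one remote call into server time and network time. The sketch below assumes all four timestamps are available in the same unit; `RpcTimings` is a hypothetical helper, not part of any tracing library. Each difference is taken between two timestamps from the same host, so clock skew between client and server does not distort the result.

```java
/** Hypothetical helper that derives durations from CS/SR/SS/CR timestamps. */
final class RpcTimings {
    final long clientDuration;  // CR - CS, measured on the client clock
    final long serverDuration;  // SS - SR, measured on the server clock
    final long networkDuration; // what remains: time on the wire in both directions

    private RpcTimings(long clientDuration, long serverDuration) {
        this.clientDuration = clientDuration;
        this.serverDuration = serverDuration;
        this.networkDuration = clientDuration - serverDuration;
    }

    static RpcTimings of(long cs, long sr, long ss, long cr) {
        if (cr < cs || ss < sr) {
            throw new IllegalArgumentException("end timestamp precedes start timestamp");
        }
        return new RpcTimings(cr - cs, ss - sr);
    }
}
```

For example, `RpcTimings.of(1000, 1020, 1080, 1110)` describes a call that took 110 units from the client's point of view, of which 60 were spent in the server and 50 on the network.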

7. Baggage

  • Key-value data passed along the call chain
  • Used to pass context information between services

Main Tracing Tools:

1. Zipkin

  • Features: Open-sourced by Twitter, based on Google's Dapper paper
  • Advantages:
    • Mature and stable, active community
    • Supports multiple languages
    • Friendly visualization interface
  • Disadvantages:
    • Average storage performance
    • Relatively simple functionality
  • Applicable Scenarios: Small and medium-sized distributed systems

2. Jaeger

  • Features: Open-sourced by Uber, compatible with the Zipkin API
  • Advantages:
    • Excellent performance, supports high concurrency
    • Supports multiple storage backends
    • More complete functionality
  • Disadvantages:
    • Relatively new
  • Applicable Scenarios: Distributed systems with high performance requirements

3. SkyWalking

  • Features: Open-source project that originated in China (now an Apache project), focused on APM
  • Advantages:
    • Comprehensive features (tracing, performance monitoring, log analysis)
    • Good Java support
    • Complete Chinese documentation
  • Disadvantages:
    • Relatively weak support for other languages
  • Applicable Scenarios: Microservice architecture mainly using Java

4. Pinpoint

  • Features: Open-sourced by Naver, focused on Java
  • Advantages:
    • No code intrusion
    • Detailed performance analysis
  • Disadvantages:
    • Only supports Java
    • High resource usage
  • Applicable Scenarios: Java single-language environment

5. OpenTelemetry

  • Features: Hosted by CNCF, unified observability standard
  • Advantages:
    • Unified API and SDK
    • Multi-language support
    • Compatible with multiple backends
  • Disadvantages:
    • Relatively new, ecosystem still developing
  • Applicable Scenarios: Projects requiring unified observability standards

Implementation Principles:

1. Context Propagation

  • Pass the Trace ID and Span ID along with every service call
  • Carry them in HTTP headers, RPC metadata, or message headers
  • Example:
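    java
    // gRPC context propagation: attach the trace ID to the current Context
    Context ctx = Context.current().withValue(TRACE_ID_KEY, traceId);
    Context previous = ctx.attach();
    try {
        // A ClientInterceptor still has to copy the ID into the call's
        // Metadata; Context values alone do not cross the network.
        stub.withDeadlineAfter(timeout, TimeUnit.MILLISECONDS).sayHello(request);
    } finally {
        ctx.detach(previous);
    }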
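For HTTP, a common wire format for these IDs is the W3C Trace Context `traceparent` header, which has the form `00-<32 hex trace ID>-<16 hex span ID>-<2 hex flags>`. The sketch below formats and parses that header using only the JDK. The class name `TraceParent` and its methods are hypothetical, and only version `00` of the header is handled.

```java
import java.security.SecureRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value class for a version-00 W3C traceparent header. */
final class TraceParent {
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
    private static final String ZERO_TRACE_ID = "0".repeat(32);
    private static final String ZERO_SPAN_ID = "0".repeat(16);
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;
    final String spanId;
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Starts a new trace at the edge of the system. */
    static TraceParent newRoot(boolean sampled) {
        return new TraceParent(randomHex(16), randomHex(8), sampled);
    }

    /** Same trace, fresh span ID: used for each outgoing call. */
    TraceParent newChild() {
        return new TraceParent(traceId, randomHex(8), sampled);
    }

    /** Value to place in the outgoing "traceparent" header. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /** Returns null when the header is missing or malformed. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero IDs are defined as invalid by the specification
        if (traceId.equals(ZERO_TRACE_ID) || spanId.equals(ZERO_SPAN_ID)) {
            return null;
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 1) == 1;
        return new TraceParent(traceId, spanId, sampled);
    }

    private static String randomHex(int byteCount) {
        byte[] bytes = new byte[byteCount];
        RANDOM.nextBytes(bytes);
        StringBuilder hex = new StringBuilder(byteCount * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

A receiving service would call `TraceParent.parse(...)` on the incoming header, fall back to `newRoot(...)` when it returns null, and send `newChild().toHeader()` on each downstream request.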

2. Interceptor/Filter

  • Intercept at request entry and exit
  • Record call start and end times
  • Example:
    java
    @Component
    public class TraceInterceptor implements HandlerInterceptor {

        @Override
        public boolean preHandle(HttpServletRequest request,
                                 HttpServletResponse response,
                                 Object handler) {
            // generateTraceId() is an application-defined helper (not shown)
            String traceId = generateTraceId();
            // Expose the ID to the logging framework for this request
            MDC.put("traceId", traceId);
            return true;
        }

        @Override
        public void afterCompletion(HttpServletRequest request,
                                    HttpServletResponse response,
                                    Object handler, Exception ex) {
            // Clear the ID so it does not leak to the next request on this thread
            MDC.remove("traceId");
        }
    }

3. Sampling Strategy

  • Fixed Sampling Rate: Sample at a fixed proportion
  • Dynamic Sampling: Dynamically adjust based on request characteristics
  • Error Priority: Prioritize sampling error requests
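A fixed-rate sampler is usually made deterministic on the Trace ID, so that every service reaches the same decision for the same request without coordination. The sketch below shows that idea together with a simple error-priority rule. `TraceSampler` is a hypothetical class, the Trace ID is assumed to be a hex string of at least 16 characters, and dynamic sampling is not covered.

```java
/** Hypothetical sampler: fixed ratio keyed on the trace ID, plus error priority. */
final class TraceSampler {
    private final double ratio;
    private final long threshold;

    TraceSampler(double ratio) {
        if (!(ratio >= 0.0 && ratio <= 1.0)) {
            throw new IllegalArgumentException("ratio must be between 0.0 and 1.0");
        }
        this.ratio = ratio;
        this.threshold = (long) (ratio * Long.MAX_VALUE);
    }

    /**
     * Fixed-rate decision made when the trace starts. The low 64 bits of the
     * trace ID are mapped to a non-negative number and compared to a threshold,
     * so the same trace ID always yields the same answer.
     */
    boolean sampleAtStart(String traceIdHex) {
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        String low64 = traceIdHex.substring(traceIdHex.length() - 16);
        long value = Long.parseUnsignedLong(low64, 16) >>> 1;
        return value < threshold;
    }

    /**
     * Error-priority decision made when the request finishes: failed requests
     * are always kept, the rest fall back to the fixed-rate rule.
     */
    boolean keepAtEnd(String traceIdHex, boolean hadError) {
        return hadError || sampleAtStart(traceIdHex);
    }
}
```

Note that the error-priority rule can only be applied once the outcome is known, which means spans have to be buffered until the request completes before they are kept or discarded.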

4. Data Reporting

  • Asynchronous reporting to avoid affecting business performance
  • Support batch reporting to reduce network overhead
  • Support multiple transport protocols (HTTP, gRPC, Kafka)
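The reporting behavior described above can be sketched with a bounded queue and a background flush task. `BatchSpanReporter` is a hypothetical class: spans are represented as already-encoded strings, and the `sender` callback stands in for whatever HTTP, gRPC or Kafka client actually ships the batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical asynchronous, batching span reporter. */
final class BatchSpanReporter implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final int maxBatchSize;
    private final Consumer<List<String>> sender;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "span-reporter");
                t.setDaemon(true); // never keep the JVM alive just for tracing
                return t;
            });

    BatchSpanReporter(int capacity, int maxBatchSize, long flushIntervalMillis,
                      Consumer<List<String>> sender) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.maxBatchSize = maxBatchSize;
        this.sender = sender;
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                flush();
            } catch (RuntimeException e) {
                // A real reporter would count or log the failure; swallowing it
                // here keeps the periodic task from being cancelled.
            }
        }, flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Called on the request thread: never blocks, drops the span if the queue is full. */
    boolean report(String encodedSpan) {
        return queue.offer(encodedSpan);
    }

    /** Sends everything currently queued, in batches of at most maxBatchSize. */
    synchronized void flush() {
        List<String> batch = new ArrayList<>(maxBatchSize);
        while (queue.drainTo(batch, maxBatchSize) > 0) {
            sender.accept(List.copyOf(batch));
            batch.clear();
        }
    }

    @Override
    public void close() {
        scheduler.shutdown();
        flush(); // push out whatever is still queued
    }
}
```

Dropping spans when the queue is full is a deliberate choice in this sketch: losing some trace data is generally preferable to slowing down or blocking business requests.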

Spring Cloud Sleuth Integration Example:

java
// Zipkin server
@SpringBootApplication
@EnableZipkinServer
public class ZipkinServerApplication {
    public static void main(String[] args) {
        SpringApplication.run(ZipkinServerApplication.class, args);
    }
}

yaml
# Client configuration (application.yml)
spring:
  zipkin:
    base-url: http://localhost:9411
  sleuth:
    sampler:
      probability: 0.1  # 10% sampling rate

Use Cases:

1. Performance Analysis

  • Identify slow queries and slow services
  • Analyze performance bottlenecks in call chains
  • Optimize system performance

2. Troubleshooting

  • Quickly locate problematic services
  • Track error propagation paths
  • Analyze root causes of failures

3. Dependency Analysis

  • Understand service dependencies
  • Identify unreasonable calls
  • Optimize service architecture

4. Capacity Planning

  • Analyze system load distribution
  • Predict resource requirements
  • Optimize resource allocation

Best Practices:

  • Choose a sampling rate that balances tracing overhead against observability
  • Combine traces with logs and metrics to build a complete observability stack
  • Review trace data regularly to find optimization opportunities
  • Propagate a single Trace ID across all systems so a request can be followed end to end
  • Protect sensitive information: keep secrets and personal data out of span tags and baggage
Tags: RPC