# Kafka Message Loss: Causes and Solutions
Kafka is designed with multiple mechanisms to prevent message loss, but in practical applications, message loss can still occur. Understanding these causes and solutions is crucial for building reliable systems.
## Common Causes of Message Loss
### 1. Producer-Side Loss
- Network Issues: the connection drops while a message is in flight
- Asynchronous Sending: with fire-and-forget async sends, `send()` returns before the broker has acknowledged the message
- No Retry Mechanism: a failed send is never retried
- Buffer Overflow: message accumulation overflows the producer buffer and pending messages are dropped
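The first three causes can be sketched in a few lines. This is an illustrative simulation, not a Kafka client API: `send_with_retry` and `flaky_transport` are hypothetical stand-ins for the producer's send path and a flaky network.

```python
def send_with_retry(message, transport, retries=3):
    """Try a send up to retries+1 times; return True on success.

    `transport` stands in for the network call to the broker:
    it returns True on success, False on a transient failure.
    """
    for _ in range(retries + 1):
        if transport(message):
            return True
    return False

# A transport that fails its first two calls, then succeeds --
# mimicking a transient network interruption.
calls = {"n": 0}
def flaky_transport(message):
    calls["n"] += 1
    return calls["n"] > 2

# Fire-and-forget (no retry): the transient failure loses the message.
calls["n"] = 0
assert send_with_retry("order-1", flaky_transport, retries=0) is False

# With retries=3 the transient failure is absorbed and the send succeeds.
calls["n"] = 0
assert send_with_retry("order-1", flaky_transport, retries=3) is True
```

The same logic is what the real producer's `retries` setting buys you: a send that would otherwise fail once and vanish gets another chance.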
### 2. Broker-Side Loss
- Not Flushed to Disk: Message written to memory but not flushed to disk before crash
- Insufficient Replicas: Replication factor set to 1, message lost when Broker crashes
- Replica Sync Delay: Leader receives message but crashes before syncing to Followers
- Disk Failure: Physical disk damage causes data loss
### 3. Consumer-Side Loss
- Auto Commit Offset: Offset committed before message processing completes
- Processing Failure: Message processing fails but Offset already committed
- Abnormal Exit: the Consumer crashes mid-processing; if offsets were already committed, the in-flight messages are skipped on restart (lost), and if not, they are re-consumed (duplicated)
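The interaction between commit order and crashes can be made concrete with a small simulation (no real consumer involved; `consume` is a hypothetical model of one poll loop that crashes at a given message index):

```python
def consume(messages, committed_offset, commit_first, crash_at=None):
    """Simulate a consumer that crashes while handling index `crash_at`.

    Returns (processed, committed_offset). With commit_first=True the
    offset is advanced before processing -- the auto-commit hazard.
    """
    processed = []
    for i in range(committed_offset, len(messages)):
        if commit_first:
            committed_offset = i + 1           # offset committed up front
        if i == crash_at:
            return processed, committed_offset  # consumer dies here
        processed.append(messages[i])
        if not commit_first:
            committed_offset = i + 1           # commit only after processing
    return processed, committed_offset

msgs = ["m0", "m1", "m2"]

# Commit-before-processing: crash at m1 -> the offset already points past
# it, so m1 is never processed after restart (message lost).
p1, off = consume(msgs, 0, commit_first=True, crash_at=1)
p2, off = consume(msgs, off, commit_first=True)
assert p1 + p2 == ["m0", "m2"]

# Commit-after-processing: crash at m1 -> m1 is redelivered on restart.
# No loss, at the cost of possible duplicates (at-least-once).
p1, off = consume(msgs, 0, commit_first=False, crash_at=1)
p2, off = consume(msgs, off, commit_first=False)
assert p1 + p2 == ["m0", "m1", "m2"]
```

This is why the consumer-side fix below disables auto commit: committing after processing trades potential duplicates for guaranteed delivery, and duplicates can then be handled with idempotent processing.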
## Solutions
### Producer-Side Configuration
```properties
# Set retry count
retries=3
# Set acknowledgment level: Leader and all Followers in the ISR must acknowledge
acks=all
# Enable idempotence
enable.idempotence=true
# Set buffer size (32 MB)
buffer.memory=33554432
# Set batch size (16 KB)
batch.size=16384
```
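These settings are interdependent: per the Kafka documentation, enabling idempotence requires `acks=all`, a positive `retries`, and at most 5 in-flight requests per connection. A small validation sketch makes the rules explicit (the helper is illustrative, not part of any Kafka client):

```python
def validate_producer_config(cfg):
    """Check the consistency rules Kafka enforces when idempotence is on."""
    errors = []
    if cfg.get("enable.idempotence"):
        if cfg.get("acks") != "all":
            errors.append("enable.idempotence=true requires acks=all")
        if cfg.get("retries", 0) <= 0:
            errors.append("enable.idempotence=true requires retries > 0")
        if cfg.get("max.in.flight.requests.per.connection", 5) > 5:
            errors.append("idempotence requires <= 5 in-flight requests")
    return errors

good = {"enable.idempotence": True, "acks": "all", "retries": 3}
bad = {"enable.idempotence": True, "acks": "1", "retries": 0}
assert validate_producer_config(good) == []
assert len(validate_producer_config(bad)) == 2
```

A real client rejects such inconsistent configurations at startup; the point is that `enable.idempotence=true` only delivers its no-duplicates guarantee when the other settings cooperate.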
### Broker-Side Configuration
```properties
# Set default replication factor
default.replication.factor=3
# Set minimum in-sync replicas
min.insync.replicas=2
# Set flush policy
log.flush.interval.messages=10000
log.flush.interval.ms=1000
# Max time a Follower may lag behind the Leader before being removed from the ISR
replica.lag.time.max.ms=30000
```
### Consumer-Side Configuration
```properties
# Disable auto commit
enable.auto.commit=false
# Set a reasonable session timeout
session.timeout.ms=30000
```

With auto commit disabled, commit the offset manually only after message processing completes, e.g. via `consumer.commitSync()`.
## Best Practices
- **Set the acks parameter appropriately**
  - `acks=0`: no acknowledgment; highest performance, but messages may be lost
  - `acks=1`: wait for the Leader's acknowledgment; balances performance and reliability
  - `acks=all`: wait for all ISR replicas to acknowledge; most reliable, lowest performance
- **Use transactions**
  - Enable Producer transaction support
  - Ensure a batch of messages either all succeeds or all fails
- **Monitoring and alerting**
  - Monitor message backlog
  - Monitor Consumer Lag
  - Set up reasonable alert mechanisms
- **Regular backups**
  - Back up Kafka data regularly
  - Establish disaster recovery plans
- **Testing and verification**
  - Run fault-injection tests
  - Verify that the message-loss prevention mechanisms actually work
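The Consumer Lag mentioned above is a simple per-partition difference: messages appended to the log minus messages the group has committed. A sketch of the calculation, using hypothetical offset snapshots (a real deployment reads these from the broker, e.g. via `kafka-consumer-groups.sh` or JMX metrics):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages appended but not yet committed.

    Both arguments map partition -> offset; a partition with no
    committed offset is treated as starting from 0.
    """
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

end = {0: 1500, 1: 980, 2: 2000}
committed = {0: 1500, 1: 900, 2: 1700}
lag = consumer_lag(end, committed)
assert lag == {0: 0, 1: 80, 2: 300}
assert sum(lag.values()) == 380  # total backlog worth alerting on
```

A lag that grows without bound means consumers cannot keep up, and messages risk being deleted by retention before they are ever processed, which is a message-loss scenario monitoring exists to catch.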
## Trade-off Between Performance and Reliability
- High reliability configurations reduce performance
- Need to choose appropriate configurations based on business scenarios
- For critical business data, prioritize reliability
- For non-critical data, some reliability can be traded for performance
Through reasonable configuration and monitoring, Kafka message loss can be effectively avoided in most scenarios.