Kafka Consumer Group Rebalance Mechanism
Consumer Group Rebalance is an important mechanism in Kafka used to reallocate Partitions when Consumer Group members change. Understanding the Rebalance mechanism is crucial for ensuring Kafka consumption stability and high availability.
Rebalance Trigger Conditions
1. Consumer Member Changes
- New Consumer Joins: New Consumer instance joins the Consumer Group
- Consumer Leaves: Consumer instance exits normally or abnormally
- Consumer Failure: Consumer crashes or network interruption
- Consumer Timeout: Consumer fails to send heartbeat for more than session.timeout.ms
2. Partition Count Changes
- Topic Partition Increases: Topic's Partition count increases
- Topic Partition Decreases: Topic's Partition count decreases
- Topic Deletion: Topic is deleted
3. Subscription Changes
- Consumer Subscribes to New Topic: Consumer starts subscribing to a new Topic
- Consumer Unsubscribes: Consumer unsubscribes from a Topic
Rebalance Process
1. Rebalance Triggered
- After trigger conditions are met, Controller detects that Rebalance is needed
- Controller notifies Group Coordinator to start Rebalance
2. Join Group Phase
- All Consumers send JoinGroup requests to Group Coordinator
- Group Coordinator selects one Consumer as Leader
- Leader Consumer is responsible for formulating the Partition allocation plan
3. Sync Group Phase
- Leader Consumer sends the allocation plan to Group Coordinator
- Group Coordinator sends the allocation plan to all Consumers
- Consumers receive the allocation plan and start consuming
4. Completion Phase
- Consumers start consuming assigned Partitions
- Rebalance process completes
Rebalance Strategies
Range Strategy (Default)
Principle: Allocate Partitions in order of Partition and Consumer order
Allocation Rules:
- Sort Partitions in numerical order
- Sort Consumers by name
- Each Consumer gets a continuous range of Partitions
Example:
shellTopic: test-topic, Partitions: 0,1,2,3,4,5 Consumers: C1, C2, C3 Allocation Result: C1: 0,1 C2: 2,3 C3: 4,5
Characteristics:
- Relatively even allocation
- May cause uneven allocation (when Partition count is not divisible by Consumer count)
RoundRobin Strategy
Principle: Round-robin allocation of Partitions
Allocation Rules:
- Merge Partitions from all Topics
- Allocate to Consumers in round-robin fashion
Example:
shellTopic1: 0,1,2 Topic2: 0,1 Consumers: C1, C2 Allocation Result: C1: Topic1-0, Topic1-2, Topic2-1 C2: Topic1-1, Topic2-0
Characteristics:
- More even allocation
- Suitable for multiple Topics
Sticky Strategy
Principle: Maintain original allocation as much as possible while ensuring even allocation
Characteristics:
- Reduce Partition movement between Consumers
- Reduce Rebalance impact
- Improve consumption continuity
CooperativeSticky Strategy
Principle: Incremental Rebalance, only reallocate affected Partitions
Characteristics:
- Reduce Stop-the-world time
- Improve system availability
- Suitable for scenarios with high continuity requirements
Rebalance Configuration
Key Parameters
properties# Session timeout session.timeout.ms=30000 # Heartbeat interval heartbeat.interval.ms=3000 # Maximum poll interval max.poll.interval.ms=300000 # Rebalance timeout max.poll.records=500
Parameter Description
- session.timeout.ms: Consumer is considered invalid if it doesn't send heartbeat for this long
- heartbeat.interval.ms: Interval at which Consumer sends heartbeats
- max.poll.interval.ms: Maximum interval between Consumer polls
- max.poll.records: Maximum number of messages returned per poll
Rebalance Problems and Solutions
1. Frequent Rebalance Triggers
Causes:
- Consumers frequently go online/offline
- Network instability causes heartbeat timeout
- Message processing takes too long
Solutions:
properties# Increase session timeout session.timeout.ms=60000 # Increase heartbeat interval heartbeat.interval.ms=5000 # Increase poll interval max.poll.interval.ms=600000
2. Long Rebalance Time
Causes:
- Too many Consumers
- Too many Partitions
- High network latency
Solutions:
- Use CooperativeSticky strategy
- Reduce Consumer count
- Optimize network configuration
3. Rebalance Causes Duplicate Message Consumption
Causes:
- Consumers may consume duplicate messages during Rebalance
- Offset not committed in time
Solutions:
properties# Disable auto commit enable.auto.commit=false # Manual commit Offset consumer.commitSync();
4. Rebalance Causes Consumption Interruption
Causes:
- Consumers stop consuming during Rebalance
- Stop-the-world time too long
Solutions:
- Use CooperativeSticky strategy
- Reduce Rebalance trigger frequency
- Optimize Rebalance configuration
Best Practices
1. Reasonably Configure Parameters
properties# Recommended configuration session.timeout.ms=30000 heartbeat.interval.ms=3000 max.poll.interval.ms=300000 max.poll.records=500
2. Choose Appropriate Rebalance Strategy
- General Scenarios: Use Range or RoundRobin
- Need Continuity: Use Sticky
- High Availability Requirements: Use CooperativeSticky
3. Monitor Rebalance
bash# View Consumer Group status kafka-consumer-groups --bootstrap-server localhost:9092 \ --describe --group my-group # View Rebalance logs tail -f /path/to/kafka/logs/server.log | grep Rebalance
4. Optimize Consumption Logic
- Avoid long processing time for single messages
- Use asynchronous processing to improve efficiency
- Reasonably set batch processing size
5. Prevent Rebalance
- Keep Consumers running stably
- Avoid frequent Consumer start/stop
- Monitor Consumer health status
Rebalance Monitoring Metrics
- RebalanceRatePerSec: Rebalance times per second
- RebalanceTotal: Total Rebalance times
- FailedRebalanceRate: Failed Rebalance ratio
- SuccessfulRebalanceRate: Successful Rebalance ratio
Through reasonable configuration and optimization of the Rebalance mechanism, the impact of Rebalance on the system can be effectively reduced, improving Kafka consumption stability and availability.