What is Kafka's Consumer Group Rebalance mechanism?

February 21, 17:00

Kafka Consumer Group Rebalance Mechanism

Consumer Group Rebalance is an important mechanism in Kafka used to reallocate Partitions when Consumer Group members change. Understanding the Rebalance mechanism is crucial for ensuring Kafka consumption stability and high availability.

Rebalance Trigger Conditions

1. Consumer Member Changes

  • New Consumer Joins: New Consumer instance joins the Consumer Group
  • Consumer Leaves: Consumer instance exits normally or abnormally
  • Consumer Failure: Consumer crashes or network interruption
  • Consumer Timeout: Consumer fails to send heartbeat for more than session.timeout.ms

2. Partition Count Changes

  • Topic Partition Increases: Topic's Partition count increases
  • Topic Partition Decreases: Topic's Partition count decreases
  • Topic Deletion: Topic is deleted

3. Subscription Changes

  • Consumer Subscribes to New Topic: Consumer starts subscribing to a new Topic
  • Consumer Unsubscribes: Consumer unsubscribes from a Topic

Rebalance Process

1. Rebalance Triggered

  • After a trigger condition is met, the Group Coordinator (the broker managing this Consumer Group) determines that a Rebalance is needed
  • The Group Coordinator signals all Consumers, through heartbeat responses, that they must rejoin the group

2. Join Group Phase

  • All Consumers send JoinGroup requests to Group Coordinator
  • Group Coordinator selects one Consumer as Leader
  • Leader Consumer is responsible for formulating the Partition allocation plan

3. Sync Group Phase

  • Leader Consumer sends the allocation plan to Group Coordinator
  • Group Coordinator sends the allocation plan to all Consumers
  • Consumers receive the allocation plan and start consuming

4. Completion Phase

  • Consumers start consuming assigned Partitions
  • Rebalance process completes
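The JoinGroup/SyncGroup round above can be sketched in a few lines. This is a minimal Python simulation of the coordinator's bookkeeping, not the actual Kafka protocol; `simple_assignor` is a hypothetical stand-in for the leader's assignment strategy.

```python
def simple_assignor(partitions, members):
    """Hypothetical assignor the leader might run: deal partitions out in turn."""
    plan = {m: [] for m in members}
    for i, p in enumerate(sorted(partitions)):
        plan[members[i % len(members)]].append(p)
    return plan

def rebalance(consumers, partitions, assignor):
    """Sketch of one JoinGroup/SyncGroup round as seen by the coordinator:
    all members (re)join, one is picked as leader, the leader computes the
    plan, and the coordinator hands each member only its own partitions."""
    members = sorted(consumers)           # JoinGroup: everyone rejoins
    leader = members[0]                   # coordinator designates a leader
    plan = assignor(partitions, members)  # leader computes the assignment
    # SyncGroup: each member receives just its share of the plan
    return leader, {m: plan[m] for m in members}

leader, plan = rebalance(["C2", "C1"], [0, 1, 2], simple_assignor)
print(leader, plan)  # C1 {'C1': [0, 2], 'C2': [1]}
```

In the real protocol the leader is simply one group member chosen by the coordinator; the coordinator itself never inspects the plan, which is why custom assignors can run purely client-side.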

Rebalance Strategies

Range Strategy (Default)

Principle: For each Topic, assign Partitions to Consumers in sorted order, giving each Consumer a contiguous range

Allocation Rules:

  • Sort Partitions in numerical order
  • Sort Consumers by name
  • Each Consumer gets a continuous range of Partitions

Example:

```shell
Topic: test-topic, Partitions: 0,1,2,3,4,5
Consumers: C1, C2, C3

Allocation Result:
C1: 0,1
C2: 2,3
C3: 4,5
```

Characteristics:

  • Relatively even allocation
  • May cause uneven allocation (when Partition count is not divisible by Consumer count)
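The Range allocation rules can be reproduced in a short Python sketch. This mimics the RangeAssignor's per-topic logic for illustration; it is not the Kafka client code.

```python
def range_assign(partitions, consumers):
    """Give each consumer a contiguous range of a topic's partitions,
    mimicking Kafka's RangeAssignor for a single topic."""
    consumers = sorted(consumers)
    partitions = sorted(partitions)
    base, extra = divmod(len(partitions), len(consumers))
    result, start = {}, 0
    for i, c in enumerate(consumers):
        # the first `extra` consumers each absorb one leftover partition
        count = base + (1 if i < extra else 0)
        result[c] = partitions[start:start + count]
        start += count
    return result

print(range_assign([0, 1, 2, 3, 4, 5], ["C1", "C2", "C3"]))
# {'C1': [0, 1], 'C2': [2, 3], 'C3': [4, 5]}
print(range_assign([0, 1, 2, 3, 4, 5, 6], ["C1", "C2", "C3"]))
# uneven: C1 gets 3 partitions, C2 and C3 get 2 each
```

The second call shows the skew mentioned above: when the Partition count is not divisible by the Consumer count, the first Consumers in sort order carry the extra load, and the skew compounds across many Topics.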

RoundRobin Strategy

Principle: Round-robin allocation of Partitions

Allocation Rules:

  • Merge Partitions from all Topics
  • Allocate to Consumers in round-robin fashion

Example:

```shell
Topic1: 0,1,2  Topic2: 0,1
Consumers: C1, C2

Allocation Result:
C1: Topic1-0, Topic1-2, Topic2-1
C2: Topic1-1, Topic2-0
```

Characteristics:

  • More even allocation
  • Suitable for multiple Topics
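The RoundRobin rules are also easy to sketch; the snippet below reproduces the example above. Again, this is an illustrative simulation, not the RoundRobinAssignor itself.

```python
from itertools import cycle

def round_robin_assign(topic_partitions, consumers):
    """Merge partitions from all topics, sort them, and deal them out
    to consumers in turn, mimicking Kafka's RoundRobinAssignor."""
    result = {c: [] for c in sorted(consumers)}
    turn = cycle(sorted(consumers))
    for tp in sorted(topic_partitions):
        result[next(turn)].append(tp)
    return result

tps = ["Topic1-0", "Topic1-1", "Topic1-2", "Topic2-0", "Topic2-1"]
print(round_robin_assign(tps, ["C1", "C2"]))
# {'C1': ['Topic1-0', 'Topic1-2', 'Topic2-1'], 'C2': ['Topic1-1', 'Topic2-0']}
```

Because partitions from all Topics are merged before dealing, the per-Consumer load difference is at most one partition, which is why this strategy spreads multi-Topic workloads more evenly than Range.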

Sticky Strategy

Principle: Maintain original allocation as much as possible while ensuring even allocation

Characteristics:

  • Reduce Partition movement between Consumers
  • Reduce Rebalance impact
  • Improve consumption continuity
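The "keep the original allocation" idea can be shown with a simplified sketch: partitions stay with their previous owner when possible, and only orphaned partitions (whose owner left) are redistributed. The real StickyAssignor also rebalances load more aggressively; this sketch only captures the stickiness.

```python
def sticky_reassign(partitions, previous, live_consumers):
    """Simplified sticky reassignment: surviving consumers keep their
    previous partitions; only orphaned partitions are redistributed
    to the least-loaded consumer."""
    result = {c: [p for p in previous.get(c, []) if p in partitions]
              for c in live_consumers}
    assigned = {p for ps in result.values() for p in ps}
    orphans = [p for p in sorted(partitions) if p not in assigned]
    for p in orphans:
        # hand each orphan to whichever consumer currently has the fewest
        target = min(result, key=lambda c: len(result[c]))
        result[target].append(p)
    return result

previous = {"C1": [0, 1], "C2": [2, 3], "C3": [4, 5]}
# C3 leaves the group: only partitions 4 and 5 move
print(sticky_reassign([0, 1, 2, 3, 4, 5], previous, ["C1", "C2"]))
# {'C1': [0, 1, 4], 'C2': [2, 3, 5]}
```

Contrast this with Range or RoundRobin, where C3 leaving could reshuffle every partition, invalidating per-partition caches and in-flight batches on C1 and C2 as well.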

CooperativeSticky Strategy

Principle: Incremental Rebalance, only reallocate affected Partitions

Characteristics:

  • Reduce Stop-the-world time
  • Improve system availability
  • Suitable for scenarios with high continuity requirements

Rebalance Configuration

Key Parameters

```properties
# Session timeout
session.timeout.ms=30000
# Heartbeat interval
heartbeat.interval.ms=3000
# Maximum poll interval
max.poll.interval.ms=300000
# Maximum records per poll
max.poll.records=500
```

Parameter Description

  • session.timeout.ms: If the Coordinator receives no heartbeat from a Consumer within this window, the Consumer is considered dead and a Rebalance is triggered
  • heartbeat.interval.ms: Interval at which Consumer sends heartbeats
  • max.poll.interval.ms: Maximum interval between Consumer polls
  • max.poll.records: Maximum number of messages returned per poll
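These parameters constrain each other: Kafka's documentation recommends keeping heartbeat.interval.ms at no more than one third of session.timeout.ms, so a couple of missed heartbeats do not immediately evict the Consumer. A small sketch of such a sanity check (the function and the second rule's exact form are illustrative, not a Kafka API):

```python
def check_consumer_timeouts(cfg):
    """Sanity-check the recommended relationships between consumer timeouts."""
    problems = []
    hb = cfg["heartbeat.interval.ms"]
    session = cfg["session.timeout.ms"]
    if hb * 3 > session:
        problems.append("heartbeat.interval.ms should be <= session.timeout.ms / 3")
    if cfg["max.poll.interval.ms"] <= session:
        # processing time between polls is normally allowed to exceed the session timeout
        problems.append("max.poll.interval.ms is usually set well above session.timeout.ms")
    return problems

print(check_consumer_timeouts({
    "session.timeout.ms": 30000,
    "heartbeat.interval.ms": 3000,
    "max.poll.interval.ms": 300000,
}))  # [] -> the values from the example config above pass
```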

Rebalance Problems and Solutions

1. Frequent Rebalance Triggers

Causes:

  • Consumers frequently go online/offline
  • Network instability causes heartbeat timeout
  • Message processing takes too long

Solutions:

```properties
# Increase session timeout
session.timeout.ms=60000
# Increase heartbeat interval
heartbeat.interval.ms=5000
# Increase poll interval
max.poll.interval.ms=600000
```

2. Long Rebalance Time

Causes:

  • Too many Consumers
  • Too many Partitions
  • High network latency

Solutions:

  • Use CooperativeSticky strategy
  • Reduce Consumer count
  • Optimize network configuration

3. Rebalance Causes Duplicate Message Consumption

Causes:

  • Offsets for already-processed messages are not committed before Partitions are revoked
  • After the Rebalance, the new owner resumes from the last committed Offset and re-reads those messages

Solutions:

```properties
# Disable auto commit
enable.auto.commit=false
```

Then commit Offsets manually in the consumer code after processing, e.g. `consumer.commitSync();` in the Java client.
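Why late commits cause duplicates is easy to see in a pure-Python simulation of Offset semantics (no Kafka involved; `consume` is an illustrative stand-in for a poll-process-commit loop):

```python
def consume(messages, start_offset, commit_every, crash_after=None):
    """Process messages from start_offset, committing the offset every
    `commit_every` records. If the consumer 'crashes' (e.g. its partition
    is rebalanced away) before the next commit, return what was processed
    and the last committed offset."""
    processed, committed = [], start_offset
    for i, msg in enumerate(messages[start_offset:], start=start_offset):
        processed.append(msg)
        if (i + 1 - start_offset) % commit_every == 0:
            committed = i + 1  # commit up to and including this message
        if crash_after is not None and len(processed) == crash_after:
            return processed, committed  # crashed before the next commit
    return processed, len(messages)

msgs = list("abcdefgh")
# first consumer processes 5 messages but only commits after every 3rd
first, committed = consume(msgs, 0, commit_every=3, crash_after=5)
# after the Rebalance, the new owner resumes from the committed offset
second, _ = consume(msgs, committed, commit_every=3)
print(first, committed, second)
# 'd' and 'e' are processed twice: this is at-least-once delivery
```

Committing after each processed record shrinks the replay window to at most one message, at the cost of more commit traffic; that trade-off is exactly what manual `commitSync()` lets you control.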

4. Rebalance Causes Consumption Interruption

Causes:

  • Consumers stop consuming during Rebalance
  • Stop-the-world time too long

Solutions:

  • Use CooperativeSticky strategy
  • Reduce Rebalance trigger frequency
  • Optimize Rebalance configuration

Best Practices

1. Reasonably Configure Parameters

```properties
# Recommended configuration
session.timeout.ms=30000
heartbeat.interval.ms=3000
max.poll.interval.ms=300000
max.poll.records=500
```

2. Choose Appropriate Rebalance Strategy

  • General Scenarios: Use Range or RoundRobin
  • Need Continuity: Use Sticky
  • High Availability Requirements: Use CooperativeSticky
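In the Java client the strategy is selected with the partition.assignment.strategy property, using the assignor class names shipped with the client:

```properties
# Range (default):  org.apache.kafka.clients.consumer.RangeAssignor
# Round robin:      org.apache.kafka.clients.consumer.RoundRobinAssignor
# Sticky:           org.apache.kafka.clients.consumer.StickyAssignor
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```

Note that when migrating a running group to CooperativeStickyAssignor, all members must support the new protocol; consult the upgrade notes of your Kafka client version before switching.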

3. Monitor Rebalance

```bash
# View Consumer Group status
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group my-group

# View Rebalance logs
tail -f /path/to/kafka/logs/server.log | grep Rebalance
```

4. Optimize Consumption Logic

  • Avoid long processing time for single messages
  • Use asynchronous processing to improve efficiency
  • Reasonably set batch processing size

5. Prevent Rebalance

  • Keep Consumers running stably
  • Avoid frequent Consumer start/stop
  • Monitor Consumer health status

Rebalance Monitoring Metrics

The Java client's consumer coordinator exposes Rebalance metrics, for example:

  • rebalance-rate-per-hour: Rebalances per hour
  • rebalance-total: Total number of Rebalances
  • failed-rebalance-rate-per-hour: Failed Rebalances per hour
  • rebalance-latency-avg / rebalance-latency-max: Average / maximum Rebalance duration

Through reasonable configuration and optimization of the Rebalance mechanism, the impact of Rebalance on the system can be effectively reduced, improving Kafka consumption stability and availability.

Tags: Kafka