# Kafka Replication Mechanism

Kafka's replication mechanism is the core of its high availability and fault tolerance. Through replication, Kafka can keep serving traffic and avoid losing data when individual nodes fail.
## Replica Basic Concepts

### Replica Roles

- **Leader Replica**
  - Handles all read and write requests for the partition
  - Each Partition has exactly one Leader
  - The Broker hosting the Leader serves all Producer and Consumer requests for that partition
- **Follower Replica**
  - Replicates data from the Leader
  - Does not serve client requests
  - Can be promoted to the new Leader on failover
- **ISR (In-Sync Replicas)**
  - The set of replicas considered in sync with the Leader (including the Leader itself)
  - A replica stays in the ISR as long as it has kept up with the Leader within `replica.lag.time.max.ms`; it is caught up, not necessarily byte-for-byte identical at every instant
  - Only replicas in the ISR are eligible to be elected as the new Leader (unless unclean election is enabled)
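The three roles can be made concrete with a small sketch. `PartitionState` and its fields are illustrative names (not Kafka's API), assuming integer broker ids:

```python
from dataclasses import dataclass, field

# Hypothetical minimal model of one partition's replica state,
# just to make the Leader / Follower / ISR vocabulary concrete.
@dataclass
class PartitionState:
    leader: int                                        # broker id of the leader replica
    replicas: list[int] = field(default_factory=list)  # AR: all assigned replicas
    isr: set[int] = field(default_factory=set)         # in-sync subset of AR

    def followers(self) -> list[int]:
        # every assigned replica that is not currently the leader
        return [b for b in self.replicas if b != self.leader]

p = PartitionState(leader=1, replicas=[1, 2, 3], isr={1, 2, 3})
print(p.followers())  # [2, 3]
```

Note that the ISR always contains the Leader itself, and is a subset of AR.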
## Replica Synchronization Mechanism

### Synchronization Process

1. **Producer sends the message**
   - The Producer sends the message to the partition Leader
   - The Leader appends the message to its local log
2. **Followers replicate from the Leader**
   - Followers in the ISR pull (fetch) new messages from the Leader; the Leader does not push
   - Each Follower appends the fetched messages to its local log
   - A Follower's next fetch request implicitly acknowledges what it has already replicated
3. **Acknowledgment**
   - With `acks=all`, the Leader responds to the Producer only after every replica in the ISR has replicated the message
   - The `acks` parameter (`0`, `1`, or `all`) determines how many acknowledgments the Producer waits for
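The acknowledgment rule can be sketched as a pure function of the `acks` setting and which ISR members have replicated the message. `can_ack_produce` is a hypothetical name; real Kafka evaluates this inside the Leader's request handling, and followers acknowledge implicitly via their fetch offsets:

```python
# Sketch of the acknowledgment rule, under simplifying assumptions
# (acknowledgments arrive here as a set of broker ids).
def can_ack_produce(acks: str, isr_acked: set[int], isr: set[int], leader: int) -> bool:
    """Return True once the produce request may be acknowledged."""
    if acks == "0":            # fire-and-forget: no broker ack awaited
        return True
    if acks == "1":            # leader has written to its local log
        return leader in isr_acked
    if acks in ("all", "-1"):  # every replica currently in the ISR has written
        return isr <= isr_acked
    raise ValueError(f"unknown acks setting: {acks}")

isr = {1, 2, 3}
print(can_ack_produce("all", {1, 2}, isr, leader=1))     # False: broker 3 pending
print(can_ack_produce("all", {1, 2, 3}, isr, leader=1))  # True
```

A key subtlety visible here: `acks=all` waits for the *current* ISR, not for all assigned replicas, so the guarantee weakens if the ISR has shrunk.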
### Synchronization Configuration

```properties
# Default replication factor
default.replication.factor=3
# Minimum in-sync replicas
min.insync.replicas=2
# Maximum time a follower may lag before leaving the ISR
replica.lag.time.max.ms=30000
# Maximum lag in messages (removed in Kafka 0.9; lag is now time-based only)
# replica.lag.max.messages=4000
```
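The time-based lag rule configured above can be sketched as follows (simplified: the real Leader tracks each follower's last fully-caught-up fetch, and ISR changes are persisted through the Controller):

```python
def shrink_isr(isr: set[int], last_caught_up_ms: dict[int, int],
               now_ms: int, max_lag_ms: int = 30_000) -> set[int]:
    """Drop replicas whose last fully-caught-up fetch is older than
    replica.lag.time.max.ms (hypothetical helper, not Kafka's API)."""
    return {b for b in isr
            if now_ms - last_caught_up_ms.get(b, 0) <= max_lag_ms}

isr = {1, 2, 3}
caught_up = {1: 100_000, 2: 95_000, 3: 40_000}  # broker 3 stalled 60s ago
print(shrink_isr(isr, caught_up, now_ms=100_000))  # {1, 2}
```

Broker 3 is evicted because its lag (60 s) exceeds the 30 s default; it rejoins once it catches back up.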
## Leader Election Mechanism

### Election Trigger Conditions

- **Leader failure**
  - The Broker hosting the Leader crashes
  - The Leader is cut off by a network partition
- **Controller failure**
  - The Controller is responsible for managing cluster state
  - When the Controller fails, a new Controller is elected first, which then handles any pending Leader elections
### Election Process

1. **Detect the failure**
   - The Controller learns of the Broker failure through its ZooKeeper watch (or via the Raft quorum in KRaft mode)
2. **Select the new Leader**
   - Prefer replicas in the ISR: the new Leader is the first replica in the AR (Assigned Replicas) list that is alive and in the ISR
   - If the ISR is empty, a Leader may be chosen from the remaining replicas in AR only when `unclean.leader.election.enable=true`, at the risk of data loss
3. **Propagate metadata**
   - The Controller persists the new state (in ZooKeeper, or in the metadata log in KRaft mode)
   - All Brokers are notified of the new Leader
### Election Strategy

- **AR (Assigned Replicas)**: all replicas assigned to the partition
- **ISR (In-Sync Replicas)**: replicas in sync with the Leader
- **OSR (Out-of-Sync Replicas)**: replicas that have fallen behind the Leader (AR = ISR + OSR)
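Under these definitions, the election preference can be sketched as follows (`elect_leader` is a hypothetical helper; the real Controller logic also handles leader-epoch bookkeeping and fencing):

```python
def elect_leader(ar: list[int], isr: set[int], live: set[int],
                 unclean_ok: bool = False):
    """Pick the first live replica in AR order that is also in the ISR.

    Falling back to an out-of-sync replica is allowed only when
    unclean.leader.election.enable is true, and may lose committed data.
    Returns a broker id, or None if the partition must go offline.
    """
    for b in ar:                       # AR order encodes the preference
        if b in live and b in isr:
            return b
    if unclean_ok:                     # last resort: any live replica
        for b in ar:
            if b in live:
                return b
    return None

print(elect_leader([1, 2, 3], isr={2, 3}, live={2, 3}))  # 2
print(elect_leader([1, 2, 3], isr=set(), live={3}))      # None (no unclean election)
```

The first replica in AR is the "preferred leader"; keeping leadership there is what tools like preferred-leader election restore after failures.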
## Replica Management

### Replica Allocation

```properties
# Default replication factor for auto-created Topics
default.replication.factor=3
# Per-topic replication factor is set at creation time
# (e.g. kafka-topics --create --replication-factor 3), not as a broker property
```

Allocation principles:
- Replicas are spread evenly across Brokers
- Replicas of the same Partition never share a Broker
- With rack awareness enabled (`broker.rack`), replicas are spread across racks
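The even-spread principle can be illustrated with a naive round-robin assignment (`assign_replicas` is illustrative; Kafka's real algorithm also staggers follower positions with a random offset and honors `broker.rack`):

```python
def assign_replicas(brokers: list[int], partitions: int, rf: int) -> dict[int, list[int]]:
    """Round-robin layout: each partition's replicas land on distinct
    brokers, and leaders (first entry) rotate across the cluster."""
    n = len(brokers)
    assert rf <= n, "replication factor cannot exceed broker count"
    return {
        p: [brokers[(p + i) % n] for i in range(rf)]
        for p in range(partitions)
    }

layout = assign_replicas([0, 1, 2, 3], partitions=4, rf=3)
print(layout[0])  # [0, 1, 2]
print(layout[3])  # [3, 0, 1]
```

Every broker appears as a leader exactly once here, which is the load-balancing property the principles above are after.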
### Replica Decommissioning

- **Graceful decommissioning**
  - Use the kafka-reassign-partitions tool to move partitions off the Broker
  - Migrate leadership away first, then remove the replica
  - No data is lost, because the new replicas fully catch up before the old ones are dropped
- **Failure decommissioning**
  - Leader election is triggered automatically
  - A new Leader is chosen from the ISR
  - Kafka does not automatically recreate the lost replica on another Broker; the partition stays under-replicated until the failed Broker returns or the replica is reassigned manually
## Fault Tolerance Mechanism

### Failure Scenario Handling

- **Follower failure**
  - The Follower is removed from the ISR once it lags beyond `replica.lag.time.max.ms`
  - The Leader continues to serve requests
  - After recovery, the Follower catches up and rejoins the ISR
- **Leader failure**
  - Leader election is triggered
  - A new Leader is chosen from the ISR
  - Messages acknowledged with `acks=all` survive the failover, preserving consistency
- **Multiple replica failures**
  - If the ISR still holds at least `min.insync.replicas` replicas, the partition keeps serving
  - If the ISR shrinks below `min.insync.replicas`, writes with `acks=all` are rejected (NotEnoughReplicas), while reads continue
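The write-acceptance rule can be sketched as a small predicate (`accepts_writes` is a hypothetical name; the point is that the check applies only to producers using `acks=all`):

```python
def accepts_writes(isr_size: int, min_insync: int = 2, acks: str = "all") -> bool:
    """With acks=all, the leader rejects produce requests once the ISR
    shrinks below min.insync.replicas; acks=0/1 bypass the check
    (at the cost of weaker durability)."""
    if acks in ("all", "-1"):
        return isr_size >= min_insync
    return True

print(accepts_writes(isr_size=3))            # True
print(accepts_writes(isr_size=1))            # False: writes rejected
print(accepts_writes(isr_size=1, acks="1"))  # True, but only the leader has the data
```

This is why `min.insync.replicas=2` with replication factor 3 is a common pairing: the partition tolerates one failure without rejecting writes.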
## Performance Optimization

### Replica Count Selection

- Replication factor 1: no fault tolerance, best performance
- Replication factor 2: tolerates a single Broker failure, good performance
- Replication factor 3: recommended configuration; balances performance and reliability
- Replication factor > 3: higher reliability, but replication overhead grows
### Synchronization Optimization

```properties
# Evict slow followers from the ISR sooner (faster failover, but more ISR churn)
replica.lag.time.max.ms=10000
# Tune socket buffers for replication traffic
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
# More I/O threads for log reads and writes
num.io.threads=16
```
## Monitoring Metrics

### Replica Synchronization Metrics

- **UnderReplicatedPartitions**: number of partitions whose ISR is smaller than the assigned replica set
- **IsrShrinksPerSec**: rate of ISR shrinks
- **IsrExpandsPerSec**: rate of ISR expansions
- **OfflinePartitionsCount**: number of partitions without an active Leader
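For intuition, `UnderReplicatedPartitions` is essentially the following count over partition state (computed here from a hypothetical `{name: (ar_size, isr_size)}` snapshot; in practice you read the metric over JMX):

```python
def under_replicated(partitions: dict[str, tuple[int, int]]) -> int:
    """Count partitions whose ISR is smaller than the assigned replica
    set, i.e. what the UnderReplicatedPartitions metric reports."""
    return sum(1 for ar, isr in partitions.values() if isr < ar)

snapshot = {"orders-0": (3, 3), "orders-1": (3, 2), "logs-0": (3, 3)}
print(under_replicated(snapshot))  # 1
```

A sustained non-zero value here is usually the first visible symptom of a failed or lagging Broker.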
### Leader Election Metrics

- **LeaderElectionRateAndTimeMs**: rate and latency of Leader elections
- **ActiveControllerCount**: number of active Controllers (should be exactly 1 across the cluster)
## Best Practices

- **Set the replication factor sensibly**
  - Production environments should use at least 3 replicas
  - Adjust the factor according to how critical the data is
  - Weigh storage cost and performance impact
- **Monitor replica status**
  - Check ISR status regularly
  - Monitor replica sync lag
  - Handle replica anomalies promptly
- **Plan Broker distribution**
  - Place replicas on different physical machines
  - Account for rack and data-center layout
  - Avoid single points of failure
- **Test regularly**
  - Simulate Broker failures
  - Verify the fault-tolerance mechanisms
  - Measure recovery time
- **Have a backup strategy**
  - Back up Kafka data regularly
  - Establish disaster-recovery plans
  - Test the backup and restore procedures
With its replication mechanism configured and managed well, Kafka delivers strong data reliability without giving up good performance.