# Kafka Replication Mechanism

Kafka's replication mechanism is the core of its high availability and fault tolerance. Through replication, Kafka can keep serving traffic and avoid losing data when individual nodes fail.
## Replica Basic Concepts

### Replica Roles

- **Leader Replica**
  - Handles all read and write requests for the partition
  - Each Partition has exactly one Leader
  - The Broker hosting the Leader serves all Producer and Consumer requests for that partition
- **Follower Replica**
  - Replicates data from the Leader
  - Does not serve client requests
  - Can be promoted to the new Leader on failover
- **ISR (In-Sync Replicas)**
  - The set of replicas considered in sync with the Leader (including the Leader itself)
  - A replica stays in the ISR as long as it has kept up with the Leader within `replica.lag.time.max.ms`; it is caught up, not necessarily byte-for-byte identical at every instant
  - Only replicas in the ISR are eligible to be elected as the new Leader (unless unclean election is enabled)
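The three roles can be made concrete with a small sketch. `PartitionState` and its fields are illustrative names (not Kafka's API), assuming integer broker ids:

```python
from dataclasses import dataclass, field

# Hypothetical minimal model of one partition's replica state,
# just to make the Leader / Follower / ISR vocabulary concrete.
@dataclass
class PartitionState:
    leader: int                                        # broker id of the leader replica
    replicas: list[int] = field(default_factory=list)  # AR: all assigned replicas
    isr: set[int] = field(default_factory=set)         # in-sync subset of AR

    def followers(self) -> list[int]:
        # every assigned replica that is not currently the leader
        return [b for b in self.replicas if b != self.leader]

p = PartitionState(leader=1, replicas=[1, 2, 3], isr={1, 2, 3})
print(p.followers())  # [2, 3]
```

Note that the ISR always contains the Leader itself, and is a subset of AR.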
## Replica Synchronization Mechanism

### Synchronization Process

1. **Producer sends the message**
   - The Producer sends the message to the partition Leader
   - The Leader appends the message to its local log
2. **Followers replicate from the Leader**
   - Followers in the ISR pull (fetch) new messages from the Leader; the Leader does not push
   - Each Follower appends the fetched messages to its local log
   - A Follower's next fetch request implicitly acknowledges what it has already replicated
3. **Acknowledgment**
   - With `acks=all`, the Leader responds to the Producer only after every replica in the ISR has replicated the message
   - The `acks` parameter (`0`, `1`, or `all`) determines how many acknowledgments the Producer waits for
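The acknowledgment rule can be sketched as a pure function of the `acks` setting and which ISR members have replicated the message. `can_ack_produce` is a hypothetical name; real Kafka evaluates this inside the Leader's request handling, and followers acknowledge implicitly via their fetch offsets:

```python
# Sketch of the acknowledgment rule, under simplifying assumptions
# (acknowledgments arrive here as a set of broker ids).
def can_ack_produce(acks: str, isr_acked: set[int], isr: set[int], leader: int) -> bool:
    """Return True once the produce request may be acknowledged."""
    if acks == "0":            # fire-and-forget: no broker ack awaited
        return True
    if acks == "1":            # leader has written to its local log
        return leader in isr_acked
    if acks in ("all", "-1"):  # every replica currently in the ISR has written
        return isr <= isr_acked
    raise ValueError(f"unknown acks setting: {acks}")

isr = {1, 2, 3}
print(can_ack_produce("all", {1, 2}, isr, leader=1))     # False: broker 3 pending
print(can_ack_produce("all", {1, 2, 3}, isr, leader=1))  # True
```

A key subtlety visible here: `acks=all` waits for the *current* ISR, not for all assigned replicas, so the guarantee weakens if the ISR has shrunk.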
### Synchronization Configuration

```properties
# Default replication factor
default.replication.factor=3
# Minimum in-sync replicas
min.insync.replicas=2
# Maximum time a follower may lag before leaving the ISR
replica.lag.time.max.ms=30000
# Maximum lag in messages (removed in Kafka 0.9; lag is now time-based only)
# replica.lag.max.messages=4000
```
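The time-based lag rule configured above can be sketched as follows (simplified: the real Leader tracks each follower's last fully-caught-up fetch, and ISR changes are persisted through the Controller):

```python
def shrink_isr(isr: set[int], last_caught_up_ms: dict[int, int],
               now_ms: int, max_lag_ms: int = 30_000) -> set[int]:
    """Drop replicas whose last fully-caught-up fetch is older than
    replica.lag.time.max.ms (hypothetical helper, not Kafka's API)."""
    return {b for b in isr
            if now_ms - last_caught_up_ms.get(b, 0) <= max_lag_ms}

isr = {1, 2, 3}
caught_up = {1: 100_000, 2: 95_000, 3: 40_000}  # broker 3 stalled 60s ago
print(shrink_isr(isr, caught_up, now_ms=100_000))  # {1, 2}
```

Broker 3 is evicted because its lag (60 s) exceeds the 30 s default; it rejoins once it catches back up.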
## Leader Election Mechanism

### Election Trigger Conditions

- **Leader failure**
  - The Broker hosting the Leader crashes
  - The Leader is cut off by a network partition
- **Controller failure**
  - The Controller is responsible for managing cluster state
  - When the Controller fails, a new Controller is elected first, which then handles any pending Leader elections
### Election Process

1. **Detect the failure**
   - The Controller learns of the Broker failure through its ZooKeeper watch (or via the Raft quorum in KRaft mode)
2. **Select the new Leader**
   - Prefer replicas in the ISR: the new Leader is the first replica in the AR (Assigned Replicas) list that is alive and in the ISR
   - If the ISR is empty, a Leader may be chosen from the remaining replicas in AR only when `unclean.leader.election.enable=true`, at the risk of data loss
3. **Propagate metadata**
   - The Controller persists the new state (in ZooKeeper, or in the metadata log in KRaft mode)
   - All Brokers are notified of the new Leader
### Election Strategy

- **AR (Assigned Replicas)**: all replicas assigned to the partition
- **ISR (In-Sync Replicas)**: replicas in sync with the Leader
- **OSR (Out-of-Sync Replicas)**: replicas that have fallen behind the Leader (AR = ISR + OSR)
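Under these definitions, the election preference can be sketched as follows (`elect_leader` is a hypothetical helper; the real Controller logic also handles leader-epoch bookkeeping and fencing):

```python
def elect_leader(ar: list[int], isr: set[int], live: set[int],
                 unclean_ok: bool = False):
    """Pick the first live replica in AR order that is also in the ISR.

    Falling back to an out-of-sync replica is allowed only when
    unclean.leader.election.enable is true, and may lose committed data.
    Returns a broker id, or None if the partition must go offline.
    """
    for b in ar:                       # AR order encodes the preference
        if b in live and b in isr:
            return b
    if unclean_ok:                     # last resort: any live replica
        for b in ar:
            if b in live:
                return b
    return None

print(elect_leader([1, 2, 3], isr={2, 3}, live={2, 3}))  # 2
print(elect_leader([1, 2, 3], isr=set(), live={3}))      # None (no unclean election)
```

The first replica in AR is the "preferred leader"; keeping leadership there is what tools like preferred-leader election restore after failures.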
## Replica Management

### Replica Allocation

```properties
# Default replication factor for auto-created Topics
default.replication.factor=3
# Per-topic replication factor is set at creation time
# (e.g. kafka-topics --create --replication-factor 3), not as a broker property
```

Allocation principles:
- Replicas are spread evenly across Brokers
- Replicas of the same Partition never share a Broker
- With rack awareness enabled (`broker.rack`), replicas are spread across racks
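The even-spread principle can be illustrated with a naive round-robin assignment (`assign_replicas` is illustrative; Kafka's real algorithm also staggers follower positions with a random offset and honors `broker.rack`):

```python
def assign_replicas(brokers: list[int], partitions: int, rf: int) -> dict[int, list[int]]:
    """Round-robin layout: each partition's replicas land on distinct
    brokers, and leaders (first entry) rotate across the cluster."""
    n = len(brokers)
    assert rf <= n, "replication factor cannot exceed broker count"
    return {
        p: [brokers[(p + i) % n] for i in range(rf)]
        for p in range(partitions)
    }

layout = assign_replicas([0, 1, 2, 3], partitions=4, rf=3)
print(layout[0])  # [0, 1, 2]
print(layout[3])  # [3, 0, 1]
```

Every broker appears as a leader exactly once here, which is the load-balancing property the principles above are after.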
### Replica Decommissioning

- **Graceful decommissioning**
  - Use the kafka-reassign-partitions tool to move partitions off the Broker
  - Migrate leadership away first, then remove the replica
  - No data is lost, because the new replicas fully catch up before the old ones are dropped
- **Failure decommissioning**
  - Leader election is triggered automatically
  - A new Leader is chosen from the ISR
  - Kafka does not automatically recreate the lost replica on another Broker; the partition stays under-replicated until the failed Broker returns or the replica is reassigned manually
## Fault Tolerance Mechanism

### Failure Scenario Handling

- **Follower failure**
  - The Follower is removed from the ISR once it lags beyond `replica.lag.time.max.ms`
  - The Leader continues to serve requests
  - After recovery, the Follower catches up and rejoins the ISR
- **Leader failure**
  - Leader election is triggered
  - A new Leader is chosen from the ISR
  - Messages acknowledged with `acks=all` survive the failover, preserving consistency
- **Multiple replica failures**
  - If the ISR still holds at least `min.insync.replicas` replicas, the partition keeps serving
  - If the ISR shrinks below `min.insync.replicas`, writes with `acks=all` are rejected (NotEnoughReplicas), while reads continue
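The write-acceptance rule can be sketched as a small predicate (`accepts_writes` is a hypothetical name; the point is that the check applies only to producers using `acks=all`):

```python
def accepts_writes(isr_size: int, min_insync: int = 2, acks: str = "all") -> bool:
    """With acks=all, the leader rejects produce requests once the ISR
    shrinks below min.insync.replicas; acks=0/1 bypass the check
    (at the cost of weaker durability)."""
    if acks in ("all", "-1"):
        return isr_size >= min_insync
    return True

print(accepts_writes(isr_size=3))            # True
print(accepts_writes(isr_size=1))            # False: writes rejected
print(accepts_writes(isr_size=1, acks="1"))  # True, but only the leader has the data
```

This is why `min.insync.replicas=2` with replication factor 3 is a common pairing: the partition tolerates one failure without rejecting writes.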
## Performance Optimization

### Replica Count Selection

- Replication factor 1: no fault tolerance, best performance
- Replication factor 2: tolerates a single Broker failure, good performance
- Replication factor 3: recommended configuration; balances performance and reliability
- Replication factor > 3: higher reliability, but replication overhead grows
### Synchronization Optimization

```properties
# Evict slow followers from the ISR sooner (faster failover, but more ISR churn)
replica.lag.time.max.ms=10000
# Tune socket buffers for replication traffic
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
# More I/O threads for log reads and writes
num.io.threads=16
```
## Monitoring Metrics

### Replica Synchronization Metrics

- **UnderReplicatedPartitions**: number of partitions whose ISR is smaller than the assigned replica set
- **IsrShrinksPerSec**: rate of ISR shrinks
- **IsrExpandsPerSec**: rate of ISR expansions
- **OfflinePartitionsCount**: number of partitions without an active Leader
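For intuition, `UnderReplicatedPartitions` is essentially the following count over partition state (computed here from a hypothetical `{name: (ar_size, isr_size)}` snapshot; in practice you read the metric over JMX):

```python
def under_replicated(partitions: dict[str, tuple[int, int]]) -> int:
    """Count partitions whose ISR is smaller than the assigned replica
    set, i.e. what the UnderReplicatedPartitions metric reports."""
    return sum(1 for ar, isr in partitions.values() if isr < ar)

snapshot = {"orders-0": (3, 3), "orders-1": (3, 2), "logs-0": (3, 3)}
print(under_replicated(snapshot))  # 1
```

A sustained non-zero value here is usually the first visible symptom of a failed or lagging Broker.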
### Leader Election Metrics

- **LeaderElectionRateAndTimeMs**: rate and latency of Leader elections
- **ActiveControllerCount**: number of active Controllers (should be exactly 1 across the cluster)
## Best Practices

- **Set the replication factor sensibly**
  - Production environments should use at least 3 replicas
  - Adjust the factor according to how critical the data is
  - Weigh storage cost and performance impact
- **Monitor replica status**
  - Check ISR status regularly
  - Monitor replica sync lag
  - Handle replica anomalies promptly
- **Plan Broker distribution**
  - Place replicas on different physical machines
  - Account for rack and data-center layout
  - Avoid single points of failure
- **Test regularly**
  - Simulate Broker failures
  - Verify the fault-tolerance mechanisms
  - Measure recovery time
- **Have a backup strategy**
  - Back up Kafka data regularly
  - Establish disaster-recovery plans
  - Test the backup and restore procedures
With its replication mechanism configured and managed well, Kafka delivers strong data reliability without giving up good performance.