Answer
ZooKeeper's Leader election mechanism is central to keeping the cluster highly available. It is implemented as part of the ZAB (ZooKeeper Atomic Broadcast) protocol.
Election Trigger Timing
- During cluster startup: All nodes participate in election to elect a Leader
- When Leader fails: Followers detect Leader failure and trigger re-election
- When Leader exits voluntarily: Leader shuts down normally, triggering election
Election Algorithm
ZooKeeper uses the Fast Leader Election algorithm (the `FastLeaderElection` class):
Voting Structure:
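- sid: Server ID, unique per node, set in the node's `myid` file and matching its `server.<sid>` entry in the configuration file
- zxid: ID of the last transaction the node has logged; a 64-bit value (high 32 bits: Leader epoch, low 32 bits: counter), so a larger zxid means newer data
- epoch: Election round (logical clock); a node increments it when it starts a new election, and votes from older rounds are ignored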
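To make the zxid field concrete: in ZooKeeper the zxid is a 64-bit number whose upper 32 bits hold the Leader epoch and whose lower 32 bits hold a per-epoch counter. The sketch below only illustrates that layout; the helper names `make_zxid` and `split_zxid` are made up for this example and are not ZooKeeper APIs.

```python
def make_zxid(epoch, counter):
    """Pack an epoch and a counter into a 64-bit zxid (illustrative helper)."""
    return ((epoch & 0xFFFFFFFF) << 32) | (counter & 0xFFFFFFFF)


def split_zxid(zxid):
    """Unpack a 64-bit zxid into (epoch, counter) (illustrative helper)."""
    return (zxid >> 32) & 0xFFFFFFFF, zxid & 0xFFFFFFFF


zxid = make_zxid(3, 7)
print(hex(zxid))         # 0x300000007
print(split_zxid(zxid))  # (3, 7)

# Any transaction from a newer epoch has a larger zxid than every
# transaction from an older epoch, whatever the counters are.
print(make_zxid(4, 0) > make_zxid(3, 999))  # True
```

Because the epoch sits in the high bits, a plain integer comparison of two zxids already orders transactions by epoch first and by counter second.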
Election Rules:
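- Compare epoch first: A vote carrying a newer epoch takes priority over one from an older epoch
- Then compare zxid: With equal epochs, the larger zxid means newer data and takes priority
- Then compare sid: With equal epoch and zxid, the larger sid takes priority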
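The comparison can be sketched as a small function. This is a simplified model rather than ZooKeeper's actual code: the `Vote` type and `is_better` are hypothetical names, and the sketch assumes the epoch is compared ahead of zxid and sid, in the order of the (epoch, zxid, sid) vote tuple.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    """Simplified vote for this example: (epoch, zxid, sid)."""
    epoch: int
    zxid: int
    sid: int


def is_better(new: Vote, current: Vote) -> bool:
    """Return True if `new` should replace `current` as a node's vote."""
    if new.epoch != current.epoch:   # assumption: newer epoch wins outright
        return new.epoch > current.epoch
    if new.zxid != current.zxid:     # newer data wins
        return new.zxid > current.zxid
    return new.sid > current.sid     # tie-break on the larger server ID


# A larger zxid beats a larger sid.
print(is_better(Vote(1, 0x100000005, 1), Vote(1, 0x100000003, 3)))  # True
# With equal zxid, the larger sid wins.
print(is_better(Vote(1, 0x100000005, 2), Vote(1, 0x100000005, 3)))  # False
```

The sid tie-break matters mostly at first startup, when every node has the same zxid and the rule still yields a single winner.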
Election Process
- Initialize voting:
  - Each node votes for itself first
  - Voting information: (epoch, zxid, sid)
- Vote exchange:
  - Nodes broadcast their votes to each other
  - A node that receives a better vote (per the election rules) adopts it and rebroadcasts
- Vote counting:
  - Each node counts the votes for each candidate
  - A candidate supported by more than half of the voting nodes wins
- Election complete:
  - Winner becomes Leader
  - Other nodes become Followers
  - Leader starts processing requests once a majority of Followers have synchronized with it
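The steps above can be sketched as a toy, single-process simulation. It assumes every node sees every other node's vote in one exchange and that the input is non-empty; real elections are asynchronous and message-driven. `run_election` and `Vote` are hypothetical names for this example, not ZooKeeper APIs.

```python
from collections import Counter
from typing import Dict, NamedTuple, Optional, Tuple


class Vote(NamedTuple):
    """Simplified vote: tuples compare as (epoch, zxid, sid)."""
    epoch: int
    zxid: int
    sid: int


def run_election(nodes: Dict[int, int], epoch: int = 1) -> Tuple[Optional[int], Dict[int, str]]:
    """nodes maps sid -> last zxid. Returns (leader sid or None, sid -> state)."""
    # 1. Initialize voting: every node votes for itself.
    votes = {sid: Vote(epoch, zxid, sid) for sid, zxid in nodes.items()}

    # 2. Vote exchange: each node adopts the best vote it has seen.
    #    Assumption: all votes reach all nodes in a single exchange.
    best = max(votes.values())
    votes = {sid: max(own, best) for sid, own in votes.items()}

    # 3. Vote counting: a candidate needs more than half of the nodes.
    tally = Counter(vote.sid for vote in votes.values())
    candidate, count = tally.most_common(1)[0]
    if count <= len(nodes) // 2:
        return None, {sid: "LOOKING" for sid in nodes}

    # 4. Election complete: winner leads, everyone else follows.
    states = {sid: "LEADING" if sid == candidate else "FOLLOWING" for sid in nodes}
    return candidate, states


leader, states = run_election({1: 0x100000005, 2: 0x100000007, 3: 0x100000007})
print(leader)  # 3
print(states)  # {1: 'FOLLOWING', 2: 'FOLLOWING', 3: 'LEADING'}
```

Nodes 2 and 3 hold the same, newest zxid, so the larger sid decides and node 3 is elected.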
Election States
Nodes have the following states during election:
- LOOKING: Looking for Leader, participating in election
- FOLLOWING: Found Leader, running as Follower
- LEADING: Running as Leader
- OBSERVING: Running as Observer (non-voting; does not take part in election)
Election Optimization
Fast Election:
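- Each node switches its vote to the best candidate it has seen, i.e. the one with the most up-to-date data (largest zxid)
- Votes therefore converge quickly, which reduces voting rounds and speeds up election
Vote Validation:
- Check each received vote before counting it, e.g. discard votes from an older election epoch or from servers that are not voting members
- Prevent invalid votes from interfering with election
Timeout Mechanism:
- Nodes wait for votes with a bounded timeout and resend their own vote when it expires
- This keeps a lost message or a slow node from blocking the election indefinitely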
Cluster Scale Impact
- 3-node cluster: 2 nodes agreeing is sufficient for election
- 5-node cluster: 3 nodes agreeing is sufficient for election
- 7-node cluster: 4 nodes agreeing is sufficient for election
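These figures follow from the majority rule: a cluster of n voting nodes needs floor(n/2) + 1 agreeing nodes. A minimal sketch, with `quorum_size` and `tolerated_failures` as illustrative helper names:

```python
def quorum_size(n: int) -> int:
    """Smallest number of voting nodes that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many voting nodes can fail while a quorum remains."""
    return n - quorum_size(n)


for n in (3, 4, 5, 7):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 7 4 3
```

A 4-node cluster tolerates only one failure, the same as a 3-node cluster, which is why odd cluster sizes are usually preferred.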
Considerations
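- Split-brain problem: Avoided by the majority requirement, since two disjoint groups cannot both contain more than half of the voting nodes
- Network partition: Only a partition that still contains a majority of voting nodes can elect a Leader; a minority partition cannot
- Election time: Usually completes within a few seconds, depending on network latency and timeout settings
- Data consistency: No write requests are processed during election, because all writes must go through the Leader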