Answer
Stable ZooKeeper cluster operation depends on comprehensive monitoring and well-defined operational processes.
1. Deployment Architecture
Production Environment Recommended Architecture:
- 5-node cluster (1 Leader + 4 Followers)
- Cross-availability zone deployment
- Independent disk storage for transaction logs
- Load balancer distributing client connections
Deployment Checklist:
```bash
# 1. Check Java version
java -version  # Recommend JDK 8 or 11

# 2. Check network connectivity
ping <other-nodes>

# 3. Check firewall
telnet <node> 2181

# 4. Check disk space
df -h

# 5. Check system resources
free -h
top
```
2. Configuration Management
Core Configuration Parameters:
```properties
# Basic configuration
tickTime=2000
initLimit=10
syncLimit=5
# Client port (2181 is the port used by every command in this answer)
clientPort=2181
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/logs

# Cluster configuration
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888
server.5=zk5:2888:3888

# Performance configuration
maxClientCnxns=100
preAllocSize=65536
snapCount=100000

# Auto cleanup
autopurge.snapRetainCount=3
autopurge.purgeInterval=1

# JVM configuration
# Set in startup script
```
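Each server listed as `server.N` also needs a `myid` file in `dataDir` that contains its own `N`. The following is a minimal sketch of deriving that value from the configuration file; the helper name `myid_for_host` is hypothetical, and it assumes the host name you pass matches the one written in `zoo.cfg` exactly.

```bash
#!/bin/bash
# Hypothetical helper: print the N of the "server.N=host:port:port" line
# whose host matches the second argument.
myid_for_host() {
  local cfg="$1" host="$2"
  # Split on "=" first, then take the host part before the first ":".
  awk -F'=' -v h="$host" '
    $1 ~ /^server\.[0-9]+$/ {
      split($2, addr, ":")
      if (addr[1] == h) { sub(/^server\./, "", $1); print $1 }
    }' "$cfg"
}

# Example usage (paths follow the configuration above):
#   myid_for_host /opt/zookeeper/conf/zoo.cfg zk3 > /data/zookeeper/data/myid
```

If the function prints nothing, the host is not in the server list and the node should not be started.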
Configuration Best Practices:
- Unified configuration management
- Version control configuration files
- Configuration change review
- Gray release configuration
3. Start and Stop
Start Cluster:
```bash
# Start single node
zkServer.sh start

# Start all nodes
for node in zk1 zk2 zk3 zk4 zk5; do
  ssh $node "zkServer.sh start"
done

# Check startup status
zkServer.sh status
```
Stop Cluster:
```bash
# Stop single node
zkServer.sh stop

# Stop all nodes
for node in zk1 zk2 zk3 zk4 zk5; do
  ssh $node "zkServer.sh stop"
done

# Check stop status
jps | grep QuorumPeerMain
```
Rolling Restart:
```bash
# 1. Restart Follower nodes
# 2. Wait for cluster recovery
# 3. Restart Leader node
# 4. Verify cluster status
```
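The ordering above (followers first, leader last) can be sketched as a pair of small functions. This is an illustration, not a hardened tool: the function names `restart_order` and `node_mode` are hypothetical, and the usage example assumes passwordless ssh and `zkServer.sh` on the remote `PATH`, as in the start and stop loops.

```bash
#!/bin/bash
# Read "node mode" lines on stdin and print node names in restart order:
# every non-leader first, the leader last.
restart_order() {
  awk '
    $2 != "leader" { print $1 }
    $2 == "leader" { leader = $1 }
    END { if (leader != "") print leader }'
}

# Print a node's mode (leader / follower / observer) using the stat command.
# Prints nothing if the node is not serving requests.
node_mode() {
  echo stat | nc "$1" 2181 | awk '/^Mode:/ { print $2 }'
}

# Example usage:
#   for n in zk1 zk2 zk3 zk4 zk5; do echo "$n $(node_mode "$n")"; done \
#     | restart_order \
#     | while read -r n; do
#         ssh -n "$n" "zkServer.sh restart"
#         # Wait until the node reports a mode again before moving on.
#         until [ -n "$(node_mode "$n")" ]; do sleep 2; done
#       done
```

Restarting one node at a time and waiting for it to rejoin keeps a majority of servers available throughout the procedure.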
4. Monitoring Metrics
Key Monitoring Metrics:
1. Cluster Status Metrics:
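```bash
# Note: on ZooKeeper 3.5.3 and later, four-letter commands such as stat and
# mntr must be enabled through 4lw.commands.whitelist in zoo.cfg.

# View cluster mode
echo stat | nc localhost 2181
# Mode: leader / follower / observer

# View Zxid
echo stat | nc localhost 2181
# Zxid: 0x1000000002
```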
2. Performance Metrics:
```bash
# View latency
echo mntr | nc localhost 2181 | grep latency
# zk_avg_latency 0.5
# zk_max_latency 10.2

# View throughput
echo mntr | nc localhost 2181 | grep packets
# zk_packets_received 1000000
# zk_packets_sent 1000000
```
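For scripted checks it helps to pull a single value out of the `mntr` output instead of grepping for a substring. A minimal sketch follows; the function names `mntr_value` and `metric_exceeds` are hypothetical, and the threshold in the usage example is only an illustration.

```bash
#!/bin/bash
# Read mntr output ("name<whitespace>value" per line) on stdin and print
# the value of the metric whose name matches exactly.
mntr_value() {
  awk -v key="$1" '$1 == key { print $2 }'
}

# Succeed (exit 0) when the first number is greater than the second.
# awk is used so that fractional values compare correctly.
metric_exceeds() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example usage:
#   latency=$(echo mntr | nc localhost 2181 | mntr_value zk_avg_latency)
#   metric_exceeds "$latency" 10 && echo "High latency: $latency"
```

Matching on the exact metric name avoids accidentally picking up `zk_min_latency` or `zk_max_latency` when only the average is wanted.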
3. Connection Metrics:
```bash
# View connection count
echo cons | nc localhost 2181 | wc -l

# View connection details
echo cons | nc localhost 2181
```
4. Watcher Metrics:
```bash
# View Watcher count
echo wchs | nc localhost 2181
# 100 connections watching 200 paths

# View Watcher details
echo wchp | nc localhost 2181
```
5. Node Metrics:
```bash
# View outstanding sessions and ephemeral nodes (answered by the Leader only)
echo dump | nc localhost 2181

# View node count
echo stat | nc localhost 2181 | grep -E "Node count"
```
5. Alert Configuration
Alert Rules:
1. Latency Alert:
```yaml
# Alert threshold
- alert: ZookeeperHighLatency
  expr: zookeeper_avg_latency > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Zookeeper high latency detected"
```
2. Connection Count Alert:
```yaml
- alert: ZookeeperHighConnections
  expr: zookeeper_num_alive_connections > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Zookeeper high connections detected"
```
3. Memory Alert:
```yaml
- alert: ZookeeperHighMemory
  expr: jvm_memory_used_bytes / jvm_memory_max_bytes > 0.8
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Zookeeper high memory usage detected"
```
4. Node Offline Alert:
```yaml
- alert: ZookeeperNodeDown
  expr: up{job="zookeeper"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Zookeeper node is down"
```
6. Log Management
Log Configuration:
```properties
# log4j.properties
log4j.rootLogger=INFO, ROLLINGFILE
log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.File=/data/zookeeper/logs/zookeeper.log
log4j.appender.ROLLINGFILE.MaxFileSize=100MB
log4j.appender.ROLLINGFILE.MaxBackupIndex=10
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c{2} - %m%n
```
Log Analysis:
```bash
# View error logs
grep ERROR /data/zookeeper/logs/zookeeper.log

# View warning logs
grep WARN /data/zookeeper/logs/zookeeper.log

# Count errors
grep -c ERROR /data/zookeeper/logs/zookeeper.log

# Real-time log monitoring
tail -f /data/zookeeper/logs/zookeeper.log
```
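A per-level summary gives a quicker overview than separate `grep` calls. The sketch below assumes log lines follow the `ConversionPattern` configured above, where the level appears as a standalone word after the thread name; the function name `count_by_level` is hypothetical.

```bash
#!/bin/bash
# Read log lines on stdin and print "LEVEL count" for each level seen.
# The first standalone level keyword on a line is counted, because thread
# names may contain spaces and shift the field position.
count_by_level() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)$/) { n[$i]++; break }
  }
  END { for (l in n) print l, n[l] }' | sort
}

# Example usage:
#   count_by_level < /data/zookeeper/logs/zookeeper.log
```

Unlike `grep -c ERROR`, this does not count lines where the word only appears inside the message text after the level field.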
7. Backup and Recovery
Backup Strategy:
```bash
#!/bin/bash
BACKUP_DIR=/backup/zookeeper/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# 1. Backup transaction logs
cp -r /data/zookeeper/logs "$BACKUP_DIR/"

# 2. Backup snapshot files
cp -r /data/zookeeper/data/version-2 "$BACKUP_DIR/"

# 3. Backup configuration files
cp /opt/zookeeper/conf/zoo.cfg "$BACKUP_DIR/"

# 4. Compress backup
tar -czf "$BACKUP_DIR.tar.gz" "$BACKUP_DIR/"

# 5. Clean backups older than 7 days (never the backup root itself)
find /backup/zookeeper -mindepth 1 -mtime +7 -delete
```
Recovery Process:
```bash
# 1. Stop cluster
zkServer.sh stop

# 2. Restore transaction logs
cp -r /backup/zookeeper/20260120/logs /data/zookeeper/

# 3. Restore snapshot files (the backup script stores version-2 at the top
#    level of the backup directory)
cp -r /backup/zookeeper/20260120/version-2 /data/zookeeper/data/

# 4. Start cluster
zkServer.sh start

# 5. Verify data
zkCli.sh -server localhost:2181 ls /
```
8. Troubleshooting
Common Troubleshooting Steps:
1. Node Cannot Start:
```bash
# Check logs
tail -100 /data/zookeeper/logs/zookeeper.log

# Check port usage
netstat -tlnp | grep 2181

# Check configuration file
cat /opt/zookeeper/conf/zoo.cfg

# Check myid file
cat /data/zookeeper/data/myid
```
2. Cluster Election Failure:
```bash
# Check network connectivity
ping <other-nodes>

# Check firewall (quorum and election ports)
telnet <node> 2888
telnet <node> 3888

# Check node status
echo stat | nc localhost 2181

# Check timing parameters that bound election and sync
grep -E "tickTime|initLimit|syncLimit" /opt/zookeeper/conf/zoo.cfg
```
3. Performance Degradation:
```bash
# Check latency
echo mntr | nc localhost 2181 | grep latency

# Check disk I/O
iostat -x 1

# Check network
sar -n DEV 1

# Check CPU
top
```
9. Capacity Planning
Capacity Assessment:
```bash
# 1. Assess node count
#    Determine cluster scale based on business needs
#    Small scale: 3 nodes
#    Medium scale: 5 nodes
#    Large scale: 7 nodes

# 2. Assess storage needs
#    Transaction logs: expected write volume * retention time
#    Snapshot files: node count * average size * retention count

# 3. Assess network bandwidth
#    Peak throughput * packet size

# 4. Assess client connections
#    Expected client count * concurrent connections
```
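The odd cluster sizes listed above follow from majority quorum: ZooKeeper needs a strict majority of voting servers to be available. A small sketch of that arithmetic, with hypothetical function names:

```bash
#!/bin/bash
# Smallest number of voting servers that forms a strict majority of N.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of voting servers that can fail while a majority remains.
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# Example usage:
#   for n in 3 4 5 6 7; do
#     echo "$n servers: quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n") failures"
#   done
```

A 4-node cluster tolerates one failure, the same as a 3-node cluster, and a 6-node cluster tolerates two, the same as a 5-node cluster, so an even node count adds cost without adding fault tolerance.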
Scaling Process:
```bash
# 1. Prepare new node
#    Install Zookeeper
#    Configure zoo.cfg
#    Create myid file

# 2. Update all node configurations
#    Add new node to server list

# 3. Start new node
zkServer.sh start

# 4. Wait for data sync
#    Monitor sync status

# 5. Verify cluster
echo stat | nc localhost 2181
```
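When adding a node to the server list, its id must not collide with an existing `server.N` entry. The sketch below prints a candidate line using the next free id; the function name `next_server_line` is hypothetical, and ports 2888 and 3888 are assumed to match the existing entries.

```bash
#!/bin/bash
# Print a "server.N=host:2888:3888" line for a new host, where N is one
# greater than the highest id already present in the given zoo.cfg.
next_server_line() {
  local cfg="$1" host="$2"
  awk -F'[.=]' -v h="$host" '
    /^server\.[0-9]+=/ { if ($2 + 0 > max) max = $2 + 0 }
    END { printf "server.%d=%s:2888:3888\n", max + 1, h }' "$cfg"
}

# Example usage:
#   next_server_line /opt/zookeeper/conf/zoo.cfg zk6
```

The same number then goes into the new node's `myid` file, and the printed line is added to `zoo.cfg` on every node.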
10. Security Hardening
Security Configuration:
```properties
# 1. Enable authentication
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl

# 2. Configure ACL
#    Specify ACL when creating nodes

# 3. Network isolation
#    Use firewall to restrict access
#    Use VPN or dedicated network

# 4. Log auditing
#    Record all operation logs
```
Security Check:
```bash
# 1. Check ACL configuration
zkCli.sh -server localhost:2181 getAcl /

# 2. Check authentication status
echo envi | nc localhost 2181 | grep -E "auth"

# 3. Check network connections
netstat -tlnp | grep 2181

# 4. Check log auditing
grep "auth" /data/zookeeper/logs/zookeeper.log
```
11. Operations Automation
Automation Scripts:
1. Health check script:
```bash
#!/bin/bash
for node in zk1 zk2 zk3 zk4 zk5; do
  status=$(echo stat | nc $node 2181 | grep -E "Mode")
  echo "$node: $status"
done
```
2. Auto backup script: see the Backup Strategy section.
3. Auto cleanup script:
```bash
#!/bin/bash
# Clean old snapshots
# Note: this does not check that a recent snapshot remains; the autopurge
# settings in zoo.cfg always retain the newest snapshots.
find /data/zookeeper/data/version-2 -name "snapshot.*" -mtime +7 -delete
```
4. Monitoring script:
```bash
#!/bin/bash
# Monitor latency
latency=$(echo mntr | nc localhost 2181 | grep avg_latency | awk '{print $2}')
if [ "$(echo "$latency > 10" | bc)" -eq 1 ]; then
  echo "High latency: $latency"
fi
```
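The health check prints one `node: Mode: ...` line per server, which still has to be read by a person. A minimal sketch of turning that output into a pass/fail result follows; the function name `ensemble_healthy` is hypothetical, and "healthy" is assumed here to mean exactly one leader plus a majority of the expected servers responding as leader or follower.

```bash
#!/bin/bash
# Read "node: Mode: <mode>" lines on stdin. The single argument is the
# expected number of voting servers. Succeeds (exit 0) only if exactly one
# leader is reported and a majority of voters answered.
ensemble_healthy() {
  local expected="$1"
  awk -v n="$expected" '
    /Mode: leader/            { leaders++ }
    /Mode: (leader|follower)/ { voters++ }
    END { exit !(leaders == 1 && voters >= int(n / 2) + 1) }'
}

# Example usage, with health_check.sh being the health check script saved
# under a name of your choice:
#   ./health_check.sh | ensemble_healthy 5 || echo "Ensemble unhealthy"
```

Nodes that do not answer produce a line without a mode and are simply not counted, so an unreachable minority does not fail the check while a lost quorum does.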
12. Operations Documentation
Documentation Checklist:
- Deployment documentation
- Configuration documentation
- Monitoring documentation
- Troubleshooting documentation
- Backup and recovery documentation
- Security documentation
- Change records
- Contact information
Change Management:
- Change request
- Change review
- Change implementation
- Change verification
- Change record