How to perform Zookeeper operations and monitoring? What are the key metrics and alert rules?

February 21, 16:24

Answer

Stable Zookeeper operation depends on two things: a monitoring system that covers cluster state, performance, and resource usage, and well-defined operational processes.

1. Deployment Architecture

Recommended Production Architecture:

  • 5-node cluster (1 Leader + 4 Followers)
  • Cross-availability zone deployment
  • Independent disk storage for transaction logs
  • Load balancer distributing client connections
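
The choice of five nodes follows from majority-quorum arithmetic: an ensemble stays available only while more than half of its voting members are up. A minimal sketch of that calculation, where the function names quorum_size and tolerated_failures are made up for illustration:

```shell
# quorum_size N: smallest majority of an N-node voting ensemble.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: how many voters can fail while a majority remains.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "nodes=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

A 4-node ensemble tolerates the same single failure as a 3-node one, which is why odd sizes are the usual choice; five nodes tolerate two failures.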

Deployment Checklist:

bash
# 1. Check Java version (JDK 8 or 11 recommended)
java -version
# 2. Check network connectivity
ping <other-nodes>
# 3. Check that the client port is reachable through the firewall
telnet <node> 2181
# 4. Check disk space
df -h
# 5. Check system resources
free -h
top

2. Configuration Management

Core Configuration Parameters:

properties
# Basic configuration
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/logs
# Cluster configuration
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888
server.5=zk5:2888:3888
# Performance configuration
maxClientCnxns=100
preAllocSize=65536
snapCount=100000
# Auto cleanup
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
# JVM configuration
# Set in startup script
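
initLimit and syncLimit are expressed in ticks, so their real durations depend on tickTime. A minimal sketch of the arithmetic, using the values from the configuration above; the helper name ticks_to_ms is made up for illustration, and the session timeout bounds assume the ZooKeeper defaults of 2 and 20 ticks:

```shell
# ticks_to_ms TICK_MS TICKS: convert a tick-based limit to milliseconds.
ticks_to_ms() {
  echo $(( $1 * $2 ))
}

TICK_TIME=2000   # tickTime from zoo.cfg, in milliseconds

echo "initLimit window:    $(ticks_to_ms "$TICK_TIME" 10) ms"
echo "syncLimit window:    $(ticks_to_ms "$TICK_TIME" 5) ms"
# Assumed defaults: minSessionTimeout = 2 ticks, maxSessionTimeout = 20 ticks
echo "min session timeout: $(ticks_to_ms "$TICK_TIME" 2) ms"
echo "max session timeout: $(ticks_to_ms "$TICK_TIME" 20) ms"
```

With these values a follower has 20 seconds to connect to the leader and sync at startup, and may lag at most 10 seconds behind the leader before it is dropped.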

Configuration Best Practices:

  • Manage configuration centrally
  • Keep configuration files under version control
  • Review configuration changes before applying them
  • Roll out configuration changes gradually (gray release)

3. Start and Stop

Start Cluster:

bash
# Start single node
zkServer.sh start
# Start all nodes
for node in zk1 zk2 zk3 zk4 zk5; do
  ssh "$node" "zkServer.sh start"
done
# Check startup status
zkServer.sh status

Stop Cluster:

bash
# Stop single node
zkServer.sh stop
# Stop all nodes
for node in zk1 zk2 zk3 zk4 zk5; do
  ssh "$node" "zkServer.sh stop"
done
# Check stop status (no output means the process has exited)
jps | grep QuorumPeerMain

Rolling Restart:

bash
# 1. Restart Follower nodes one at a time
# 2. Wait for cluster recovery after each restart
# 3. Restart Leader node last
# 4. Verify cluster status
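
The ordering in the rolling restart can be derived from each node's reported mode. A minimal sketch, assuming input lines of the form "host mode"; the function name restart_order and the way the list is built are illustrative, not part of Zookeeper:

```shell
# restart_order: reads "host mode" lines on stdin and prints hosts in
# restart order: followers and observers first, the leader last.
restart_order() {
  awk '
    NF >= 2 && $2 != "leader" { print $1 }
    NF >= 2 && $2 == "leader" { leader = $1 }
    END { if (leader != "") print leader }
  '
}

# Hypothetical usage: build the list from each node's reported mode.
# for node in zk1 zk2 zk3 zk4 zk5; do
#   mode=$(echo srvr | nc "$node" 2181 | awk '/^Mode:/ { print $2 }')
#   echo "$node $mode"
# done | restart_order
```

Restarting the leader last limits the procedure to a single leader election.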

4. Monitoring Metrics

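Key Monitoring Metrics (collected with four-letter-word commands; on Zookeeper 3.5.3 and later these must be enabled through 4lw.commands.whitelist in zoo.cfg):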

1. Cluster Status Metrics:

bash
# View cluster mode
echo stat | nc localhost 2181
# Mode: leader / follower / observer
# View Zxid
echo stat | nc localhost 2181
# Zxid: 0x1000000002

2. Performance Metrics:

bash
# View latency
echo mntr | nc localhost 2181 | grep latency
# zk_avg_latency 0.5
# zk_max_latency 10.2
# View throughput
echo mntr | nc localhost 2181 | grep packets
# zk_packets_received 1000000
# zk_packets_sent 1000000
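
For scripted checks it helps to pull a single value out of the mntr output and compare it against a threshold. A minimal sketch, assuming mntr prints one "key value" pair per line as shown above; the function names mntr_value and exceeds are made up for illustration:

```shell
# mntr_value KEY: reads mntr output on stdin and prints the value of KEY.
mntr_value() {
  awk -v key="$1" '$1 == key { print $2; exit }'
}

# exceeds VALUE LIMIT: exit status 0 when VALUE > LIMIT.
# awk does the comparison, so fractional values such as 0.5 work.
exceeds() {
  awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v + 0 > limit + 0) }'
}

# Hypothetical usage against a live node:
# latency=$(echo mntr | nc localhost 2181 | mntr_value zk_avg_latency)
# if exceeds "$latency" 10; then echo "High latency: $latency"; fi
```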

3. Connection Metrics:

bash
# View connection count
echo cons | nc localhost 2181 | wc -l
# View connection details
echo cons | nc localhost 2181

4. Watcher Metrics:

bash
# View Watcher count
echo wchs | nc localhost 2181
# 100 connections watching 200 paths
# View Watcher details (grouped by path)
echo wchp | nc localhost 2181

5. Node Metrics:

bash
# View outstanding sessions and ephemeral nodes
echo dump | nc localhost 2181
# View node count
echo stat | nc localhost 2181 | grep -E "Node count"

5. Alert Configuration

Alert Rules:

1. Latency Alert:

yaml
# Alert threshold
- alert: ZookeeperHighLatency
  expr: zookeeper_avg_latency > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Zookeeper high latency detected"

2. Connection Count Alert:

yaml
- alert: ZookeeperHighConnections
  expr: zookeeper_num_alive_connections > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Zookeeper high connections detected"

3. Memory Alert:

yaml
- alert: ZookeeperHighMemory
  expr: jvm_memory_used_bytes / jvm_memory_max_bytes > 0.8
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Zookeeper high memory usage detected"

4. Node Offline Alert:

yaml
- alert: ZookeeperNodeDown
  expr: up{job="zookeeper"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Zookeeper node is down"

6. Log Management

Log Configuration:

properties
# log4j.properties
log4j.rootLogger=INFO, ROLLINGFILE
log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.File=/data/zookeeper/logs/zookeeper.log
log4j.appender.ROLLINGFILE.MaxFileSize=100MB
log4j.appender.ROLLINGFILE.MaxBackupIndex=10
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c{2} - %m%n

Log Analysis:

bash
# View error logs
grep ERROR /data/zookeeper/logs/zookeeper.log
# View warning logs
grep WARN /data/zookeeper/logs/zookeeper.log
# Count errors
grep -c ERROR /data/zookeeper/logs/zookeeper.log
# Real-time log monitoring
tail -f /data/zookeeper/logs/zookeeper.log

7. Backup and Recovery

Backup Strategy:

bash
#!/bin/bash
BACKUP_DIR=/backup/zookeeper/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"
# 1. Backup transaction logs
cp -r /data/zookeeper/logs "$BACKUP_DIR/"
# 2. Backup snapshot files
cp -r /data/zookeeper/data/version-2 "$BACKUP_DIR/"
# 3. Backup configuration files
cp /opt/zookeeper/conf/zoo.cfg "$BACKUP_DIR/"
# 4. Compress backup
tar -czf "$BACKUP_DIR.tar.gz" "$BACKUP_DIR/"
# 5. Clean backups older than 7 days
find /backup/zookeeper -mtime +7 -delete

Recovery Process:

bash
# 1. Stop cluster (run on every node)
zkServer.sh stop
# 2. Restore transaction logs
cp -r /backup/zookeeper/20260120/logs /data/zookeeper/
# 3. Restore snapshot files (the backup script stores version-2 at the top of the backup directory)
cp -r /backup/zookeeper/20260120/version-2 /data/zookeeper/data/
# 4. Start cluster
zkServer.sh start
# 5. Verify data
zkCli.sh -server localhost:2181 ls /

8. Troubleshooting

Common Troubleshooting Steps:

1. Node Cannot Start:

bash
# Check logs
tail -100 /data/zookeeper/logs/zookeeper.log
# Check port usage
netstat -tlnp | grep 2181
# Check configuration file
cat /opt/zookeeper/conf/zoo.cfg
# Check myid file
cat /data/zookeeper/data/myid
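
The last two checks can be combined: one common cause of a node failing to join is a myid value with no matching server.N entry in zoo.cfg. A minimal sketch of that consistency check; the function name check_myid is made up for illustration, and the paths in the usage comment are the ones used above:

```shell
# check_myid CFG MYID_FILE: succeeds when the id stored in MYID_FILE
# has a matching "server.<id>=" line in CFG.
check_myid() {
  id=$(tr -d '[:space:]' < "$2") || return 1
  [ -n "$id" ] || return 1
  grep -q "^server\.$id=" "$1"
}

# Hypothetical usage:
# if check_myid /opt/zookeeper/conf/zoo.cfg /data/zookeeper/data/myid; then
#   echo "myid matches a server entry"
# else
#   echo "myid is missing, empty, or not listed in zoo.cfg"
# fi
```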

2. Cluster Election Failure:

bash
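# Check network connectivity
ping <other-nodes>
# Check that the quorum and election ports are reachable through the firewall
telnet <node> 2888
telnet <node> 3888
# Check node status
echo stat | nc localhost 2181
# Check the timing settings that govern quorum connection and sync
grep -E "tickTime|initLimit|syncLimit" /opt/zookeeper/conf/zoo.cfg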

3. Performance Degradation:

bash
# Check latency
echo mntr | nc localhost 2181 | grep latency
# Check disk I/O
iostat -x 1
# Check network
sar -n DEV 1
# Check CPU
top

9. Capacity Planning

Capacity Assessment:

bash
# 1. Assess node count
#    Determine cluster scale based on business needs
#    Small scale: 3 nodes
#    Medium scale: 5 nodes
#    Large scale: 7 nodes
# 2. Assess storage needs
#    Transaction logs: expected write volume * retention time
#    Snapshot files: node count * average size * retention count
# 3. Assess network bandwidth
#    Peak throughput * packet size
# 4. Assess client connections
#    Expected client count * concurrent connections
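
The storage formulas in step 2 can be worked through with concrete numbers. A rough sketch in integer arithmetic; every input value below is an illustrative assumption rather than a measurement, and the function name estimate_storage_mb is made up:

```shell
# estimate_storage_mb WRITE_MB_PER_DAY RETENTION_DAYS ZNODES AVG_ZNODE_BYTES SNAP_RETAIN
# Transaction logs: write volume * retention time
# Snapshots:        znode count * average size * retention count
estimate_storage_mb() {
  txn_mb=$(( $1 * $2 ))
  snap_mb=$(( $3 * $4 * $5 / 1024 / 1024 ))
  echo "txn_log_mb=$txn_mb snapshot_mb=$snap_mb total_mb=$(( txn_mb + snap_mb ))"
}

# Assumed inputs: 500 MB of writes per day kept for 7 days,
# 1,000,000 znodes of about 1 KB each, 3 retained snapshots.
estimate_storage_mb 500 7 1000000 1024 3
# prints: txn_log_mb=3500 snapshot_mb=2929 total_mb=6429
```

On-disk snapshots also carry per-znode metadata, so a figure like this is best treated as a lower bound with headroom added on top.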

Scaling Process:

bash
# 1. Prepare new node
#    Install Zookeeper
#    Configure zoo.cfg
#    Create myid file
# 2. Update all node configurations
#    Add new node to server list
# 3. Start new node
zkServer.sh start
# 4. Wait for data sync
#    Monitor sync status
# 5. Verify cluster
echo stat | nc localhost 2181

10. Security Hardening

Security Configuration:

properties
# 1. Enable authentication
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
# 2. Configure ACL
#    Specify ACL when creating nodes
# 3. Network isolation
#    Use firewall to restrict access
#    Use VPN or dedicated network
# 4. Log auditing
#    Record all operation logs

Security Check:

bash
# 1. Check ACL configuration
zkCli.sh -server localhost:2181 getAcl /
# 2. Check authentication status
echo envi | nc localhost 2181 | grep -E "auth"
# 3. Check network connections
netstat -tlnp | grep 2181
# 4. Check log auditing
grep "auth" /data/zookeeper/logs/zookeeper.log

11. Operations Automation

Automation Scripts:

bash
# 1. Health check script
#!/bin/bash
for node in zk1 zk2 zk3 zk4 zk5; do
  status=$(echo stat | nc "$node" 2181 | grep -E "Mode")
  echo "$node: $status"
done
# 2. Auto backup script
# See backup strategy section
# 3. Auto cleanup script
#!/bin/bash
# Clean old snapshots
find /data/zookeeper/data/version-2 -name "snapshot.*" -mtime +7 -delete
# 4. Monitoring script
#!/bin/bash
# Monitor latency
latency=$(echo mntr | nc localhost 2181 | grep avg_latency | awk '{print $2}')
if [ "$(echo "$latency > 10" | bc)" -eq 1 ]; then
  echo "High latency: $latency"
fi
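
The health check above prints each node's mode but leaves interpretation to the reader. A minimal sketch of a follow-up step that fails unless exactly one leader is reported, assuming "host mode" input lines; the function name summarize_modes is made up for illustration:

```shell
# summarize_modes: reads "host mode" lines on stdin, prints leader and
# follower counts, and exits non-zero unless exactly one leader is seen.
summarize_modes() {
  awk '
    NF >= 2 { count[$2]++ }
    END {
      printf "leaders=%d followers=%d\n", count["leader"], count["follower"]
      exit (count["leader"] == 1 ? 0 : 1)
    }
  '
}

# Hypothetical usage:
# for node in zk1 zk2 zk3 zk4 zk5; do
#   mode=$(echo stat | nc "$node" 2181 | awk '/^Mode:/ { print $2 }')
#   echo "$node $mode"
# done | summarize_modes || echo "Cluster does not have exactly one leader"
```

A non-zero exit status makes the check easy to wire into cron or an existing alerting pipeline.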

12. Operations Documentation

Documentation Checklist:

  1. Deployment documentation
  2. Configuration documentation
  3. Monitoring documentation
  4. Troubleshooting documentation
  5. Backup and recovery documentation
  6. Security documentation
  7. Change records
  8. Contact information

Change Management:

  1. Change request
  2. Change review
  3. Change implementation
  4. Change verification
  5. Change record
Tags: Zookeeper