How to perform ZooKeeper operations and monitoring? What are the key metrics and alert rules? - Interview Question

Answer

ZooKeeper operations and monitoring keep the cluster stable. Doing them well takes a monitoring system with clear metrics and alert thresholds, plus documented operational procedures.

1. Deployment Architecture

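Recommended Production Architecture:

5-node cluster (1 Leader + 4 Followers), which tolerates 2 node failures
Deployment across availability zones
Dedicated disk for transaction logs (dataLogDir), separate from snapshots
Client connections spread across all nodes (full server list in the client connection string, or a load balancer)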
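
The 5-node recommendation follows from majority quorum: the cluster stays available while more than half of the voting nodes are up. A minimal sketch of that arithmetic (the function names are illustrative, not part of ZooKeeper):

```bash
# Majority quorum size for an ensemble of N voting servers: floor(N/2) + 1.
quorum_size() (
    n=$1
    echo $(( n / 2 + 1 ))
)

# Number of voting servers that can fail while a quorum still exists.
tolerated_failures() (
    n=$1
    echo $(( n - (n / 2 + 1) ))
)

for n in 3 4 5 6 7; do
    echo "ensemble=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

An even-sized ensemble tolerates no more failures than the next smaller odd size, which is why odd node counts are the usual choice.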

Deployment Checklist:

bash
# 1. Check Java version
java -version  # Recommend JDK 8 or 11

# 2. Check network connectivity
ping <other-nodes>

# 3. Check firewall (client port 2181; quorum and election ports 2888 and 3888 between servers)
telnet <node> 2181

# 4. Check disk space
df -h

# 5. Check system resources
free -h
top

2. Configuration Management

Core Configuration Parameters:

properties
# Basic configuration
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/logs

# Cluster configuration
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888
server.5=zk5:2888:3888

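# Performance configuration
# Maximum concurrent connections per client IP address
maxClientCnxns=100
# Transaction log preallocation size, in KB
preAllocSize=65536
# Approximate number of transactions between snapshots
snapCount=100000

# Auto cleanup
# Number of recent snapshots (and matching transaction logs) to keep
autopurge.snapRetainCount=3
# Purge interval in hours; 0 disables auto cleanup
autopurge.purgeInterval=1

# JVM configuration
# Set heap size and GC options in the startup environment (for example SERVER_JVMFLAGS in conf/java.env), not in zoo.cfg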
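
tickTime is the base unit for the other timing parameters, so the effective timeouts are products rather than the raw numbers in the file. A small sketch that derives them from a zoo.cfg-style file; the helper names are hypothetical, and it assumes plain `key=value` lines with no spaces around `=`:

```bash
# Read the last value of a key from a zoo.cfg-style file.
# Commented-out lines do not match because their first field starts with '#'.
cfg_value() {
    # $1 = config file, $2 = key
    awk -F= -v key="$2" '$1 == key { print $2 }' "$1" | tail -n 1
}

# Print the timeouts implied by tickTime, initLimit and syncLimit.
print_quorum_timeouts() (
    cfg=$1
    tick=$(cfg_value "$cfg" tickTime)
    init=$(cfg_value "$cfg" initLimit)
    sync=$(cfg_value "$cfg" syncLimit)
    echo "follower initial sync timeout: $(( tick * init )) ms"
    echo "follower sync timeout: $(( tick * sync )) ms"
    # Default session timeout bounds are 2x and 20x tickTime.
    echo "session timeout range: $(( tick * 2 ))-$(( tick * 20 )) ms"
)
```

Usage: `print_quorum_timeouts /opt/zookeeper/conf/zoo.cfg`. With the values above, followers get 20 seconds to connect and sync at startup and 10 seconds to stay in sync afterwards.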

Configuration Best Practices:

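Manage configuration centrally so every node uses the same zoo.cfg
Keep configuration files under version control
Review every configuration change before rollout
Roll out changes gradually (canary node first, then the rest)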

3. Start and Stop

Start Cluster:

bash
# Start single node
zkServer.sh start

# Start all nodes
for node in zk1 zk2 zk3 zk4 zk5; do
    ssh $node "zkServer.sh start"
done

# Check startup status
zkServer.sh status

Stop Cluster:

bash
# Stop single node
zkServer.sh stop

# Stop all nodes
for node in zk1 zk2 zk3 zk4 zk5; do
    ssh $node "zkServer.sh stop"
done

# Check stop status
jps | grep QuorumPeerMain

Rolling Restart:

bash
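# Restart one node at a time so the cluster never loses quorum
# 1. Restart Follower nodes one by one
# 2. After each restart, wait until the node rejoins (zkServer.sh status reports a Mode)
# 3. Restart the Leader node last, so only one leader election is triggered
# 4. Verify cluster status: one leader, all other nodes followers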
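
The rolling restart steps can be sketched as a script. This is an illustration under assumptions, not a hardened tool: it assumes passwordless ssh to each node, `zkServer.sh` on the remote PATH, and the function names are made up for this example.

```bash
# Read "host mode" pairs on stdin and print hosts in restart order:
# every non-leader first (input order preserved), the leader last.
restart_order() {
    awk '$2 != "leader" { print $1 }
         $2 == "leader" { leader = $1 }
         END { if (leader != "") print leader }'
}

# Hypothetical driver: restart one node at a time and wait until it
# reports a Mode again before moving on to the next node.
rolling_restart() {
    for node in "$@"; do
        mode=$(ssh -n "$node" "zkServer.sh status" 2>/dev/null | awk '/^Mode:/ { print $2 }')
        echo "$node ${mode:-unknown}"
    done | restart_order | while read -r node; do
        # ssh -n keeps ssh from consuming the remaining host list on stdin.
        ssh -n "$node" "zkServer.sh restart"
        until ssh -n "$node" "zkServer.sh status" 2>/dev/null | grep -q '^Mode:'; do
            sleep 5
        done
    done
}
```

Usage: `rolling_restart zk1 zk2 zk3 zk4 zk5`. Restarting the leader last means the cluster goes through one election instead of several.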

4. Monitoring Metrics

Key Monitoring Metrics:

1. Cluster Status Metrics:

bash
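# On ZooKeeper 3.5.3+, four-letter-word commands other than srvr must be enabled
# with 4lw.commands.whitelist in zoo.cfg before the commands in this section work

# View cluster mode and latest Zxid
echo stat | nc localhost 2181
# Mode: leader / follower / observer
# Zxid: 0x1000000002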

2. Performance Metrics:

bash
# View latency
echo mntr | nc localhost 2181 | grep latency
# zk_avg_latency 0.5
# zk_max_latency 10.2

# View throughput
echo mntr | nc localhost 2181 | grep packets
# zk_packets_received 1000000
# zk_packets_sent 1000000
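
The packet counters are cumulative, so throughput is the difference between two samples divided by the sampling interval. A minimal sketch of both steps; the helper names are assumptions for this example:

```bash
# Extract one metric value from `mntr` output supplied on stdin.
mntr_value() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Average rate per second between two samples of a cumulative counter.
packet_rate() (
    before=$1; after=$2; seconds=$3
    echo $(( (after - before) / seconds ))
)
```

Usage: take `echo mntr | nc localhost 2181 | mntr_value zk_packets_received` twice, 60 seconds apart, then pass both values to `packet_rate <first> <second> 60`.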

3. Connection Metrics:

bash
# View connection count
echo mntr | nc localhost 2181 | grep zk_num_alive_connections

# View connection details (one line per client connection)
echo cons | nc localhost 2181

4. Watcher Metrics:

bash
# View Watcher count
echo wchs | nc localhost 2181
# 100 connections watching 200 paths

# View Watcher details
echo wchp | nc localhost 2181

5. Node Metrics:

bash
# List outstanding sessions and their ephemeral nodes (only works on the leader)
echo dump | nc localhost 2181

# View znode count
echo stat | nc localhost 2181 | grep "Node count"

5. Alert Configuration

Alert Rules:

1. Latency Alert:

yaml
# Alert threshold: average request latency in ms (metric names depend on the exporter in use)
- alert: ZookeeperHighLatency
  expr: zookeeper_avg_latency > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Zookeeper high latency detected"

2. Connection Count Alert:

yaml
- alert: ZookeeperHighConnections
  expr: zookeeper_num_alive_connections > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Zookeeper high connections detected"

3. Memory Alert:

yaml
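- alert: ZookeeperHighMemory
  # Compare heap only; non-heap pools may report no maximum
  expr: sum by (instance) (jvm_memory_used_bytes{area="heap"}) / sum by (instance) (jvm_memory_max_bytes{area="heap"}) > 0.8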
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Zookeeper high memory usage detected"

4. Node Offline Alert:

yaml
- alert: ZookeeperNodeDown
  expr: up{job="zookeeper"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Zookeeper node is down"

6. Log Management

Log Configuration:

properties
# log4j.properties (log4j 1.x; ZooKeeper 3.8+ uses logback.xml instead)
log4j.rootLogger=INFO, ROLLINGFILE

log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.File=/data/zookeeper/logs/zookeeper.log
log4j.appender.ROLLINGFILE.MaxFileSize=100MB
log4j.appender.ROLLINGFILE.MaxBackupIndex=10
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c{2} - %m%n

Log Analysis:

bash
# View error logs
grep ERROR /data/zookeeper/logs/zookeeper.log

# View warning logs
grep WARN /data/zookeeper/logs/zookeeper.log

# Count errors
grep -c ERROR /data/zookeeper/logs/zookeeper.log

# Real-time log monitoring
tail -f /data/zookeeper/logs/zookeeper.log
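
For a quick overview, the greps above can be folded into one pass that counts lines per log level. A minimal sketch, assuming the `[thread] LEVEL` layout produced by the ConversionPattern shown earlier; `log_level_counts` is a hypothetical helper name:

```bash
# Count log lines per level in a ZooKeeper log file.
# Relies on the "[thread] LEVEL " layout, so the level follows "] ".
log_level_counts() {
    awk '
        match($0, /\] (ERROR|WARN|INFO|DEBUG) /) {
            # Skip the leading "] " and drop the trailing space.
            level = substr($0, RSTART + 2, RLENGTH - 3)
            count[level]++
        }
        END { for (l in count) print l, count[l] }
    ' "$1" | sort
}
```

Usage: `log_level_counts /data/zookeeper/logs/zookeeper.log`. A sudden rise in WARN or ERROR counts between runs is a cheap signal that the log deserves a closer look.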

7. Backup and Recovery

Backup Strategy:

bash
#!/bin/bash
# Paths follow dataDir and dataLogDir from zoo.cfg
BACKUP_DIR=/backup/zookeeper/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# 1. Backup transaction logs
cp -r /data/zookeeper/logs "$BACKUP_DIR/"

# 2. Backup snapshot files
cp -r /data/zookeeper/data/version-2 "$BACKUP_DIR/"

# 3. Backup configuration files
cp /opt/zookeeper/conf/zoo.cfg "$BACKUP_DIR/"

# 4. Compress backup (relative paths inside the archive)
tar -czf "$BACKUP_DIR.tar.gz" -C /backup/zookeeper "$(basename "$BACKUP_DIR")"

# 5. Clean backups older than 7 days (never the backup root itself)
find /backup/zookeeper -mindepth 1 -mtime +7 -delete
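
A backup is only useful if the archive can be read back. The sketch below wraps the compress step with a verification pass; `make_backup` and the archive naming are assumptions for illustration:

```bash
# Archive the given directories into <dest>/zookeeper-<stamp>.tar.gz and
# verify that the archive is readable before reporting success.
make_backup() (
    dest=$1; stamp=$2; shift 2
    archive="$dest/zookeeper-$stamp.tar.gz"
    mkdir -p "$dest" || exit 1
    tar -czf "$archive" "$@" || exit 1
    # Listing the archive fails if it is truncated or corrupt.
    tar -tzf "$archive" > /dev/null || exit 1
    echo "$archive"
)
```

Usage: `make_backup /backup/zookeeper "$(date +%Y%m%d)" /data/zookeeper/logs /data/zookeeper/data/version-2`. The function prints the archive path on success and returns non-zero otherwise, so a cron wrapper can alert on failure.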

Recovery Process:

bash
# 1. Stop cluster (run on every node)
zkServer.sh stop

# 2. Restore transaction logs
# Move the current version-2 directories aside first so stale files are not mixed in
cp -r /backup/zookeeper/20260120/logs /data/zookeeper/

# 3. Restore snapshot files
cp -r /backup/zookeeper/20260120/version-2 /data/zookeeper/data/

# 4. Start cluster (run on every node)
zkServer.sh start

# 5. Verify data
zkCli.sh -server localhost:2181 ls /

8. Troubleshooting

Common Troubleshooting Steps:

1. Node Cannot Start:

bash
# Check logs
tail -100 /data/zookeeper/logs/zookeeper.log

# Check port usage
netstat -tlnp | grep 2181

# Check configuration file
cat /opt/zookeeper/conf/zoo.cfg

# Check myid file
cat /data/zookeeper/data/myid
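
A common cause of startup failure is a myid value with no matching `server.<id>` line in zoo.cfg. The manual checks above can be combined into one test; `check_myid` is a hypothetical helper, not a ZooKeeper tool:

```bash
# Succeeds when the id stored in the myid file has a matching
# server.<id>= line in the given zoo.cfg.
check_myid() (
    cfg=$1; myid_file=$2
    [ -r "$myid_file" ] || { echo "missing myid file: $myid_file"; exit 1; }
    id=$(tr -d '[:space:]' < "$myid_file")
    if grep -q "^server\.$id=" "$cfg"; then
        echo "myid $id matches a server entry"
    else
        echo "myid $id has no server.$id entry in $cfg"
        exit 1
    fi
)
```

Usage: `check_myid /opt/zookeeper/conf/zoo.cfg /data/zookeeper/data/myid`.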

2. Cluster Election Failure:

bash
# Check network connectivity
ping <other-nodes>

# Check firewall
telnet <node> 2888
telnet <node> 3888

# Check node status
echo stat | nc localhost 2181

# Check quorum timing parameters
grep -E '^(tickTime|initLimit|syncLimit)=' /opt/zookeeper/conf/zoo.cfg

3. Performance Degradation:

bash
# Check latency
echo mntr | nc localhost 2181 | grep latency

# Check disk I/O
iostat -x 1

# Check network
sar -n DEV 1

# Check CPU
top

9. Capacity Planning

Capacity Assessment:

bash
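# 1. Assess node count
# Determine cluster scale based on business needs
# Use an odd number of voting nodes; N nodes tolerate (N-1)/2 failures
# Small scale: 3 nodes (tolerates 1 failure)
# Medium scale: 5 nodes (tolerates 2 failures)
# Large scale: 7 nodes (tolerates 3 failures)

# 2. Assess storage needs
# Transaction logs: expected write volume * retention time
# Snapshot files: znode count * average znode size * retention count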

# 3. Assess network bandwidth
# Peak throughput * packet size

# 4. Assess client connections
# Expected client count * concurrent connections
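
The storage formulas above reduce to simple arithmetic. A rough sketch with integer math; the function names and input units are assumptions for this example, and real usage runs higher because transaction logs are preallocated in preAllocSize chunks:

```bash
# Estimated snapshot storage in MiB:
# znode count * average bytes per znode * retained snapshots.
snapshot_storage_mib() (
    znodes=$1; avg_bytes=$2; retained=$3
    echo $(( znodes * avg_bytes * retained / 1024 / 1024 ))
)

# Estimated transaction log storage in MiB:
# writes per second * bytes per write * retention in hours.
txnlog_storage_mib() (
    writes_per_sec=$1; bytes_per_write=$2; retention_hours=$3
    echo $(( writes_per_sec * bytes_per_write * retention_hours * 3600 / 1024 / 1024 ))
)
```

Usage: `snapshot_storage_mib 1048576 1024 3` estimates three retained snapshots of about one million 1 KiB znodes, and `txnlog_storage_mib 1024 1024 1` estimates one hour of logs at 1024 writes per second of 1 KiB each.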

Scaling Process:

bash
# 1. Prepare new node
# Install Zookeeper
# Configure zoo.cfg
# Create myid file

# 2. Update all node configurations
# Add new node to server list on every node

# 3. Start new node, then rolling-restart the existing nodes so they load the new server list
zkServer.sh start

# 4. Wait for data sync
# Monitor sync status

# 5. Verify cluster
echo stat | nc localhost 2181
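
Step 2 requires the same server list on every node. Generating it from one host list avoids typos. A minimal sketch, assuming the default 2888/3888 ports used in this document; `server_lines` is a hypothetical helper:

```bash
# Print server.<id>=<host>:2888:3888 lines, numbering hosts from 1
# in argument order. Append new hosts at the end so existing ids
# (and the myid files that match them) stay unchanged.
server_lines() (
    id=1
    for host in "$@"; do
        echo "server.$id=$host:2888:3888"
        id=$(( id + 1 ))
    done
)
```

Usage: `server_lines zk1 zk2 zk3 zk4 zk5 zk6` prints six entries; the new node zk6 gets id 6, which must also be written to its myid file.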

10. Security Hardening

Security Configuration:

properties
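# 1. Enable authentication
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
# On ZooKeeper 3.6+, the Java system property zookeeper.sessionRequireClientSASLAuth=true
# rejects clients that have not authenticated via SASL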

# 2. Configure ACL
# Specify ACL when creating nodes

# 3. Network isolation
# Use firewall to restrict access
# Use VPN or dedicated network

# 4. Log auditing
# Record all operation logs

Security Check:

bash
# 1. Check ACL configuration
zkCli.sh -server localhost:2181
getAcl /

# 2. Check authentication status
echo envi | nc localhost 2181 | grep -E "auth"

# 3. Check network connections
netstat -tlnp | grep 2181

# 4. Check log auditing
grep "auth" /data/zookeeper/logs/zookeeper.log

11. Operations Automation

Automation Scripts:

bash
#!/bin/bash
# 1. Health check script
for node in zk1 zk2 zk3 zk4 zk5; do
    status=$(echo stat | nc -w 2 "$node" 2181 | grep -E "Mode")
    echo "$node: ${status:-no response}"
done

# 2. Auto backup script
# See backup strategy section

#!/bin/bash
# 3. Auto cleanup script
# Keep the 3 most recent snapshots and their transaction logs.
# Deleting snapshot files by age can remove the only snapshot on a quiet cluster,
# so use zkCleanup.sh (or autopurge) instead.
zkCleanup.sh -n 3

#!/bin/bash
# 4. Monitoring script
# Monitor latency
latency=$(echo mntr | nc -w 2 localhost 2181 | awk '$1 == "zk_avg_latency" {print $2}')
if [ -n "$latency" ] && [ "$(echo "$latency > 10" | bc)" -eq 1 ]; then
    echo "High latency: $latency"
fi
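
The health check prints each node's mode but leaves the verdict to the reader. A sketch that turns the modes into a pass/fail result, treating the cluster as healthy when there is exactly one leader and a majority of voters respond; `ensemble_health` is a hypothetical name:

```bash
# Read one mode per line on stdin (leader, follower, or anything else
# for a node that is down or not voting) and report cluster health.
ensemble_health() (
    expected_voters=$1
    leaders=0; voters=0
    while read -r mode; do
        case $mode in
            leader)   leaders=$(( leaders + 1 )); voters=$(( voters + 1 )) ;;
            follower) voters=$(( voters + 1 )) ;;
        esac
    done
    quorum=$(( expected_voters / 2 + 1 ))
    if [ "$leaders" -eq 1 ] && [ "$voters" -ge "$quorum" ]; then
        echo "OK leaders=$leaders voters=$voters quorum=$quorum"
    else
        echo "CRITICAL leaders=$leaders voters=$voters quorum=$quorum"
        exit 1
    fi
)
```

Usage: collect one mode per node, for example with `echo srvr | nc -w 2 "$node" 2181 | awk '/^Mode:/ { print $2 }'` inside the node loop, and pipe the combined output to `ensemble_health 5`. The non-zero exit status makes it easy to call from cron or a monitoring agent.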

12. Operations Documentation

Documentation Checklist:

Deployment documentation
Configuration documentation
Monitoring documentation
Troubleshooting documentation
Backup and recovery documentation
Security documentation
Change records
Contact information

Change Management:

Change request
Change review
Change implementation
Change verification
Change record