Answer
ZooKeeper best practices span architecture design, application development, and operations. Following them helps you build stable, efficient distributed systems.
1. Architecture Design Best Practices
Cluster Scale Selection:
- 3 nodes: Suitable for small-scale applications, allows 1 node failure
- 5 nodes: Recommended for production, allows 2 node failures
- 7 nodes: Large-scale applications, allows 3 node failures
- Avoid an even number of nodes: an extra node raises the quorum size without tolerating more failures (4 nodes still allow only 1 failure)
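The failure tolerances listed above follow from majority-quorum arithmetic: an ensemble stays available only while more than half of its voting servers are up. A minimal sketch of that arithmetic (`QuorumMath` is a hypothetical helper name, not part of ZooKeeper):

```java
// Minimal sketch: majority-quorum arithmetic for an ensemble of n voting servers.
// QuorumMath is a hypothetical helper, not a ZooKeeper class.
final class QuorumMath {
    /** Smallest number of servers that forms a strict majority of n. */
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    /** Number of servers that can fail while a majority is still reachable. */
    static int toleratedFailures(int n) {
        return n - quorumSize(n);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            System.out.printf("n=%d quorum=%d tolerates=%d%n",
                    n, quorumSize(n), toleratedFailures(n));
        }
    }
}
```

Going from 3 to 4 servers raises the quorum from 2 to 3 but still tolerates only one failure, which is why odd sizes are the usual choice.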
Node Deployment Strategy:
```bash
# 1. Cross-availability zone deployment
#    Avoid single point of failure
#    Improve disaster recovery capability
# 2. Network isolation
#    Use dedicated network
#    Reduce network latency
# 3. Resource isolation
#    Independent servers
#    Avoid resource contention
```
Storage Separation:
```properties
# Transaction logs: dedicated high-performance disk (SSD recommended)
dataLogDir=/data/zookeeper/logs
# Data snapshots: regular disk (HDD acceptable)
dataDir=/data/zookeeper/data
```
2. Data Model Design Best Practices
Node Naming Conventions:
```text
// Use clear namespace
/app/{service-name}/{environment}/{component}

// Examples
/app/payment/prod/config
/app/order/dev/leader
/app/user/test/locks
```
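One way to keep paths consistent with this convention is to build them through a single helper instead of concatenating strings at each call site. The sketch below assumes a naming rule of lowercase letters, digits, `-`, `.` and `_` per segment; both that rule and the `ZkPaths` name are illustrative assumptions, not ZooKeeper requirements.

```java
import java.util.regex.Pattern;

// Minimal sketch: builds /app/{service-name}/{environment}/{component} paths.
// ZkPaths and the segment rule below are assumptions made for illustration.
final class ZkPaths {
    // Assumed rule: starts with a lowercase letter or digit, then letters, digits, '.', '_' or '-'
    private static final Pattern SEGMENT = Pattern.compile("[a-z0-9][a-z0-9._-]*");

    static String componentPath(String service, String environment, String component) {
        return "/app/" + check(service) + "/" + check(environment) + "/" + check(component);
    }

    // Rejects null, empty, or malformed segments (for example ones containing '/')
    private static String check(String segment) {
        if (segment == null || !SEGMENT.matcher(segment).matches()) {
            throw new IllegalArgumentException("Invalid path segment: " + segment);
        }
        return segment;
    }
}
```

For example, `ZkPaths.componentPath("payment", "prod", "config")` yields `/app/payment/prod/config`.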
Node Hierarchy Design:
- Keep the hierarchy shallow (recommended < 5 levels)
- Avoid too many children under one node (recommended < 1000), since the full child list is returned in a single response
- Group related nodes under a common parent
Data Size Control:
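```java
// Keep each node's data well under the default limit of about 1 MB (jute.maxbuffer)
// Shard large data across several nodes

// Wrong example
zk.create("/big-data", largeData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT); // Data too large

// Correct example
for (int i = 0; i < chunks; i++) {
    String path = "/data/chunk-" + i;
    zk.create(path, chunkData[i], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
}
```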
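The sharding step above needs the payload split into bounded pieces first. A minimal sketch of that split, using only the standard library (`ChunkSplitter`, `chunkPath`, and the zero-padded naming are hypothetical choices for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: splits a payload into chunks no larger than maxChunkBytes.
// ChunkSplitter is a hypothetical helper, not part of the ZooKeeper client.
final class ChunkSplitter {
    static List<byte[]> split(byte[] data, int maxChunkBytes) {
        if (maxChunkBytes <= 0) {
            throw new IllegalArgumentException("maxChunkBytes must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < data.length; start += maxChunkBytes) {
            int end = Math.min(data.length, start + maxChunkBytes);
            chunks.add(Arrays.copyOfRange(data, start, end));
        }
        return chunks;
    }

    // Zero-padded index so that lexicographic order matches chunk order
    static String chunkPath(String parent, int index) {
        return String.format("%s/chunk-%06d", parent, index);
    }
}
```

Writing the chunks with separate `create` calls is not atomic, so readers may need a marker (for example a chunk count written last) to know when the set is complete.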
Node Type Selection:
```java
// Configuration data: persistent node
zk.create("/config", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

// Temporary state: ephemeral node
zk.create("/session/123", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

// Distributed queue: sequential node
zk.create("/queue/item-", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
```
3. Client Usage Best Practices
Connection Management:
```java
// Create one client per application and share it; each client holds a single session
CuratorFramework client = CuratorFrameworkFactory.builder()
    .connectString("localhost:2181")
    .sessionTimeoutMs(30000)
    .connectionTimeoutMs(10000)
    .retryPolicy(new ExponentialBackoffRetry(1000, 3))
    .build();
client.start();

// With the raw client, use try-with-resources to ensure the session is closed
try (ZooKeeper zk = new ZooKeeper(...)) {
    // Use zk
}
```
Exception Handling:
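```java
try {
    zk.create("/path", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
} catch (KeeperException.NodeExistsException e) {
    // Node already exists
    logger.warn("Node already exists");
} catch (KeeperException.ConnectionLossException e) {
    // Connection lost: the create may or may not have been applied,
    // so the retry must tolerate NodeExistsException
    retry();
} catch (KeeperException e) {
    // Any other ZooKeeper error
    logger.error("Create failed", e);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
```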
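The `retry()` call above is left undefined. One possible shape for it is a bounded retry loop with exponential backoff that retries only the exceptions the caller marks as transient. The sketch below is standard-library Java; `Retry` and `withBackoff` are hypothetical names, and in ZooKeeper code the predicate would typically match `KeeperException.ConnectionLossException`.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Minimal sketch: bounded retry with exponential backoff.
// Retry and withBackoff are hypothetical names, not part of ZooKeeper or Curator.
final class Retry {
    static <T> T withBackoff(Callable<T> action, Predicate<Exception> retryable,
                             int maxAttempts, long baseDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    // Restore the interrupt flag and stop retrying
                    Thread.currentThread().interrupt();
                    throw e;
                }
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e; // Out of attempts, or not a transient failure
                }
                // Delay doubles on each attempt: base, 2*base, 4*base, ...
                Thread.sleep(baseDelayMs << (attempt - 1));
            }
        }
    }
}
```

Because a retried operation may already have succeeded on the server, the action passed in should be idempotent or should treat "already exists" as success.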
Watcher Usage:
```java
// Watches are one-shot: re-register after every event
zk.getData("/path", new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        // Handle event
        handleEvent(event);
        // Re-register
        try {
            zk.getData("/path", this, null);
        } catch (Exception e) {
            logger.error("Failed to re-register watcher", e);
        }
    }
}, null);

// Avoid time-consuming operations in the Watcher callback
zk.getData("/path", event -> {
    // Hand the event off to an executor
    executor.submit(() -> {
        processEvent(event);
    });
}, null);
```
4. Distributed Lock Best Practices
Lock Implementation:
```java
// Use Curator's distributed lock
InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-lock");
try {
    // Acquire lock (with timeout)
    if (lock.acquire(10, TimeUnit.SECONDS)) {
        try {
            // Execute business logic
            doSomething();
        } finally {
            // Release lock
            lock.release();
        }
    } else {
        logger.warn("Timed out waiting for lock");
    }
} catch (Exception e) {
    logger.error("Failed to acquire lock", e);
}
```
Lock Considerations:
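- Set an acquisition timeout so callers never block indefinitely
- Always release the lock in a finally block
- Acquire multiple locks in a consistent order to avoid deadlock
- Account for reentrancy: InterProcessMutex is reentrant within a thread, and every acquire needs a matching release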
5. Configuration Center Best Practices
Configuration Storage:
```text
// Configuration path design
/app/{service}/{env}/{key}

// Examples
/app/payment/prod/database.url
/app/payment/prod/database.username

// Configuration version control
/app/payment/prod/config.v1
/app/payment/prod/config.v2
```
Configuration Update:
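```java
// Use a Watcher to monitor configuration changes
zk.getData("/config", event -> {
    if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
        // Reload configuration (and re-register the watch, which fires only once)
        reloadConfig();
    }
}, null);

// Use the version number for a conditional (compare-and-set) update
Stat stat = new Stat();
zk.getData("/config", false, stat);
// Throws KeeperException.BadVersionException if another client updated the node first
zk.setData("/config", newData, stat.getVersion());
```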
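A versioned update can fail when another writer gets in first, so callers usually wrap it in a read-modify-write loop. The sketch below illustrates that loop against an in-memory stand-in for a single node; `VersionedNode` is a hypothetical class that only mimics the version check, where the real client would signal a conflict with `KeeperException.BadVersionException`.

```java
import java.util.function.UnaryOperator;

// Minimal sketch: optimistic read-modify-write with a version check.
// VersionedNode is an in-memory stand-in, not a ZooKeeper API.
final class VersionedUpdateDemo {
    static final class VersionedNode {
        private byte[] data;
        private int version;

        VersionedNode(byte[] initial) {
            this.data = initial.clone();
        }

        /** Returns the data and stores the current version in versionOut[0]. */
        synchronized byte[] read(int[] versionOut) {
            versionOut[0] = version;
            return data.clone();
        }

        /** Writes only if expectedVersion is current; returns false on a stale version. */
        synchronized boolean write(byte[] newData, int expectedVersion) {
            if (expectedVersion != version) {
                return false;
            }
            data = newData.clone();
            version++;
            return true;
        }
    }

    /** Re-reads and retries until the write succeeds; returns the number of attempts used. */
    static int update(VersionedNode node, UnaryOperator<byte[]> change, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int[] version = new int[1];
            byte[] current = node.read(version);
            if (node.write(change.apply(current), version[0])) {
                return attempt;
            }
        }
        throw new IllegalStateException("Gave up after " + maxAttempts + " conflicting updates");
    }
}
```

The important property is that each retry recomputes the new value from freshly read data rather than resubmitting the stale value.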
6. Service Registration and Discovery Best Practices
Service Registration:
```java
// Register when the service starts; the ephemeral node disappears when the session ends
String servicePath = "/services/" + serviceName + "/" + instanceId;
String instanceData = JSON.toJSONString(instanceInfo);
zk.create(servicePath, instanceData.getBytes(StandardCharsets.UTF_8), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
```
Service Discovery:
```java
// Get service instance list
String servicePath = "/services/" + serviceName;
List<String> instances = zk.getChildren(servicePath, event -> {
    // Re-fetch when service instances change
    discoverServices();
});

// Load balancing
String selectedInstance = loadBalance(instances);
```
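The `loadBalance` call above is not defined in the snippet. One simple policy it could implement is round-robin over the current instance list, sketched here with the standard library only (`RoundRobinSelector` is a hypothetical name, and other policies such as random or weighted selection would fit the same slot):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch: thread-safe round-robin selection over discovered instances.
// RoundRobinSelector is a hypothetical helper, not part of ZooKeeper.
final class RoundRobinSelector {
    private final AtomicInteger next = new AtomicInteger();

    String select(List<String> instances) {
        if (instances == null || instances.isEmpty()) {
            throw new IllegalStateException("No live instances");
        }
        // floorMod keeps the index non-negative even after the counter overflows
        int index = Math.floorMod(next.getAndIncrement(), instances.size());
        return instances.get(index);
    }
}
```

The selector takes the list on every call, so it keeps working when the watch callback replaces the list after instances join or leave.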
7. Performance Optimization Best Practices
Batch Operations:
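```java
// Use multi() to apply several operations atomically in one round trip
List<Op> ops = new ArrayList<>();
ops.add(Op.create("/path1", data1, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
ops.add(Op.create("/path2", data2, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
ops.add(Op.setData("/path3", data3, -1)); // -1 matches any version
zk.multi(ops);
```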
Read Optimization:
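```java
// Use Observer nodes to handle read requests
// They scale reads and reduce Leader load without enlarging the voting quorum

// Call sync() before a read that must see the latest committed write
zk.sync("/path", (rc, path, ctx) -> {
    if (rc != KeeperException.Code.OK.intValue()) {
        logger.warn("sync failed for " + path + ", rc=" + rc);
        return;
    }
    // Read asynchronously so the event thread is not blocked
    zk.getData(path, false, (readRc, readPath, readCtx, data, stat) -> {
        // Use data and stat
    }, null);
}, null);
```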
Connection Optimization:
```java
// Reuse a small, fixed number of clients (usually one per process)
// Avoid frequent creation and destruction of connections
// Use long-lived connections
// Reduce TCP handshake overhead
```
8. Security Best Practices
ACL Configuration:
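```java
// Set ACLs when creating nodes
// The digest scheme expects "user:base64(sha1(user:password))", not the plain password
String readerId = DigestAuthenticationProvider.generateDigest("user:password");
String adminId = DigestAuthenticationProvider.generateDigest("admin:admin");
List<ACL> acls = new ArrayList<>();
acls.add(new ACL(ZooDefs.Perms.READ, new Id("digest", readerId)));
acls.add(new ACL(ZooDefs.Perms.ALL, new Id("digest", adminId)));
zk.create("/secure", data, acls, CreateMode.PERSISTENT);
```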
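For the `digest` scheme, the ACL id stored on the node is not the plain `user:password` string but the user name followed by a Base64-encoded hash of it. The sketch below reproduces the default SHA-1 form with the standard library, to show what the stored id looks like. `DigestIds` is a hypothetical helper; in real code, ZooKeeper's own `DigestAuthenticationProvider.generateDigest` is the safer choice, and the assumption here is that the server uses its default digest algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Minimal sketch: computes "user:base64(sha1(user:password))" for a digest ACL id.
// DigestIds is a hypothetical helper; it assumes the default SHA-1 digest.
final class DigestIds {
    static String digestId(String userAndPassword) {
        int colon = userAndPassword.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Expected user:password");
        }
        String user = userAndPassword.substring(0, colon);
        try {
            // The hash covers the whole "user:password" string
            byte[] sha1 = MessageDigest.getInstance("SHA-1")
                    .digest(userAndPassword.getBytes(StandardCharsets.UTF_8));
            return user + ":" + Base64.getEncoder().encodeToString(sha1);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Clients still authenticate with the plain `user:password` via `addAuthInfo`; only the ACL entry carries the hashed form.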
Authentication Configuration:
```java
// Add authentication info
zk.addAuthInfo("digest", "username:password".getBytes(StandardCharsets.UTF_8));

// Use SASL authentication
System.setProperty("java.security.auth.login.config", "jaas.conf");
```
9. Monitoring Best Practices
Key Metrics Monitoring:
```bash
# Four-letter-word commands must be enabled via 4lw.commands.whitelist (ZooKeeper 3.5+)

# 1. Latency metrics
echo mntr | nc localhost 2181 | grep latency

# 2. Throughput metrics
echo mntr | nc localhost 2181 | grep packets

# 3. Connection count metrics
echo cons | nc localhost 2181 | wc -l

# 4. Watcher count
echo wchs | nc localhost 2181
```
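When these metrics feed a monitoring agent rather than a shell, the `mntr` output can be parsed directly: each line is a metric name followed by whitespace and a value. A minimal parsing sketch (`MntrParser` is a hypothetical name; fetching the text over the socket is left out):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: parses "name<whitespace>value" lines as printed by the mntr command.
// MntrParser is a hypothetical helper, not part of ZooKeeper.
final class MntrParser {
    static Map<String, String> parse(String output) {
        Map<String, String> metrics = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            // Split into at most two parts so values containing spaces stay intact
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2) {
                metrics.put(parts[0], parts[1]);
            }
        }
        return metrics;
    }
}
```

Values are kept as strings because not every metric is numeric (the version line, for example); callers can convert the ones they alert on.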
Alert Configuration:
```yaml
# Latency alert
- alert: ZookeeperHighLatency
  expr: zookeeper_avg_latency > 10
  for: 5m

# Connection count alert
- alert: ZookeeperHighConnections
  expr: zookeeper_num_alive_connections > 1000
  for: 5m
```
10. Backup and Recovery Best Practices
Regular Backup:
```bash
#!/bin/bash
# Daily backup
set -euo pipefail

BACKUP_ROOT=/backup/zookeeper
BACKUP_NAME=$(date +%Y%m%d)
BACKUP_DIR="$BACKUP_ROOT/$BACKUP_NAME"
mkdir -p "$BACKUP_DIR"

# Back up transaction logs
cp -r /data/zookeeper/logs "$BACKUP_DIR/"

# Back up snapshot files
cp -r /data/zookeeper/data/version-2 "$BACKUP_DIR/"

# Compress the backup, then remove the uncompressed copy
tar -czf "$BACKUP_DIR.tar.gz" -C "$BACKUP_ROOT" "$BACKUP_NAME"
rm -rf "$BACKUP_DIR"

# Clean old backups (keep 7 days)
find "$BACKUP_ROOT" -name '*.tar.gz' -mtime +7 -delete
```
Recovery Verification:
```bash
# 1. Verify backup in test environment
# 2. Regular recovery drills
# 3. Record recovery steps
# 4. Update recovery documentation
```
11. Version Management Best Practices
Version Selection:
- Use the current stable release line
- Apply security patches promptly
- Test in a staging environment before upgrading
- Upgrade with a rolling strategy, one server at a time
Upgrade Process:
```bash
# 1. Back up data
# 2. Verify in test environment
# 3. Rolling upgrade of Followers
# 4. Upgrade the Leader last
# 5. Verify cluster status
```
12. Fault Handling Best Practices
Fault Plan:
- Develop detailed fault handling process
- Regular fault drills
- Establish emergency response mechanism
- Record fault handling experience
Quick Recovery:
```bash
# 1. Quickly locate problem
# 2. Switch to backup node
# 3. Recover data
# 4. Verify service
# 5. Analyze root cause
```
13. Development Standards
Code Standards:
```java
// 1. Unified exception handling
// 2. Comprehensive logging
// 3. Reasonable retry mechanism
// 4. Proper resource release
```
Testing Standards:
```java
// 1. Unit testing
// 2. Integration testing
// 3. Stress testing
// 4. Fault testing
```
14. Documentation Standards
Required Documentation:
- Architecture design documentation
- API documentation
- Operations manual
- Troubleshooting guide
- Change records
15. Team Collaboration
Knowledge Sharing:
- Regular technical sharing
- Build knowledge base
- Code review
- Best practices summary