What are ZooKeeper best practices? How should you design the architecture and data model?

February 21, 16:24

Answer

ZooKeeper best practices span architecture design, client development, and operations. Following them helps you build stable, efficient distributed systems.

1. Architecture Design Best Practices

Cluster Scale Selection:

  • 3 nodes: Suitable for small-scale applications, allows 1 node failure
  • 5 nodes: Recommended for production, allows 2 node failures
  • 7 nodes: Large-scale applications, allows 3 node failures
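  • Avoid an even number of nodes: a majority quorum is still required, so an even-sized ensemble tolerates no more failures than the next smaller odd size (4 nodes tolerate 1 failure, the same as 3)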
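
The failure tolerances listed above follow from majority-quorum arithmetic. A minimal sketch, assuming a plain majority quorum; `QuorumMath` is a hypothetical helper for illustration, not part of the ZooKeeper API:

```java
// Quorum arithmetic for a majority-based ensemble (illustrative helper, not a ZooKeeper API).
final class QuorumMath {
    private QuorumMath() {}

    /** Smallest number of voters that forms a strict majority. */
    static int quorumSize(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensembleSize must be >= 1");
        }
        return ensembleSize / 2 + 1;
    }

    /** Number of voters that can fail while a majority still remains. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            System.out.println(n + " nodes: quorum=" + quorumSize(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

Going from 3 to 4 nodes (or from 5 to 6) raises the quorum size without raising the number of tolerated failures, which is why odd sizes are preferred.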

Node Deployment Strategy:

bash
# 1. Cross-availability-zone deployment
#    Avoid a single point of failure
#    Improve disaster recovery capability
# 2. Network isolation
#    Use a dedicated network
#    Reduce network latency
# 3. Resource isolation
#    Use independent servers
#    Avoid resource contention

Storage Separation:

properties
# Transaction logs: use a high-performance disk (SSD recommended)
dataLogDir=/data/zookeeper/logs
# Data snapshots: a regular disk is enough (HDD acceptable)
dataDir=/data/zookeeper/data

2. Data Model Design Best Practices

Node Naming Conventions:

java
// Use a clear namespace
/app/{service-name}/{environment}/{component}

// Examples
/app/payment/prod/config
/app/order/dev/leader
/app/user/test/locks

Node Hierarchy Design:

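  • Keep the hierarchy shallow (recommended: fewer than 5 levels)
  • Avoid too many children under one node (recommended: fewer than 1000), since getChildren returns the full child list in one response
  • Group related nodes under a common parent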
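
The naming and depth guidelines above can be enforced in one place. A minimal sketch, assuming the `/app/{service-name}/{environment}/{component}` layout from this section; `ZnodePaths` and its limit constant are hypothetical names, not part of any ZooKeeper library:

```java
// Builds and checks znode paths for the /app/{service}/{environment}/{component} layout.
final class ZnodePaths {
    /** "Fewer than 5 levels" from the guideline above. */
    static final int MAX_DEPTH = 4;

    private ZnodePaths() {}

    static String build(String service, String environment, String component) {
        return "/app/" + requireSegment(service)
                + "/" + requireSegment(environment)
                + "/" + requireSegment(component);
    }

    /** Number of levels below the root: "/" is 0, "/app/payment" is 2. */
    static int depth(String path) {
        if (path == null || !path.startsWith("/")) {
            throw new IllegalArgumentException("path must be absolute: " + path);
        }
        return path.equals("/") ? 0 : path.split("/").length - 1;
    }

    static boolean withinDepthLimit(String path) {
        return depth(path) <= MAX_DEPTH;
    }

    // A segment must be non-empty, contain no slash, and not be "." or "..".
    private static String requireSegment(String segment) {
        if (segment == null || segment.isEmpty() || segment.contains("/")
                || segment.equals(".") || segment.equals("..")) {
            throw new IllegalArgumentException("invalid path segment: " + segment);
        }
        return segment;
    }
}
```

For example, `ZnodePaths.build("payment", "prod", "config")` yields `/app/payment/prod/config`, which sits at depth 4 and so stays within the limit.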

Data Size Control:

java
// Keep each znode's data well under the default 1 MB limit (jute.maxbuffer)
// Split large payloads across several znodes

// Wrong: payload too large for a single znode
zk.create("/big-data", largeData, ...);

// Better: one znode per chunk
for (int i = 0; i < chunks; i++) {
    String path = "/data/chunk-" + i;
    zk.create(path, chunkData[i], ...);
}
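
The sharding example assumes the payload has already been split into `chunkData`. One way to produce those chunks is sketched below; `PayloadChunker` and the chunk size are illustrative assumptions, and the code only prepares byte arrays and paths without calling ZooKeeper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Splits a payload into fixed-size chunks suitable for storing one chunk per znode.
final class PayloadChunker {
    private PayloadChunker() {}

    static List<byte[]> split(byte[] payload, int maxChunkBytes) {
        if (maxChunkBytes <= 0) {
            throw new IllegalArgumentException("maxChunkBytes must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < payload.length; start += maxChunkBytes) {
            // The last chunk may be shorter than maxChunkBytes.
            int length = Math.min(payload.length - start, maxChunkBytes);
            chunks.add(Arrays.copyOfRange(payload, start, start + length));
        }
        return chunks;
    }

    /** Path for the chunk at the given index, matching the "/data/chunk-" + i naming above. */
    static String chunkPath(String parent, int index) {
        return parent + "/chunk-" + index;
    }
}
```

A reader needs the chunk count to reassemble the payload, so it is worth storing that count (for instance in the parent znode's data) alongside the chunks.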

Node Type Selection:

java
// Configuration data: persistent node
zk.create("/config", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

// Temporary state: ephemeral node
zk.create("/session/123", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

// Distributed queue: sequential node
zk.create("/queue/item-", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

3. Client Usage Best Practices

Connection Management:

java
// Share one CuratorFramework instance per ZooKeeper cluster; it is thread-safe
CuratorFramework client = CuratorFrameworkFactory.builder()
        .connectString("localhost:2181")
        .sessionTimeoutMs(30000)
        .connectionTimeoutMs(10000)
        .retryPolicy(new ExponentialBackoffRetry(1000, 3))
        .build();
client.start();

// With the raw client, use try-with-resources so the session is always closed
try (ZooKeeper zk = new ZooKeeper(...)) {
    // Use zk
}

Exception Handling:

java
try {
    zk.create("/path", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
} catch (KeeperException.NodeExistsException e) {
    // Node already exists
    logger.warn("Node already exists");
} catch (KeeperException.ConnectionLossException e) {
    // Connection lost: the create may or may not have been applied,
    // so the retry must tolerate NodeExistsException
    retry();
} catch (KeeperException e) {
    // Any other ZooKeeper error
    logger.error("Create failed", e);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
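
The `retry()` call above is left undefined. A minimal sketch of a bounded retry with exponential backoff follows; `BackoffRetry`, its parameters, and the `retryable` predicate are hypothetical and independent of the ZooKeeper client. In real code the predicate would accept only recoverable errors such as connection loss:

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Retries an action a bounded number of times, doubling the delay after each failure.
final class BackoffRetry {
    private BackoffRetry() {}

    static <T> T call(Callable<T> action, Predicate<Exception> retryable,
                      int maxAttempts, long baseDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Never retry an interrupt; restore the flag and propagate.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                if (!retryable.test(e)) {
                    throw e; // Non-recoverable: fail immediately.
                }
                last = e;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMs(baseDelayMs, attempt));
            }
        }
        throw last; // All attempts failed with a retryable error.
    }

    /** Delay before the next attempt: base, 2x base, 4x base, ... (shift capped at 16). */
    static long delayMs(long baseDelayMs, int attempt) {
        return baseDelayMs << Math.min(attempt - 1, 16);
    }
}
```

Because a lost connection does not reveal whether the operation was applied, the retried action should be idempotent or should treat "node already exists" as success.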

Watcher Usage:

java
// Watches fire only once, so re-register after each event
zk.getData("/path", new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        // Handle event
        handleEvent(event);
        // Re-register
        try {
            zk.getData("/path", this, null);
        } catch (Exception e) {
            logger.error("Failed to re-register watcher", e);
        }
    }
}, null);

// Avoid time-consuming work on the event thread
zk.getData("/path", event -> {
    // Hand the event off to an executor
    executor.submit(() -> {
        processEvent(event);
    });
}, null);

4. Distributed Lock Best Practices

Lock Implementation:

java
// Use Curator's distributed lock
InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-lock");
try {
    // Acquire the lock (with timeout)
    if (lock.acquire(10, TimeUnit.SECONDS)) {
        try {
            // Execute business logic
            doSomething();
        } finally {
            // Release the lock
            lock.release();
        }
    } else {
        // Timed out without acquiring the lock
        logger.warn("Timed out waiting for lock");
    }
} catch (Exception e) {
    logger.error("Failed to acquire lock", e);
}

Lock Considerations:

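  • Set a reasonable acquisition timeout
  • Always release the lock in a finally block
  • Avoid deadlock: acquire multiple locks in a consistent order
  • Consider reentrancy: InterProcessMutex is reentrant, so every acquire must be paired with a release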

5. Configuration Center Best Practices

Configuration Storage:

java
// Configuration path design
/app/{service}/{env}/{key}

// Examples
/app/payment/prod/database.url
/app/payment/prod/database.username

// Configuration version control
/app/payment/prod/config.v1
/app/payment/prod/config.v2

Configuration Update:

java
// Use a Watcher to monitor configuration changes
zk.getData("/config", event -> {
    if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
        // Reload configuration (and re-register the watch, since it fires only once)
        reloadConfig();
    }
}, null);

// Conditional update: succeeds only if the version is unchanged,
// otherwise setData throws KeeperException.BadVersionException
Stat stat = new Stat();
zk.getData("/config", false, stat);
zk.setData("/config", newData, stat.getVersion());
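
The version-based update works like a compare-and-set. The sketch below models that behavior with an in-memory stand-in; `VersionedValue` is hypothetical and is not the ZooKeeper client. It returns false on a conflict, where ZooKeeper would throw a bad-version error, and it does not model the "match any version" value -1:

```java
// In-memory model of a value guarded by a version counter (illustration only).
final class VersionedValue {
    private byte[] data;
    private int version; // Starts at 0 and increases by 1 on every successful write.

    VersionedValue(byte[] initial) {
        this.data = initial.clone();
    }

    synchronized int version() {
        return version;
    }

    synchronized byte[] data() {
        return data.clone();
    }

    /** Applies the write only if expectedVersion matches the current version. */
    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false; // Someone else updated the value first.
        }
        data = newData.clone();
        version++;
        return true;
    }
}
```

A writer that loses the race should re-read the value and its version, reapply its change, and try again, rather than overwriting blindly.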

6. Service Registration and Discovery Best Practices

Service Registration:

java
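// Register when the service starts (the parent path must already exist)
String servicePath = "/services/" + serviceName + "/" + instanceId;
String instanceData = JSON.toJSONString(instanceInfo);
// The ephemeral node is removed automatically when the session ends
zk.create(servicePath, instanceData.getBytes(StandardCharsets.UTF_8),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);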

Service Discovery:

java
// Get the service instance list
String servicePath = "/services/" + serviceName;
List<String> instances = zk.getChildren(servicePath, event -> {
    // Re-fetch when the instance list changes
    discoverServices();
});

// Load balancing
String selectedInstance = loadBalance(instances);
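
The `loadBalance` call above is not defined in this article. One simple possibility is round-robin selection, sketched here under that assumption; `RoundRobinSelector` is a hypothetical class, and other strategies (random, weighted, least-connections) would fit the same call site:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Picks instances in rotation; safe to call from multiple threads.
final class RoundRobinSelector {
    private final AtomicInteger counter = new AtomicInteger();

    String select(List<String> instances) {
        if (instances == null || instances.isEmpty()) {
            throw new IllegalStateException("no instances available");
        }
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), instances.size());
        return instances.get(index);
    }
}
```

Because the instance list can change between calls, the selector takes the current list on every call instead of caching it.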

7. Performance Optimization Best Practices

Batch Operations:

java
// Use multi to apply several operations atomically in one round trip
List<Op> ops = new ArrayList<>();
ops.add(Op.create("/path1", data1, ...));
ops.add(Op.create("/path2", data2, ...));
ops.add(Op.setData("/path3", data3, ...));
zk.multi(ops);

Read Optimization:

java
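// Use Observer nodes to serve read requests without adding voting members
// Reduce Leader load

// Call sync() before a read that must see the latest committed state
zk.sync("/path", (rc, path, ctx) -> {
    // Read asynchronously so the event thread is not blocked
    // (handleData is a placeholder for your own handler)
    zk.getData(path, false, (rc2, p, c, data, stat) -> handleData(data), null);
}, null);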

Connection Optimization:

java
// Size the connection pool sensibly
// Avoid creating and destroying connections frequently
// Use long-lived connections
// Reduce TCP handshake overhead

8. Security Best Practices

ACL Configuration:

java
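// Set ACLs when creating nodes
List<ACL> acls = new ArrayList<>();
// A digest id has the form user:base64(sha1(user:password)), not the plain password
String digest = DigestAuthenticationProvider.generateDigest("user:password");
acls.add(new ACL(ZooDefs.Perms.READ, new Id("digest", digest)));
// The auth scheme takes an empty id and applies to the identity already
// added to this session with addAuthInfo
acls.add(new ACL(ZooDefs.Perms.ALL, new Id("auth", "")));
zk.create("/secure", data, acls, CreateMode.PERSISTENT);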
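
With the digest scheme, the id stored in an ACL is derived from the credentials rather than being the plain `user:password` string. A minimal sketch of that derivation, assuming the form `user:base64(sha1(user:password))`; `DigestId` is a hypothetical helper, and in real code the server library's `DigestAuthenticationProvider.generateDigest` is the safer choice:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Derives a digest-scheme ACL id from "user:password" (illustrative helper).
final class DigestId {
    private DigestId() {}

    static String generate(String userColonPassword) {
        int colon = userColonPassword.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected user:password");
        }
        String user = userColonPassword.substring(0, colon);
        try {
            // Hash the whole "user:password" string, then Base64-encode the 20-byte result.
            byte[] sha1 = MessageDigest.getInstance("SHA-1")
                    .digest(userColonPassword.getBytes(StandardCharsets.UTF_8));
            return user + ":" + Base64.getEncoder().encodeToString(sha1);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The client still authenticates with the plain credentials through `addAuthInfo`; only the ACL entry carries the hashed form.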

Authentication Configuration:

java
// Add authentication info to the session
zk.addAuthInfo("digest", "username:password".getBytes());

// Use SASL authentication: point the JVM at a JAAS config before creating the client
System.setProperty("java.security.auth.login.config", "jaas.conf");

9. Monitoring Best Practices

Key Metrics Monitoring:

bash
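# Note: on ZooKeeper 3.5.3+ these four-letter commands must be enabled
# through 4lw.commands.whitelist in the server configuration

# 1. Latency metrics
echo mntr | nc localhost 2181 | grep latency

# 2. Throughput metrics
echo mntr | nc localhost 2181 | grep packets

# 3. Connection count metrics
echo cons | nc localhost 2181 | wc -l

# 4. Watcher count
echo wchs | nc localhost 2181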

Alert Configuration:

yaml
# Latency alert
- alert: ZookeeperHighLatency
  expr: zookeeper_avg_latency > 10
  for: 5m

# Connection count alert
- alert: ZookeeperHighConnections
  expr: zookeeper_num_alive_connections > 1000
  for: 5m

10. Backup and Recovery Best Practices

Regular Backup:

bash
#!/bin/bash
set -euo pipefail

# Daily backup
BACKUP_ROOT=/backup/zookeeper
BACKUP_DIR="$BACKUP_ROOT/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Back up transaction logs
cp -r /data/zookeeper/logs "$BACKUP_DIR/"

# Back up snapshot files
cp -r /data/zookeeper/data/version-2 "$BACKUP_DIR/"

# Compress the backup, then remove the uncompressed copy
tar -czf "$BACKUP_DIR.tar.gz" -C "$BACKUP_ROOT" "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

# Clean old backups (keep 7 days)
find "$BACKUP_ROOT" -name '*.tar.gz' -mtime +7 -delete

Recovery Verification:

bash
# 1. Verify backups in a test environment
# 2. Run regular recovery drills
# 3. Record the recovery steps
# 4. Keep the recovery documentation up to date

11. Version Management Best Practices

Version Selection:

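  • Use a stable release line that still receives fixes (ZooKeeper labels releases as stable or current rather than LTS)
  • Apply security patches promptly
  • Test in a staging environment before upgrading
  • Upgrade with a rolling strategy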

Upgrade Process:

bash
# 1. Back up data
# 2. Verify in a test environment
# 3. Rolling-upgrade the Followers
# 4. Upgrade the Leader last
# 5. Verify cluster status

12. Fault Handling Best Practices

Fault Plan:

  • Develop a detailed fault-handling process
  • Run regular fault drills
  • Establish an emergency response mechanism
  • Record lessons learned from each fault

Quick Recovery:

bash
# 1. Quickly locate the problem
# 2. Switch to a backup node
# 3. Recover data
# 4. Verify the service
# 5. Analyze the root cause

13. Development Standards

Code Standards:

java
// 1. Unified exception handling
// 2. Comprehensive logging
// 3. Reasonable retry mechanism
// 4. Proper resource release

Testing Standards:

java
// 1. Unit testing
// 2. Integration testing
// 3. Stress testing
// 4. Fault testing

14. Documentation Standards

Required Documentation:

  1. Architecture design documentation
  2. API documentation
  3. Operations manual
  4. Troubleshooting guide
  5. Change records

15. Team Collaboration

Knowledge Sharing:

  • Regular technical sharing
  • Build knowledge base
  • Code review
  • Best practices summary
Tags: Zookeeper