Node Role Separation and Configuration
In Elasticsearch, proper allocation of node roles (such as master, data, and coordinating) is crucial for avoiding single points of failure and resource wastage. Master nodes manage cluster metadata, data nodes store index data, and coordinating nodes handle client requests. Incorrect role assignment can lead to performance bottlenecks or data loss.
Configuration Principles:
- Strictly separate roles: in production, run at least 3 dedicated master-eligible nodes (an odd number avoids split-brain scenarios), and keep data nodes separate from coordinating nodes.
- Configure roles via `elasticsearch.yml`:

```yaml
# Example: dedicated data node (no master role)
node.roles: [data, ingest]

# Example: dedicated master-eligible node (run 3 of these)
node.roles: [master]
```
- Practical Recommendations: enable `xpack.security.enabled: true` for security, and avoid assigning all roles to a single node. Monitor metrics including cluster health status (`GET /_cluster/health`) and per-node load (`GET /_nodes/stats`).
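As a quick sanity check on role separation, the output of `GET /_cat/nodes?h=name,node.role` can be scanned for nodes that mix master duties with data or ingest work. A minimal sketch in Python, using an illustrative sample response rather than a real cluster:

```python
# Illustrative `GET /_cat/nodes?h=name,node.role` output (not a real cluster);
# role letters: m = master, d = data, i = ingest, `-` = coordinating-only.
cat_nodes_sample = """\
es-master-1 m
es-master-2 m
es-master-3 m
es-data-1   di
es-data-2   di
es-coord-1  -
"""

def check_role_separation(cat_nodes_text):
    """Return names of nodes that combine the master role with data/ingest."""
    mixed = []
    for line in cat_nodes_text.strip().splitlines():
        name, roles = line.split()
        if "m" in roles and any(r in roles for r in "di"):
            mixed.append(name)
    return mixed

print(check_role_separation(cat_nodes_sample))  # [] means roles are cleanly separated
```

An empty result means no node is pulling double duty; a non-empty list names nodes whose roles should be split.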
Shard and Replica Optimization
Shards split indices into parallel units, and replicas provide redundancy. Incorrect configuration can lead to performance degradation or data unavailability.
- Key Parameters:
  - `number_of_shards`: 3-5 is a sensible starting point (too few causes hotspots, too many increases overhead).
  - `number_of_replicas`: set to 1 or 2 in production (0 leaves a single point of failure).
  - Shard size: a single shard should not exceed 50GB (refer to the Elasticsearch official Shard Size Guidelines).
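The sizing guidance above can be turned into a rough starting-point calculation. This is an illustrative helper based on the rules of thumb (shards under ~50GB, at least 3 shards to spread load), not an official formula:

```python
import math

def suggest_shard_count(expected_size_gb, max_shard_gb=50, min_shards=3):
    """Rough starting point for number_of_shards: cap each shard at
    ~50GB and use at least 3 shards to spread load. Illustrative only;
    validate against real query and indexing patterns."""
    needed = math.ceil(expected_size_gb / max_shard_gb)
    return max(needed, min_shards)

print(suggest_shard_count(120))  # 3  (three ~40GB shards)
print(suggest_shard_count(400))  # 8  (50GB cap forces more than 5 shards)
```

For very large indices the 3-5 guideline gives way to the per-shard size cap, which is why the second call suggests 8.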
- Configuration Example:

```json
PUT /logs_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.refresh_interval": "30s"
  }
}
```

  Raising `index.refresh_interval` above the 1s default reduces refresh frequency and improves write performance.
- Practical Recommendations:
  - Set `index.codec: best_compression` for storage-heavy indices to save disk space.
  - Adjust replicas dynamically via the index settings API:

```json
PUT /logs_index/_settings
{
  "index": { "number_of_replicas": 2 }
}
```

  - Avoid creating too many indices on a single node (exceeding 100 can cause performance issues).
Index Lifecycle Management (ILM)
Index Lifecycle Management is central to scaling strategies. Unmanaged indices can lead to storage explosion and query latency.
- Best Practices:
  - Phase Division:
    - Hot phase: active data with high write volume; write through a rollover alias set via `index.lifecycle.rollover_alias`.
    - Warm phase: data with reduced query frequency; rolled-over indices move here onto cheaper hardware.
    - Cold phase: read-only data; migrate to low-cost nodes.
- Configuration Example:

```json
PUT /_ilm/policy/log_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}
```
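The rollover trigger in a policy like the one above fires when either threshold is crossed (OR semantics). ILM evaluates this server-side; the following is only a minimal sketch of the logic for clarity:

```python
def should_rollover(index_size_gb, index_age_days, max_size_gb=50, max_age_days=30):
    """Mirror of ILM rollover semantics: roll over when EITHER the size
    or the age threshold is reached. Sketch only; ILM does this itself."""
    return index_size_gb >= max_size_gb or index_age_days >= max_age_days

print(should_rollover(12, 31))  # True: age threshold reached
print(should_rollover(60, 1))   # True: size threshold reached
print(should_rollover(12, 5))   # False: neither threshold reached
```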
- Scaling Strategies:
  - Use ILM to roll indices over automatically, avoiding manual management.
  - Monitor indexing-rate metrics; trigger scaling when write volume exceeds thresholds.
  - Practical Recommendations: combine with Kibana's Lens tool to analyze index distribution and ensure data balance.
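The indexing-rate check can be derived from two samples of the monotonically increasing `indexing.index_total` counter in `GET /_nodes/stats`. A minimal sketch with hypothetical sample values:

```python
def indexing_rate(prev_total, curr_total, interval_s):
    """Docs indexed per second between two `_nodes/stats` samples.
    indexing.index_total is a monotonically increasing counter."""
    return (curr_total - prev_total) / interval_s

# Hypothetical counter samples taken 60 seconds apart
rate = indexing_rate(prev_total=1_200_000, curr_total=1_500_000, interval_s=60)
print(rate)  # 5000.0 docs/s

SCALE_THRESHOLD = 4000  # illustrative threshold, tune per workload
print(rate > SCALE_THRESHOLD)  # True: consider scaling out
```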
Cluster Scaling and Balancing
Horizontal scaling requires careful execution to avoid data skew.
- Scaling Steps:
  - Add new nodes:

```bash
# Ensure the new node's configuration (elasticsearch.yml) matches the cluster,
# then make sure shard allocation is enabled so shards can migrate to it:
curl -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
```
  - Monitor balancing: use `GET /_cat/shards?v` to confirm shard distribution.
- Avoid Issues:
  - Adding too many nodes at once can cause shard migration storms.
  - Ensure new nodes have similar hardware (CPU/RAM/SSD) to existing nodes.
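To confirm shard distribution after scaling, the `GET /_cat/shards?h=index,shard,prirep,node` output can be tallied per node. A minimal sketch using an illustrative sample response, not a real cluster:

```python
from collections import Counter

# Illustrative `GET /_cat/shards?h=index,shard,prirep,node` output
cat_shards_sample = """\
logs_index 0 p node_1
logs_index 0 r node_2
logs_index 1 p node_1
logs_index 1 r node_3
logs_index 2 p node_2
logs_index 2 r node_3
"""

def shards_per_node(cat_shards_text):
    """Count shards hosted by each node; large imbalances suggest skew."""
    counts = Counter()
    for line in cat_shards_text.strip().splitlines():
        counts[line.split()[-1]] += 1
    return dict(counts)

print(shards_per_node(cat_shards_sample))  # {'node_1': 2, 'node_2': 2, 'node_3': 2}
```

An even count per node, as here, indicates balanced allocation; a node with far fewer shards may not be receiving relocations.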
- Performance Optimization:
  - Keep `index.requests.cache.enable: true` (the default) on query-heavy indices to improve cache hit rates.
  - Keep `cluster.routing.allocation.enable: all` (the default) to allow automatic rebalancing.
  - Practical Recommendations: use the cluster reroute API to manually adjust shard locations:
```json
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "logs_index",
        "shard": 0,
        "node": "node_3"
      }
    }
  ]
}
```
Monitoring and Alerting System
Real-time monitoring is essential for successful scaling.
- Core Tools:
  - Kibana: visualize cluster health (`GET /_cluster/health`); monitor metrics including `status` (green/yellow/red) and document counts (`docs.count` in `GET /_cat/indices?v`).
  - Elastic Stack: set up alerting rules (e.g., notify when disk usage exceeds 85%).
- Practical Recommendations:
  - Use `GET /_nodes/stats` to retrieve node statistics.
  - Regularly run `GET /_cluster/health?pretty` to check status.
- Avoid Common Pitfalls:
  - Do not leave `cluster.routing.allocation.enable` set to `none` after maintenance; shards will stay unassigned and the cluster remains yellow/red.
  - Monitor search latency (e.g., `search.query_time_in_millis` in node stats) to avoid query timeouts.
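The disk-usage alert rule mentioned above (notify past 85%, which mirrors the default low disk watermark) reduces to a simple threshold check over per-node usage figures. A sketch with hypothetical values, e.g. as derived from `GET /_cat/allocation?v`:

```python
# Hypothetical per-node disk usage percentages (illustrative values)
disk_usage_pct = {"node_1": 62.0, "node_2": 88.5, "node_3": 71.3}

def nodes_over_watermark(usage, threshold=85.0):
    """Flag nodes above the alert threshold. 85% mirrors the default
    cluster.routing.allocation.disk.watermark.low setting."""
    return [n for n, pct in sorted(usage.items()) if pct > threshold]

print(nodes_over_watermark(disk_usage_pct))  # ['node_2']
```

Nodes in the returned list should be investigated before the high watermark (90% by default) starts forcing shard relocations.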
Conclusion
The best practices for configuring and scaling Elasticsearch clusters revolve around systematic design and dynamic optimization: role separation, proper shard and replica settings, ILM, and monitoring/alerting form the core. Production recommendations:
- Priority: Ensure cluster health (green status) before scaling capacity.
- Continuous Improvement: regularly use `GET /_cluster/stats` to analyze performance bottlenecks, and adjust configurations based on log analysis.
- Security Note: enable `xpack.security.enabled: true` to protect the cluster and prevent unauthorized access.
By following these practices, system reliability can be significantly improved. Refer to the Elasticsearch Official Guide for deeper exploration, or use Docker Compose to quickly deploy a test environment.