How Does Elasticsearch Implement Cross-Cluster Replication (CCR)? - Interview Question

Elasticsearch Cross-Cluster Replication (CCR) is a feature that has been generally available since Elasticsearch 6.7. It asynchronously replicates indices from one cluster to another, which supports disaster recovery, high availability, and serving reads close to users. It is built on a leader cluster and follower cluster architecture and is particularly suitable for multi-region deployments. This article covers the implementation principles, configuration steps, and best practices of CCR to help developers build cross-cluster data streams efficiently.

What is Elasticsearch Cross-Cluster Replication (CCR)?

CCR is a unidirectional replication mechanism that copies indices from one cluster (the source, or leader) to another (the target, or follower) in near real time. Data flows only from the leader to the follower, and the follower index is read-only. Replication is pull-based: the follower cluster connects to the leader through the remote cluster mechanism and requests new operations, so the leader never has to open connections to the follower. Bidirectional setups are built from two unidirectional ones, each with its own leader index.

Key components include:

Leader Cluster: Data source cluster that holds the leader indices. It needs no remote cluster configuration for one-way replication.
Follower Cluster: Data receiving cluster, configured with a cluster.remote entry pointing to the leader cluster.
Replication Stream: Data synchronization channel. Follower shards pull operations from the matching leader shards and apply them in sequence number order.

CCR offers advantages such as:

Low-latency synchronization: Operations written to the leader are pulled by the follower shortly afterwards over the transport protocol.
High availability: A follower cluster in another region can keep serving the data if the leader cluster is lost, supporting cross-region disaster recovery.
Resource optimization: After the initial copy of the index files, only new operations are replicated, reducing bandwidth consumption.

1. Remote Cluster Configuration

CCR's foundation is remote cluster registration. Because replication is pull-based, the follower cluster must register the leader cluster under an alias, either in elasticsearch.yml or dynamically through the cluster settings API:

yaml
# Follower cluster configuration: register the leader under the alias "leader-cluster"
cluster.remote.leader-cluster.seeds: ["leader-cluster-node1:9300"]

The leader cluster needs a matching entry only when it must also follow indices from the other cluster (bidirectional replication):

yaml
# Leader cluster configuration (optional, bidirectional setups only)
cluster.remote.follower-cluster.seeds: ["follower-cluster-node1:9300", "follower-cluster-node2:9300"]

Note: The alias that follows cluster.remote is chosen locally and must be unique within the cluster that defines it; it does not have to match the remote cluster's cluster.name. The seeds must point to the transport port (9300 by default), not the HTTP port. Verify the connection with the GET /_remote/info API, which should report "connected": true for the alias.
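Remote clusters can also be registered dynamically with PUT /_cluster/settings on the follower cluster. The following is a minimal sketch of how that request body could be assembled; the helper name remote_cluster_settings and the alias and host values are assumptions chosen for illustration, and nothing is sent over the network.

```python
import json


def remote_cluster_settings(alias, seeds):
    """Build a body for PUT /_cluster/settings that registers a remote cluster.

    Hypothetical helper: only assembles the JSON structure, it does not
    contact any cluster.
    """
    if not seeds:
        raise ValueError("at least one seed node is required")
    for seed in seeds:
        host, sep, port = seed.rpartition(":")
        # Seeds are transport addresses of the form host:port (9300 by default).
        if not sep or not host or not port.isdigit():
            raise ValueError(f"seed must look like host:port, got {seed!r}")
    return {"persistent": {"cluster": {"remote": {alias: {"seeds": list(seeds)}}}}}


if __name__ == "__main__":
    body = remote_cluster_settings("leader-cluster", ["leader-cluster-node1:9300"])
    print(json.dumps(body, indent=2))
```

Using the persistent section keeps the registration across full cluster restarts, which is usually what a long-lived replication link needs.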

2. Index-level Replication Configuration

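CCR operates at the index level and must be enabled explicitly. The leader index is an ordinary index with soft deletes enabled (the default for indices created in 7.0 and later). Replication starts when a follower index is created on the follower cluster with the follow API:

json
PUT /my-index/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "leader-cluster",
  "leader_index": "my-index"
}

Key parameters:

remote_cluster: The alias of the leader cluster, as registered under cluster.remote on the follower cluster.
leader_index: The name of the index to follow on the leader cluster. The follower index name is taken from the request path and may differ from the leader index name.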
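As an illustration of the follow API (PUT /FOLLOWER_INDEX/_ccr/follow with remote_cluster and leader_index in the body), the sketch below assembles such a request as plain data. The function name follow_request and the index and alias values are assumptions for this example; the request is built but not sent.

```python
import json
from urllib.parse import quote


def follow_request(follower_index, remote_cluster, leader_index, wait_for_active_shards=1):
    """Return (method, path, body) for a create-follower request.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    path = (
        f"/{quote(follower_index, safe='')}/_ccr/follow"
        f"?wait_for_active_shards={wait_for_active_shards}"
    )
    body = {"remote_cluster": remote_cluster, "leader_index": leader_index}
    return "PUT", path, body


if __name__ == "__main__":
    method, path, body = follow_request("my-index-copy", "leader-cluster", "my-index")
    print(method, path)
    print(json.dumps(body))
```

The follower index name lives in the path and the leader index name in the body, which is why the two can differ.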

3. Data Synchronization Process

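Data synchronization occurs in three stages:

Data Writing: Client writes to the leader cluster; Elasticsearch assigns each operation a sequence number on the primary shard.
Stream Transmission: Each follower shard sends read requests to its leader shard over the remote cluster connection, asking for operations after the last sequence number it has applied. If none are available, the request waits on the leader until new operations arrive or the read poll timeout expires.
Apply and Checkpoint: The follower writes the returned operations to its own primary and replica shards and advances its global checkpoint, which marks where the next read request starts.

CCR Data Synchronization Process

Important note: A newly created follower index is first initialized through remote recovery, which copies the leader's existing segment files. After that, CCR depends on soft deletes and shard history retention leases on the leader to keep operations available until the follower has read them. If the follower stays disconnected longer than the retention lease period (index.soft_deletes.retention_lease.period, 12h by default), the required history may be gone and the follower index has to be re-created.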
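The role of sequence numbers in keeping replicated operations ordered can be illustrated with a toy model. This is not Elasticsearch code: the classes LeaderShard and FollowerShard and the batch size are invented for illustration, and the sketch assumes a follower that repeatedly asks the leader for operations after the last sequence number it has applied.

```python
class LeaderShard:
    """Toy leader shard: the list position of an operation is its sequence number."""

    def __init__(self):
        self.ops = []

    def index(self, doc):
        self.ops.append(doc)
        return len(self.ops) - 1  # sequence number assigned to this write

    def read(self, from_seq_no, max_ops):
        end = min(from_seq_no + max_ops, len(self.ops))
        return [(seq_no, self.ops[seq_no]) for seq_no in range(from_seq_no, end)]


class FollowerShard:
    """Toy follower shard that pulls operations in sequence number order."""

    def __init__(self):
        self.docs = {}
        self.checkpoint = -1  # highest sequence number applied so far

    def poll(self, leader, max_ops=2):
        batch = leader.read(self.checkpoint + 1, max_ops)
        for seq_no, doc in batch:
            self.docs[seq_no] = doc
            self.checkpoint = seq_no
        return len(batch)


def replicate_until_caught_up(leader, follower, max_ops=2):
    """Poll until the leader has nothing new; return the number of non-empty polls."""
    polls = 0
    while follower.poll(leader, max_ops) > 0:
        polls += 1
    return polls


if __name__ == "__main__":
    leader, follower = LeaderShard(), FollowerShard()
    for i in range(5):
        leader.index({"id": i})
    print(replicate_until_caught_up(leader, follower), follower.checkpoint)
```

Because the follower always resumes from its own checkpoint, a dropped connection only delays replication; it does not reorder or skip operations as long as the leader still retains them.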

Practical Configuration: Setting Up CCR Clusters

The following steps demonstrate CCR configuration in production environments.

Step 1: Initialize Remote Clusters

On the follower cluster (using curl):

bash
# Register the leader cluster under the alias "leader-cluster"
curl -X PUT "http://follower-cluster:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent":{"cluster":{"remote":{"leader-cluster":{"seeds":["leader-cluster-node1:9300"]}}}}}'

# Verify connection (expect "connected": true for the alias)
curl -X GET "http://follower-cluster:9200/_remote/info"
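The GET /_remote/info response is a JSON object keyed by remote cluster alias, with a connected flag per entry. A small check like the following sketch could flag aliases that are not connected; the sample response is hand-written for illustration (including the hypothetical alias archive-cluster), and the function name is an assumption.

```python
def disconnected_remotes(remote_info):
    """Return the aliases whose 'connected' flag is missing or false."""
    return sorted(
        alias for alias, info in remote_info.items() if not info.get("connected", False)
    )


if __name__ == "__main__":
    # Hand-written sample shaped like a GET /_remote/info response.
    sample = {
        "leader-cluster": {
            "connected": True,
            "mode": "sniff",
            "seeds": ["leader-cluster-node1:9300"],
            "num_nodes_connected": 1,
        },
        "archive-cluster": {
            "connected": False,
            "mode": "sniff",
            "seeds": ["archive-node1:9300"],
            "num_nodes_connected": 0,
        },
    }
    print(disconnected_remotes(sample))
```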

Step 2: Configure Index Replication

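On the leader cluster, create the index as usual. No CCR-specific index setting is required; soft deletes, which CCR depends on, are enabled by default for indices created in 7.0 and later:

json
PUT /my-index
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}

On the follower cluster, do not create the index manually. The follower index is created when replication is started in Step 3, and it takes its mappings and number of primary shards from the leader index.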

Step 3: Start Data Replication

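Start replication by calling the follow API on the follower cluster:

json
PUT /my-index/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "leader-cluster",
  "leader_index": "my-index"
}

Verify synchronization status: Use GET /my-index/_ccr/info on the follower cluster to check the follower index; "status": "active" indicates that replication is running, and "paused" that it has been paused. GET /my-index/_ccr/stats reports per-shard progress, such as leader_global_checkpoint and follower_global_checkpoint.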
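Replication progress can be quantified from the follower stats API (GET /my-index/_ccr/stats), which reports leader_global_checkpoint and follower_global_checkpoint per shard. The sketch below computes the difference as a lag in operations; the function name shard_lag and the sample numbers are assumptions made up for this example.

```python
def shard_lag(ccr_stats):
    """Map (index, shard_id) to the number of operations the follower is behind."""
    lag = {}
    for index_stats in ccr_stats.get("indices", []):
        for shard in index_stats.get("shards", []):
            key = (index_stats["index"], shard["shard_id"])
            lag[key] = shard["leader_global_checkpoint"] - shard["follower_global_checkpoint"]
    return lag


if __name__ == "__main__":
    # Hand-written sample shaped like a follower stats response.
    sample = {
        "indices": [
            {
                "index": "my-index",
                "shards": [
                    {
                        "shard_id": 0,
                        "leader_global_checkpoint": 1024,
                        "follower_global_checkpoint": 1000,
                    }
                ],
            }
        ]
    }
    print(shard_lag(sample))
```

A lag that stays near zero suggests the follower keeps up; a lag that grows steadily is a signal to look at network throughput or the follow parameters.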

Step 4: Monitoring and Troubleshooting

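Monitoring metrics: Check per-shard fields such as leader_global_checkpoint, follower_global_checkpoint, operations_written, and failed_read_requests with GET /my-index/_ccr/stats, or view replication status in Kibana.
Common issues:
- Network problems: Verify firewall rules to ensure the follower cluster can reach the leader's transport port (9300 by default).
- High latency: Pause the follower and resume it with adjusted follow parameters such as max_read_request_operation_count, max_outstanding_read_requests, or read_poll_timeout (default 1m).
- Replication failures: Follower indices are read-only, so write conflicts do not occur. Inspect the read_exceptions and fatal_exception fields returned by GET /my-index/_ccr/stats instead.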
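When a follower falls behind, one option is to pause it and resume it with different follow parameters, using POST /my-index/_ccr/pause_follow and POST /my-index/_ccr/resume_follow. The following sketch builds that two-request sequence as data; the helper name retune_follower and the example values are assumptions, and suitable values depend on the workload.

```python
# Follow parameters accepted in the resume_follow request body.
FOLLOW_PARAMETERS = {
    "max_read_request_operation_count",
    "max_outstanding_read_requests",
    "max_read_request_size",
    "max_write_request_operation_count",
    "max_write_request_size",
    "max_outstanding_write_requests",
    "max_write_buffer_count",
    "max_write_buffer_size",
    "max_retry_delay",
    "read_poll_timeout",
}


def retune_follower(follower_index, **overrides):
    """Return the pause and resume requests as (method, path, body) tuples.

    Hypothetical helper: it validates parameter names but sends nothing.
    """
    unknown = sorted(set(overrides) - FOLLOW_PARAMETERS)
    if unknown:
        raise ValueError(f"unknown follow parameters: {unknown}")
    return [
        ("POST", f"/{follower_index}/_ccr/pause_follow", None),
        ("POST", f"/{follower_index}/_ccr/resume_follow", dict(overrides)),
    ]


if __name__ == "__main__":
    for request in retune_follower("my-index", max_outstanding_read_requests=16):
        print(request)
```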

Best Practices and Recommendations

Network configuration: Ensure low-latency, high-bandwidth connections between clusters. Use VPC networks for isolation to avoid public internet risks.
Data volume management: Only replicate necessary indices. CCR is asynchronous and does not block writes on the leader, but each follower adds read load there, so monitor follower lag on write-heavy indices.
Security hardening: Encrypt the transport connections between clusters with TLS (xpack.security.transport.ssl.enabled: true), and restrict the replication user to the minimum privileges: read_ccr on the leader cluster and manage_ccr on the follower cluster.
Disaster recovery design: Configure replicas on the follower index so that the follower cluster itself has no single point of failure. For example, set index.number_of_replicas: 2.
Test environment: Validate CCR in development clusters first. Test the follow request using curl:

bash
curl -X PUT "http://follower-cluster:9200/my-index/_ccr/follow?wait_for_active_shards=1" -H 'Content-Type: application/json' -d '{"remote_cluster": "leader-cluster", "leader_index": "my-index"}'
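For the access control recommendation above, a common pattern is one role on each cluster. The sketch below builds two role bodies that could be sent to PUT /_security/role/ROLE_NAME; the privilege lists reflect my understanding of the documented CCR privileges and should be checked against the documentation for the version in use. The function name ccr_roles and the index pattern are assumptions.

```python
import json


def ccr_roles(index_patterns):
    """Return (leader_role, follower_role) bodies for the security role API.

    Assumed privilege sets: verify them against the official CCR
    documentation for your Elasticsearch version before use.
    """
    names = list(index_patterns)
    leader_role = {
        "cluster": ["read_ccr"],
        "indices": [{"names": names, "privileges": ["monitor", "read"]}],
    }
    follower_role = {
        "cluster": ["manage_ccr"],
        "indices": [
            {
                "names": names,
                "privileges": ["monitor", "read", "write", "manage_follow_index"],
            }
        ],
    }
    return leader_role, follower_role


if __name__ == "__main__":
    leader_role, follower_role = ccr_roles(["my-index"])
    print(json.dumps(leader_role, indent=2))
    print(json.dumps(follower_role, indent=2))
```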

Conclusion

Elasticsearch CCR achieves efficient and reliable cross-cluster replication by combining remote cluster connections with pull-based, sequence-number-ordered replication of operations. It is suitable for cloud-native architectures and multi-region deployments, significantly enhancing system resilience. Developers should follow a simple workflow to avoid common pitfalls: register the remote cluster, create the follower index, then monitor and verify. For large-scale production environments, use Elastic Stack monitoring to continuously track synchronization health. With proper configuration, CCR can serve as a core foundation for building distributed data platforms.

Reference resources: Elasticsearch Official CCR Documentation