In daily Elasticsearch operations, migrating and rebuilding index data are common requirements, especially during data architecture upgrades, cluster scaling, or disaster recovery. For instance, when moving an older index to a new cluster version, or when rebuilding an index after a storage-policy change, improper handling can cause data loss or service disruption. This article examines Elasticsearch's officially recommended methods for migration and rebuilding, combined with practical examples and code snippets, to provide actionable solutions.

According to the Elasticsearch official documentation, index migration refers to copying data from one index to another, while an index rebuild reorganizes the data structure or content. Both require prioritizing data consistency and minimizing performance impact.
1. Core Methodologies for Migration and Rebuilding
Elasticsearch provides three primary approaches: the _reindex API (real-time data replication), Snapshot and Restore (backup and recovery), and Ingest Pipelines (data transformation during migration). Choose based on the scenario: for small data volumes requiring low latency, _reindex is recommended; for large-scale clusters or cross-version compatibility, Snapshot and Restore is safer. Key principles include:
- Data Consistency Assurance: when using `_reindex`, handle concurrent writes via the `conflicts` parameter (version conflicts otherwise abort the operation), and ideally stop writes to the source index during migration.
- Performance Optimization: for large indices, leave `refresh` disabled during the reindex (the default) and temporarily set the destination index's `refresh_interval` to `-1` to reduce I/O pressure.
- Security Validation: after migration, perform validation checks (e.g., document-count comparison) to ensure data integrity.
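The selection criteria above can be encoded as a small decision helper. This is an illustrative sketch, not an official API; the document-count threshold and the returned labels are hypothetical and should be tuned to your cluster.

```python
def choose_migration_method(doc_count: int,
                            cross_major_version: bool,
                            needs_transform: bool) -> str:
    """Pick a migration approach per the criteria above (illustrative)."""
    # Snapshot and Restore is safer for very large indices or when
    # moving data between cluster versions.
    if cross_major_version or doc_count > 100_000_000:
        return "snapshot-and-restore"
    # Field/structure changes during migration call for an ingest pipeline.
    if needs_transform:
        return "reindex-with-ingest-pipeline"
    # Small, low-latency migrations: plain _reindex.
    return "reindex"

print(choose_migration_method(50_000, False, False))  # reindex
```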
2. Detailed Implementation Steps
2.1 Using _reindex API for Data Migration
The `_reindex` API is a core Elasticsearch tool (available since version 2.3) that supports full migrations, and incremental ones when combined with a source query filter. The migration steps are:
- Prepare Source and Target Indices: ensure the source index (e.g., `old_index`) has correct mappings and settings, and create the target index (e.g., `new_index`) with the desired mappings before migrating, since `_reindex` does not copy them.
- Execute Migration Command: copy data to the target index via an HTTP request. Note that `requests_per_second` is a URL query parameter, not a body field. Example:

```json
POST /_reindex?requests_per_second=10
{
  "source": { "index": "old_index" },
  "dest": { "index": "new_index", "op_type": "create" },
  "conflicts": "proceed"
}
```
- Key Parameters: `op_type: create` indexes only documents that are missing from the target (it does not overwrite existing ones); `conflicts: proceed` continues past version conflicts instead of aborting; `requests_per_second` throttles throughput to avoid overload.
- Verify Results: check the response's `total`, `created`, and `failures` fields to confirm completeness. For example:

```json
{ "took": 5000, "total": 100000, "created": 100000, "updated": 0, "failures": [] }
```
Practical Advice: for indices with over 1 million documents, migrate in batches. Use `source.size` to control the per-batch document count and the `slices` parameter to parallelize; the `scroll` URL parameter (e.g., `?scroll=5m`) controls how long each batch's search context stays alive.
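The batching advice above can be sketched as a helper that assembles the URL query parameters and request body for a throttled, sliced `_reindex` call. The parameter names follow the `_reindex` API; the helper itself and its default values are hypothetical.

```python
def build_reindex_request(source: str, dest: str, slices: int = 4,
                          batch_size: int = 1000,
                          requests_per_second: int = 500):
    """Return (query_params, body) for a POST /_reindex call."""
    params = {
        "slices": slices,                            # parallel sub-requests
        "requests_per_second": requests_per_second,  # throttle throughput
        "wait_for_completion": "false",              # run as background task
        "scroll": "5m",                              # search-context lifetime
    }
    body = {
        "conflicts": "proceed",                      # skip version conflicts
        "source": {"index": source, "size": batch_size},  # per-batch size
        "dest": {"index": dest, "op_type": "create"},     # only create new docs
    }
    return params, body

params, body = build_reindex_request("old_index", "new_index")
```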
2.2 Using Snapshot and Restore for Index Rebuilding
When rebuilding an index completely (e.g., for version upgrades or index structure changes), Snapshot and Restore is preferred. Snapshots provide consistent point-in-time copies, enabling migration without data loss:
- Create Snapshot Repository: first register the storage repository (e.g., S3 or a shared filesystem path; the `fs` type requires the location to be listed under `path.repo` in `elasticsearch.yml`):

```json
PUT /_snapshot/my_repository
{
  "type": "fs",
  "settings": { "location": "/mnt/snapshots" }
}
```
- Generate Source Index Snapshot:

```json
PUT /_snapshot/my_repository/old_snapshot?wait_for_completion=true
{
  "indices": "old_index",
  "ignore_unavailable": true,
  "include_global_state": false
}
```
- Restore to New Index: note that `indices` names the index inside the snapshot (the source index), and the rename parameters map it to the new name; the correct parameter name is `rename_replacement`:

```json
POST /_snapshot/my_repository/old_snapshot/_restore
{
  "indices": "old_index",
  "rename_pattern": "old_index",
  "rename_replacement": "new_index"
}
```
- Advantages: snapshots are incremental at the segment level, so repeated snapshots avoid full-copy overhead; the `rename_pattern` and `rename_replacement` parameters enable restoring under a different index name.
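`rename_pattern` is a regular expression and `rename_replacement` may reference its capture groups, so the mapping can be previewed locally before running `_restore`. Note this sketch uses Python's `\1` back-reference syntax, whereas Elasticsearch uses Java-style `$1`; the index names are examples.

```python
import re

def preview_restore_rename(index_names, rename_pattern, rename_replacement):
    """Map each snapshot index name to the name it would restore under."""
    return {name: re.sub(rename_pattern, rename_replacement, name)
            for name in index_names}

mapping = preview_restore_rename(["old_index", "old_logs"],
                                 r"old_(.+)", r"new_\1")
# {'old_index': 'new_index', 'old_logs': 'new_logs'}
```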
2.3 Advanced Data Transformation and Rebuilding
If data format conversion is needed during migration (e.g., field mapping changes), combine with Ingest Pipeline:
- Define Transformation Pipeline: create a pipeline that, for example, copies the old field `old_field` into `new_field` (ingest templates reference document fields directly, e.g. `{{old_field}}`):

```json
PUT _ingest/pipeline/rebuild_pipeline
{
  "description": "Rebuild index with field transformation",
  "processors": [
    { "set": { "field": "new_field", "value": "{{old_field}}" } }
  ]
}
```
- Integrate with `_reindex`: reference the pipeline in the migration request:

```json
POST /_reindex
{
  "source": { "index": "old_index" },
  "dest": { "index": "new_index", "pipeline": "rebuild_pipeline" }
}
```
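To sanity-check the field mapping before reindexing, the `set` processor's effect can be simulated locally on sample documents. This is a simplified stand-in for illustration, not Elasticsearch's actual ingest implementation.

```python
def apply_set_processor(doc: dict, target_field: str, source_field: str) -> dict:
    """Mimic a set processor whose value is a {{source_field}} template."""
    transformed = dict(doc)  # leave the original document untouched
    if source_field in doc:
        transformed[target_field] = doc[source_field]
    return transformed

doc = {"old_field": "value-1", "id": 7}
print(apply_set_processor(doc, "new_field", "old_field"))
# {'old_field': 'value-1', 'id': 7, 'new_field': 'value-1'}
```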
3. Practical Considerations
- Performance Monitoring: during migration, poll `GET _nodes/stats` to monitor cluster load in real time, and keep disk usage below the `cluster.routing.allocation.disk.watermark.low` threshold.
- Data Consistency Verification: after migration, compare document counts between source and target, e.g.:

```json
GET /new_index/_count
{
  "query": { "match_all": {} }
}
```
- Security Risks: before production operations, validate scripts in a test cluster; if security is enabled, use the `_security` APIs to verify that the executing user has the required index privileges.
- Error Handling: `_reindex` has no built-in rollback. Run it as a background task (`?wait_for_completion=false`) to obtain a task ID, monitor progress via `GET _tasks/<task_id>`, and cancel a misbehaving run with `POST _tasks/<task_id>/_cancel`. If a run fails partway, the simplest recovery is to delete the partially populated target index and rerun:

```json
POST /_reindex?wait_for_completion=false
{
  "source": { "index": "old_index" },
  "dest": { "index": "new_index" }
}
```
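The verification and error-handling steps above can be sketched as two small helpers: one compares `_count` responses between source and target, the other classifies a completed `_reindex` response into a next action. The response field names follow the respective APIs; the recovery labels are hypothetical.

```python
def counts_match(source_count_resp: dict, dest_count_resp: dict) -> bool:
    """Compare the 'count' field of two _count API responses."""
    return source_count_resp["count"] == dest_count_resp["count"]

def reindex_outcome(reindex_resp: dict) -> str:
    """Classify a completed _reindex response into a recovery action."""
    if reindex_resp.get("failures"):
        # Bulk failures: safest to delete the partial target and rerun.
        return "delete-target-and-rerun"
    if reindex_resp.get("version_conflicts", 0) > 0:
        # conflicts=proceed skipped some docs; inspect them individually.
        return "inspect-conflicts"
    return "ok"

print(reindex_outcome({"total": 100, "failures": [], "version_conflicts": 0}))
# ok
```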
Professional Insight: according to the Elasticsearch official guide (Elasticsearch Index Migration Guide), migration should always be performed during off-peak hours to minimize impact on search performance. For billion-scale indices, use the `slices` parameter of `_reindex` for parallelized (sliced-scroll) processing; `search_after` is a search-pagination feature, not a `_reindex` option.
Conclusion
Migrating and rebuilding index data is a critical operational task in Elasticsearch, requiring a combination of the _reindex API, Snapshot and Restore, and Ingest Pipelines to ensure data safety and efficiency. With the code examples and practical advice in this article, developers can handle migrations systematically: first verify the source index structure, then select the appropriate method, and finally rigorously validate the results. Remember, data consistency is the core goal; do not skip validation steps, or production incidents may follow. For high-load scenarios, use monitoring tools such as Elastic APM to track metrics, and regularly rehearse recovery procedures. Ultimately, an Elasticsearch migration strategy should align with business needs to enable seamless upgrades.