
How to remove duplicate documents from a search in Elasticsearch

1 Answer


Identifying and removing duplicate documents in Elasticsearch search results is a common requirement, especially during data integration or data cleaning processes. Typically, the concept of 'duplicates' can be defined based on a specific field or a combination of multiple fields. Here is one method to identify and remove these duplicate documents:

Step 1: Use Aggregation to Identify Duplicate Documents

Assume we want to identify duplicate documents based on a field (e.g., title). We can use Elasticsearch's aggregation feature to find which title values appear multiple times.

json
GET /your_index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_titles": {
      "terms": {
        "field": "title.keyword",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicate_documents": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}

This query does not return standard search results for documents ("size": 0), but instead returns an aggregation named duplicate_titles that lists all title values appearing two or more times (set via "min_doc_count": 2). For each such title, the "top_hits" aggregation will return detailed information for up to 10 documents with that title.
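If you plan to run this check from code rather than Kibana, the request body can be built programmatically. A minimal sketch (the helper name `build_duplicate_query` and its default sizes are assumptions, not part of any API):

```python
def build_duplicate_query(field, min_count=2, top_hits=10):
    """Build a search body that finds values of `field` occurring at
    least `min_count` times, returning up to `top_hits` docs per value."""
    return {
        "size": 0,  # suppress normal search hits; we only want the aggregation
        "aggs": {
            "duplicate_titles": {
                "terms": {"field": field, "min_doc_count": min_count},
                "aggs": {
                    "duplicate_documents": {"top_hits": {"size": top_hits}}
                },
            }
        },
    }

body = build_duplicate_query("title.keyword")
```

Parameterizing the field and thresholds this way makes it easy to reuse the same query for other dedup keys later.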

Step 2: Delete Duplicate Documents Based on Requirements

Once we have the specific information about duplicate documents, the next step is to decide how to handle these duplicates. If you want to automatically delete these duplicates, you typically need a script or program to parse the results of the above aggregation query and perform the deletion.

Here is a simple method to delete all duplicate documents except the most recent one (assuming each document has a timestamp field):

python
from elasticsearch import Elasticsearch

es = Elasticsearch()

response = es.search(
    index="your_index",
    body={
        "size": 0,
        "aggs": {
            "duplicate_titles": {
                "terms": {"field": "title.keyword", "min_doc_count": 2},
                "aggs": {
                    "duplicate_documents": {
                        "top_hits": {
                            "size": 10,
                            "sort": [{"timestamp": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    },
)

for title_bucket in response["aggregations"]["duplicate_titles"]["buckets"]:
    docs = title_bucket["duplicate_documents"]["hits"]["hits"]
    # Keep the first hit (the most recent, thanks to the descending
    # timestamp sort) and delete the rest.
    for doc in docs[1:]:
        es.delete(index="your_index", id=doc["_id"])

Notes

  • Before deleting documents, ensure you back up relevant data to prevent accidental deletion of important data.
  • For large indices, such operations can be expensive; it is best to run them during off-peak hours.
  • Adjust the method to your business requirements; for example, you may need to define duplicates based on a combination of fields rather than a single one.
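For the multi-field case, one option is a composite aggregation that pages over every unique combination of the key fields. Note that composite aggregations do not support min_doc_count, so the client must filter for buckets with doc_count greater than 1. A minimal sketch (the helper name and the example fields title.keyword / author.keyword are assumptions):

```python
def build_composite_duplicates(fields, page_size=100):
    """Build a search body with a composite aggregation paging over
    unique combinations of `fields`. The caller inspects each returned
    bucket and treats doc_count > 1 as a duplicate group."""
    sources = [{f: {"terms": {"field": f}}} for f in fields]
    return {
        "size": 0,
        "aggs": {
            "dupes": {
                "composite": {"size": page_size, "sources": sources}
            }
        },
    }

body = build_composite_duplicates(["title.keyword", "author.keyword"])
```

To page through all combinations, pass the after_key of each response back into the composite aggregation's "after" parameter until no buckets remain.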

This way, we can effectively identify and remove duplicate documents in Elasticsearch.

June 29, 2024, 12:07
