What's the Difference Between Elasticsearch's fielddata and doc_values?

February 22, 14:50

In Elasticsearch, how field values are stored for sorting and aggregation is central to performance. When handling large volumes of data, understanding the difference between fielddata and doc_values is crucial, because the choice directly affects the cost of aggregations, sorting, and scripts that read field values. fielddata is disabled by default on text fields, and in current versions (7.0+ included) the recommendation is to rely on doc_values instead to avoid heap pressure and out-of-memory (OOM) errors. This article covers the technical details, use cases, and best practices of both to help developers optimize index design.

What are doc_values

doc_values is Elasticsearch's default on-disk structure for reading field values by document. It is built at index time and stored in a columnar binary format alongside the inverted index. Key characteristics include:

  • Storage location: Written to disk during indexing. At search time it is read through the operating system's filesystem cache rather than loaded into the JVM heap.
  • Primary use: Supports efficient aggregation (e.g., terms aggregation), sorting (e.g., the sort clause of a search), and field access in scripts; the columnar layout allows fast scans over one field across many documents.
  • Memory impact: Minimal heap usage, which makes it suitable for large datasets. The trade-off is extra disk space per field.
  • Applicable fields: Enabled by default for keyword, numeric, date, boolean, ip, and geo_point fields. text fields do not support doc_values; to sort or aggregate on text content, add a keyword sub-field (multi-field).

The workflow is as follows:

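  1. During indexing, Elasticsearch writes each field's values into a compressed, column-oriented binary structure, one per segment.
  2. During search, sorting and aggregation read that structure directly from disk (through the filesystem cache), so nothing has to be built on the heap first.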

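For example, an index mapping that relies on doc_values. The keyword field has doc_values enabled by default (shown explicitly here), and the text field gets a keyword sub-field because text itself cannot have doc_values:

json
PUT /my_index
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "doc_values": true
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}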

What is fielddata

fielddata is an older, in-memory mechanism: at search time Elasticsearch builds a per-document view of a field's terms and keeps it on the heap. Key characteristics include:

  • Storage location: Built in memory (JVM heap) on demand the first time a query needs it; not persisted to disk.
  • Primary use: Sorting, aggregation, and script access on text fields, which have no doc_values to fall back on.
  • Memory impact: High risk! On large datasets it can lead to OOM, especially for high-cardinality fields (many distinct terms) or large data volumes.
  • Applicable fields: Only text fields, where it is disabled by default and must be explicitly enabled (fielddata: true).

The workflow is as follows:

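  1. The first time a query sorts or aggregates on the field, Elasticsearch reads the field's inverted index for each segment and inverts it into a document-to-terms structure held in the fielddata cache on the heap.
  2. The loaded data is not released when the query finishes. It stays in the cache until it is evicted or the segment goes away, so several such fields on a large index can exhaust the heap.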

For example, enabling fielddata in the index mapping (not recommended, and off by default):

json
PUT /my_old_index
{
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "fielddata": true
      }
    }
  }
}

Core Difference Analysis

Storage Location and Lifecycle

  • doc_values: Created during indexing, data stored on disk (e.g., Lucene's DocValues format), lifecycle matches the index, not dependent on search requests.
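  • fielddata: Built in memory on demand at search time, one segment at a time. It exists only on the heap, stays cached after the query that triggered it, and must be rebuilt after eviction or a node restart.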
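
The contrast between the two lifecycles can be illustrated with a toy model. This is only a conceptual sketch, not how Lucene implements either structure: it assumes a single-valued field and plain Python lists, and the names `build_column` and `uninvert` are made up for the illustration. The doc_values-like column is written once when documents are indexed, while the fielddata-like column has to be reconstructed from the inverted index when a query asks for it.

```python
from collections import defaultdict

# Toy documents: doc id -> value of one single-valued field.
docs = {0: "open", 1: "closed", 2: "open", 3: "pending"}


def build_inverted_index(docs):
    """term -> list of doc ids, the structure used for searching."""
    inverted = defaultdict(list)
    for doc_id, term in sorted(docs.items()):
        inverted[term].append(doc_id)
    return dict(inverted)


def build_column(docs):
    """doc id -> term, laid out as a column at index time (doc_values-like)."""
    return [docs[doc_id] for doc_id in sorted(docs)]


def uninvert(inverted, doc_count):
    """Rebuild doc id -> term from the inverted index at query time (fielddata-like)."""
    column = [None] * doc_count
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            column[doc_id] = term
    return column


inverted = build_inverted_index(docs)
column_at_index_time = build_column(docs)             # paid once, stored with the index
column_at_query_time = uninvert(inverted, len(docs))  # paid at search time, held in memory
```

Both columns end up with the same content; what differs is when the work happens and where the result lives. For a real text field the rebuilt structure holds analyzed tokens, usually several per document, which is part of why it grows so large.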

Use Case Comparison

| Feature | doc_values | fielddata |
|---|---|---|
| Performance | Efficient: columnar storage supports fast scanning, suitable for aggregation and sorting | Inefficient: building it on first use adds latency, especially for large datasets |
| Memory consumption | Low: lives on disk and in the filesystem cache, not on the heap | High: can consume several GB of heap, causing OOM |
| Data type | keyword, numeric, date, boolean, ip, geo_point (not text) | Only text |
| Default | Enabled by default on fields that support it | Disabled by default; must be enabled per field |

Performance Impact and Risks

  • doc_values: Aggregations and sorts scan a compact on-disk column and depend on the filesystem cache rather than the heap, so they scale to large indices without a costly build step on first use. The actual gain depends on the data, the query, and the hardware, so measure it on your own workload.
  • fielddata: Memory consumption is the primary risk. For a high-cardinality text field, the fielddata of a large index can occupy gigabytes of heap. fielddata is disabled by default on text fields and guarded by the fielddata circuit breaker, and it is recommended to avoid enabling it.

Key Difference Summary

  • doc_values is precomputed: Prepared during indexing, used directly during search, suitable for persistent scenarios.
  • fielddata is lazy-loaded: Dynamically loaded during search, suitable for temporary operations, but high risk.

Practical Example: Migrating from fielddata to doc_values

Step 1: Check Existing Indexes

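First, verify whether fielddata is in use. Inspect the mapping of each suspect index, and list the fields that currently hold fielddata on the heap:

json
GET /old_index/_mapping
GET /_cat/fielddata?v

In the mapping output, look for text fields that set fielddata: true. The _cat/fielddata output shows, per node, which fields have fielddata loaded and how much memory each uses.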
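
On an index with many fields, scanning the mapping by eye is error-prone. The following is a minimal sketch of automating the check, assuming the mapping has already been fetched and parsed into a Python dict in the typeless shape returned by GET /&lt;index&gt;/_mapping. The helper name `find_fielddata_fields` and the sample mapping are hypothetical.

```python
def find_fielddata_fields(properties, prefix=""):
    """Return dotted paths of fields mapped with "fielddata": true."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("fielddata") is True:
            found.append(path)
        # Recurse into object sub-properties and multi-fields.
        for key in ("properties", "fields"):
            if key in spec:
                found.extend(find_fielddata_fields(spec[key], prefix=f"{path}."))
    return found


# Hypothetical parsed response of GET /old_index/_mapping.
mapping_response = {
    "old_index": {
        "mappings": {
            "properties": {
                "status": {"type": "keyword"},
                "text_field": {"type": "text", "fielddata": True},
                "author": {
                    "properties": {
                        "bio": {"type": "text", "fielddata": True},
                        "name": {"type": "keyword"},
                    }
                },
            }
        }
    }
}

for index_name, body in mapping_response.items():
    properties = body["mappings"].get("properties", {})
    print(index_name, find_fielddata_fields(properties))
```

Every path the function reports is a candidate for the mapping rewrite in the next step.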

Step 2: Rewrite Index Mappings

In new indexes, prioritize doc_values. keyword fields have them by default; for text fields that must be sorted or aggregated, add a keyword sub-field instead of enabling fielddata:

json
PUT /new_index
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword"
      },
      "description": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}
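
Because text fields cannot carry doc_values, one way to move off fielddata is to drop the fielddata flag and add a keyword sub-field that is backed by doc_values. The sketch below applies that rewrite to the properties section of a mapping held as a Python dict. The function name `migrate_properties` and the sub-field name "keyword" are assumptions chosen for the example; any sub-field name works.

```python
import copy


def migrate_properties(properties):
    """Return a copy in which text fields drop fielddata and gain a keyword sub-field."""
    migrated = copy.deepcopy(properties)
    for spec in migrated.values():
        if spec.get("type") == "text" and spec.pop("fielddata", False):
            # Keep an existing "keyword" sub-field if the mapping already has one.
            spec.setdefault("fields", {}).setdefault("keyword", {"type": "keyword"})
        if "properties" in spec:  # object fields: rewrite nested properties too
            spec["properties"] = migrate_properties(spec["properties"])
    return migrated


# Hypothetical old mapping properties.
old_properties = {
    "status": {"type": "keyword"},
    "text_field": {"type": "text", "fielddata": True},
}
new_properties = migrate_properties(old_properties)
```

One behavioral difference to check before switching: fielddata on a text field aggregates over analyzed tokens, while a keyword sub-field aggregates over whole, unanalyzed values. Queries that aggregate or sort must also be pointed at the sub-field (for example text_field.keyword).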

Step 3: Handle Old Indexes (with caution)

For legacy data, create new_index with the new mapping first (Step 2), then use reindex to copy the documents:

json
POST /_reindex
{
  "source": { "index": "old_index" },
  "dest": { "index": "new_index" }
}

Important note: Before migration, execute GET /old_index/_mapping to confirm field types. Avoid setting doc_values: false on keyword or numeric fields that you sort or aggregate on, as it disables those operations for the field.

Step 4: Test Performance

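Compare query performance. The query below sorts and aggregates on description.keyword, which assumes description was mapped with a keyword sub-field of that name:

json
GET /new_index/_search
{
  "size": 10,
  "sort": [
    { "description.keyword": { "order": "asc" } }
  ],
  "aggs": {
    "top_terms": {
      "terms": { "field": "description.keyword", "size": 5 }
    }
  }
}

Observe the took value in the response and compare it, together with heap usage, against the same query on the old index. How much faster doc_values is than fielddata depends on data size, cardinality, and whether fielddata was already cached, so treat any single ratio with caution.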

Recommendations and Best Practices

  1. Prioritize doc_values: In all new indexes, map fields used for sorting or aggregation as keyword (or give text fields a keyword sub-field) rather than enabling fielddata. fielddata is disabled by default on text fields, so leaving it unset is the safe choice.
  2. Monitor memory: Use the _nodes/stats API to track fielddata memory usage:
json
GET /_nodes/stats/indices/fielddata?fields=*

     If high consumption is detected, migrate the affected fields.
  3. Avoid pitfalls:

    • For keyword or numeric fields that are never sorted, aggregated, or read in scripts, set doc_values: false to save disk space (the field stays searchable).

    • Do not try to enable fielddata on keyword fields. The parameter applies only to text fields, and keyword fields already use doc_values.

  4. Performance tuning:

    • Use the indices.fielddata.cache.size node setting to cap the fielddata cache; the fielddata circuit breaker (indices.breaker.fielddata.limit) rejects requests that would load more than its limit.
    • doc_values are compressed by default, so no extra setting is needed for highly repetitive data.
  5. Version upgrade recommendation: When upgrading to Elasticsearch 7.0+, review every mapping that sets fielddata: true and replace it with a keyword sub-field where possible. The official documentation for the text field type advises considering a keyword field before enabling fielddata.
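
The memory monitoring recommendation can be turned into a simple check. The sketch below assumes the node stats response has already been fetched and parsed into a dict, and reads only the per-node memory_size_in_bytes and evictions counters under indices.fielddata. The function name `fielddata_report`, the sample payload, and the 1 GiB threshold are illustrative assumptions, not recommended values.

```python
def fielddata_report(stats, threshold_bytes):
    """Summarize fielddata heap usage per node from a parsed node stats response."""
    report = []
    for node_id, node in stats.get("nodes", {}).items():
        fielddata = node.get("indices", {}).get("fielddata", {})
        used = fielddata.get("memory_size_in_bytes", 0)
        report.append({
            "node": node.get("name", node_id),
            "bytes": used,
            "evictions": fielddata.get("evictions", 0),
            "over_threshold": used > threshold_bytes,
        })
    return report


# Hypothetical, trimmed-down response body.
sample_stats = {
    "nodes": {
        "node-id-1": {
            "name": "es-data-1",
            "indices": {"fielddata": {"memory_size_in_bytes": 3_221_225_472, "evictions": 12}},
        },
        "node-id-2": {
            "name": "es-data-2",
            "indices": {"fielddata": {"memory_size_in_bytes": 1024, "evictions": 0}},
        },
    }
}

report = fielddata_report(sample_stats, threshold_bytes=1 * 1024**3)
for row in report:
    print(row)
```

A node that stays over the threshold, or whose eviction count keeps climbing, is a sign that some query still depends on fielddata and the field behind it should be migrated.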

Conclusion

The core difference between doc_values and fielddata lies in storage location and memory management: doc_values is an efficient structure precomputed at index time and stored on disk, suitable for production environments; fielddata is built on the heap at search time, carries a high memory risk, and is disabled by default on text fields. Developers should prioritize doc_values and optimize Elasticsearch performance through index mappings, monitoring, and migration strategies. Moving sorting and aggregation onto doc_values-backed fields not only boosts query speed but also avoids severe memory issues. In actual projects, configure field storage based on actual data scale and query patterns to improve system robustness.

Tags: ElasticSearch