In Elasticsearch, how field values are stored is central to performance tuning. When handling large volumes of data, understanding the difference between fielddata and doc_values is crucial, because they determine how efficiently aggregations, sorting, and scripted field access run. fielddata on text fields has been disabled by default since Elasticsearch 5.0, and that is still the case in 7.0+; the recommended approach is to rely on doc_values and avoid the out-of-memory (OOM) problems that fielddata can cause. This article covers the technical details, use cases, and best practices of both, to help developers optimize index design.
What are doc_values
doc_values is Elasticsearch's default on-disk, column-oriented structure for field values. It is built at index time and stored alongside the inverted index. Key characteristics include:
- Storage location: Created during indexing and written to disk. Values are read through the operating system's filesystem cache rather than being loaded onto the JVM heap.
- Primary use: Supports efficient aggregations (e.g., the terms aggregation) and sorting (e.g., the sort clause of a search), because the columnar layout makes it fast to look up the values of a given document.
- Memory impact: Minimal heap usage, which makes it suitable for large datasets.
- Applicable fields: Enabled by default for every field type that supports it, such as keyword, numeric, date, boolean, and ip fields. text fields do not support doc_values; to sort or aggregate on text content, map a keyword sub-field (multi-field) instead.
The workflow is as follows:
- During indexing, Elasticsearch writes each field's values to a compressed, column-oriented structure on disk.
- During search, values are read from disk through the filesystem cache, so they do not have to be loaded onto the JVM heap.
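For example, the following mapping relies on doc_values for a keyword field (the setting is shown only for clarity, since true is the default) and gives the text field a keyword sub-field, because doc_values cannot be set on text:

```json
PUT /my_index
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "doc_values": true
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}
```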
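To make the columnar idea concrete, here is a minimal Python sketch of a doc_values-style column. It is a simplified model, not Lucene's actual on-disk format: the function names `build_doc_values` and `terms_aggregation` are hypothetical, and the column is just a sorted term dictionary plus one ordinal per document.

```python
from collections import Counter


def build_doc_values(values_by_doc):
    """Build a toy column: a sorted term dictionary plus one ordinal per document."""
    terms = sorted(set(values_by_doc))
    ordinal_of = {term: i for i, term in enumerate(terms)}
    ordinals = [ordinal_of[value] for value in values_by_doc]
    return terms, ordinals


def terms_aggregation(terms, ordinals, size=10):
    """Count ordinals in one pass over the column, then resolve the top buckets to terms."""
    counts = Counter(ordinals)
    # Highest count first; ties are broken by ordinal, i.e. by term order.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return [(terms[ordinal], count) for ordinal, count in top]


# One value per document, in document-ID order.
statuses = ["open", "closed", "open", "pending", "open", "closed"]
terms, ordinals = build_doc_values(statuses)
buckets = terms_aggregation(terms, ordinals, size=2)
```

The aggregation only touches small integers in document order, which is why a column prepared at index time is cheap to scan at search time.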
What are fielddata
fielddata is an older mechanism that builds an in-memory structure from the inverted index the first time a field is used for sorting, aggregations, or scripting. Key characteristics include:
- Storage location: Built on demand at search time and held on the JVM heap; it is not persisted to disk.
- Primary use: Sorting, aggregations, and scripts on text fields, which have no doc_values to fall back on.
- Memory impact: High risk. Large datasets can lead to OOM, especially when the field has many distinct values (high cardinality).
- Applicable fields: text fields only, and it must be explicitly enabled (fielddata: true).
The workflow is as follows:
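- On first use, Elasticsearch reads the inverted index of every segment from disk, inverts the term-to-document relationship, and caches the result on the heap.
- The cache is not released when the query finishes. Entries stay on the heap until the segment goes away or they are evicted, so fielddata on large or high-cardinality fields can exhaust memory.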
For example, enabling fielddata in the index mapping (not recommended; a keyword sub-field usually serves the same purpose):

```json
PUT /my_old_index
{
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "fielddata": true
      }
    }
  }
}
```
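The cost of fielddata comes from inverting the term-to-document relationship at search time. The following Python sketch models that step under simplifying assumptions: the inverted index is a plain dictionary, and `uninvert` is a hypothetical name, not an Elasticsearch or Lucene function.

```python
def uninvert(inverted_index, doc_count):
    """Turn term -> doc IDs into doc ID -> terms, the shape sorting and aggregations need."""
    per_doc = [[] for _ in range(doc_count)]
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            per_doc[doc_id].append(term)
    return per_doc


# Toy inverted index for an analyzed text field across three documents.
inverted = {"quick": [0, 2], "brown": [0], "fox": [0, 1, 2], "lazy": [1]}
fielddata = uninvert(inverted, doc_count=3)

# Every (document, term) pair becomes an in-memory entry, so the structure
# grows with both the number of documents and the number of distinct terms.
entries = sum(len(doc_terms) for doc_terms in fielddata)
```

Because an analyzed text field produces many terms per document, the resulting structure can be much larger than a single-valued keyword column.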
Core Difference Analysis
Storage Location and Lifecycle
- doc_values: Created during indexing and stored on disk (in Lucene's DocValues format). Its lifecycle matches the index segments and does not depend on search requests.
- fielddata: Built on the heap on demand at search time. It is not persisted, so it has to be rebuilt after a restart and for every new segment.
Use Case Comparison
| Feature | doc_values | fielddata |
|---|---|---|
| Performance | Efficient: Columnar storage supports fast scanning, suitable for aggregation and sorting | Costly on first use: the structure must be built from the inverted index, which adds latency on large datasets |
| Memory consumption | Low heap usage: data stays on disk and is served from the filesystem cache | High heap usage: can consume several GB and cause OOM |
| Data type | keyword, numeric, date, boolean, ip, and other non-text types (on by default) | text only (off by default) |
| Elasticsearch version | Default in all current versions | Disabled by default on text fields since 5.0; still available but discouraged |
Performance Impact and Risks
- doc_values: Aggregation queries benefit the most. As an illustrative figure, a terms aggregation over 1 million documents may run in less than half the time when it reads doc_values; the actual gain depends on cardinality, hardware, and cache state.
- fielddata: Heap consumption is the primary risk. Loading a high-cardinality text field (for example, 1 million documents in which fewer than 5% of values repeat) can take more than 2 GB of heap. fielddata is disabled by default on text fields, and the Elasticsearch documentation advises against enabling it in most cases.
Key Difference Summary
- doc_values is precomputed: Prepared during indexing and used directly during search, which suits sustained production workloads.
- fielddata is lazy-loaded: Built on the heap the first time a search needs it, which is workable for occasional ad hoc analysis but risky at scale.
Practical Example: Migrating from fielddata to doc_values
Step 1: Check Existing Indexes
First, verify whether fielddata is in use. The index mapping shows which fields enable it, and the cat fielddata API shows how much heap it currently occupies:

```json
GET /old_index/_mapping
GET /_cat/fielddata?v
```

In the mapping output, look for text fields that contain "fielddata": true.
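On clusters with many fields, reading mappings by eye is error-prone, so the check can be scripted. The sketch below assumes the typeless response shape of `GET /<index>/_mapping` in Elasticsearch 7.0+ (index name, then `mappings`, then `properties`); `find_fielddata_fields` is a hypothetical helper, and the sample response is made up for illustration.

```python
def find_fielddata_fields(mapping_response):
    """Return sorted 'index:field.path' entries for every field mapped with fielddata: true."""
    hits = []

    def walk(index, properties, prefix=""):
        for name, spec in properties.items():
            path = f"{prefix}{name}"
            if spec.get("fielddata") is True:
                hits.append(f"{index}:{path}")
            # Object and nested fields keep their children under "properties";
            # multi-fields keep theirs under "fields".
            walk(index, spec.get("properties", {}), f"{path}.")
            walk(index, spec.get("fields", {}), f"{path}.")

    for index, body in mapping_response.items():
        walk(index, body.get("mappings", {}).get("properties", {}))
    return sorted(hits)


# Made-up response for illustration only.
sample_mapping = {
    "old_index": {
        "mappings": {
            "properties": {
                "status": {"type": "keyword"},
                "text_field": {"type": "text", "fielddata": True},
                "author": {
                    "properties": {"bio": {"type": "text", "fielddata": True}}
                },
            }
        }
    }
}
flagged = find_fielddata_fields(sample_mapping)
```

Each flagged entry is a candidate for the mapping change described in the next step.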
Step 2: Rewrite Index Mappings
In the new index, rely on doc_values. keyword fields have it enabled by default, and a text field that must be sorted or aggregated gets a keyword sub-field instead of fielddata:

```json
PUT /new_index
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword"
      },
      "description": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}
```
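When many fields are affected, the new mapping can be derived from the old one. The following is a minimal sketch, assuming the goal is to drop fielddata from text fields and add a keyword sub-field in its place; `replace_fielddata_with_keyword` is a hypothetical helper, and the `ignore_above` value of 256 is an arbitrary choice for the example.

```python
import copy


def replace_fielddata_with_keyword(properties, ignore_above=256):
    """Return a copy of a `properties` block with fielddata removed and a keyword sub-field added."""
    result = copy.deepcopy(properties)
    for spec in result.values():
        # Recurse into object and nested fields.
        if "properties" in spec:
            spec["properties"] = replace_fielddata_with_keyword(
                spec["properties"], ignore_above
            )
        # Only text fields that had fielddata enabled get the sub-field.
        if spec.get("type") == "text" and spec.pop("fielddata", False):
            spec.setdefault("fields", {}).setdefault(
                "keyword", {"type": "keyword", "ignore_above": ignore_above}
            )
    return result


old_properties = {
    "status": {"type": "keyword"},
    "text_field": {"type": "text", "fielddata": True},
}
new_properties = replace_fielddata_with_keyword(old_properties)
```

The input is left unchanged, so the old and new mappings can be diffed before anything is sent to the cluster. Queries then need to target the sub-field (for example `text_field.keyword`) for sorting and aggregations.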
Step 3: Handle Old Indexes (with caution)
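For legacy data, use the reindex API to copy documents into the new index:

```json
POST /_reindex
{
  "source": { "index": "old_index" },
  "dest": { "index": "new_index" }
}
```

Important note: Before migration, execute GET /old_index/_mapping to confirm field types, and create new_index with its explicit mapping first so that reindexing does not fall back to dynamic mapping. Avoid setting doc_values: false on keyword or numeric fields that you sort or aggregate on, as it disables those operations for the field.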
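The order of the migration calls matters, so it can help to write the plan down as data before running anything. This sketch only builds the request descriptions and sends nothing; `migration_requests` is a hypothetical helper, and the index names are the placeholders used in this article.

```python
import json


def migration_requests(old_index, new_index, new_properties):
    """Return the migration calls in order as (method, path, body) tuples."""
    return [
        # 1. Inspect the existing field types.
        ("GET", f"/{old_index}/_mapping", None),
        # 2. Create the target index with an explicit mapping.
        ("PUT", f"/{new_index}", {"mappings": {"properties": new_properties}}),
        # 3. Copy the documents across.
        ("POST", "/_reindex", {
            "source": {"index": old_index},
            "dest": {"index": new_index},
        }),
    ]


plan = migration_requests(
    "old_index",
    "new_index",
    {"status": {"type": "keyword"}},
)
for method, path, body in plan:
    print(method, path, json.dumps(body) if body is not None else "")
```

Printing the plan gives a reviewable list that can be pasted into a console or fed to whichever HTTP client is in use.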
Step 4: Test Performance
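Compare query performance by running the same sort and aggregation against a doc_values-backed field. The request below targets the keyword field status; sorting or aggregating directly on a text field such as description is rejected unless fielddata is enabled:

```json
GET /new_index/_search
{
  "size": 10,
  "sort": [{ "status": { "order": "asc" } }],
  "aggs": {
    "top_terms": {
      "terms": { "field": "status", "size": 5 }
    }
  }
}
```

Observe the took value in each response. An improvement in the range of 3-5x over fielddata has been reported, but results vary with data and hardware, so measure on your own cluster.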
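A single measurement is noisy, so it is worth repeating each query several times and comparing medians. The sketch below assumes the `took` values (milliseconds, as reported in each search response) have already been collected; `speedup` is a hypothetical helper and the numbers are invented for illustration.

```python
from statistics import median


def speedup(baseline_took_ms, candidate_took_ms):
    """Ratio of median `took` values; above 1.0 means the candidate is faster."""
    return median(baseline_took_ms) / median(candidate_took_ms)


# Invented sample timings, in milliseconds.
fielddata_runs = [120, 100, 110]
doc_values_runs = [40, 50, 55]
ratio = speedup(fielddata_runs, doc_values_runs)
```

Running the first query of each series separately also helps, because the first fielddata request pays the one-off cost of building the in-memory structure.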
Recommendations and Best Practices
- Prioritize doc_values: In all new indexes, sort and aggregate on keyword, numeric, or date fields, and give text fields a keyword sub-field when they need those operations. fielddata is disabled by default on text fields, so no extra setting is needed to keep it off.
- Monitor memory: Use the _nodes/stats API to track fielddata memory usage:

  ```json
  GET /_nodes/stats/indices/fielddata?fields=*
  ```

  If high consumption is detected, migrate the affected fields.
- Avoid pitfalls:
  - If a keyword or numeric field is never sorted, aggregated, or read from scripts, setting doc_values: false saves disk space (the field remains searchable).
  - Do not try to enable fielddata on keyword fields; the parameter applies only to text fields, and keyword fields already have doc_values.
- Performance tuning:
  - If fielddata cannot be avoided, bound it with the indices.fielddata.cache.size setting and the fielddata circuit breaker (indices.breaker.fielddata.limit).
  - doc_values is compressed by default, which works especially well for highly repetitive data; no extra setting is required.
- Version upgrade recommendation: When moving to Elasticsearch 7.0+, remove fielddata: true from mappings wherever a keyword sub-field can replace it. The official documentation notes that fielddata can consume a lot of heap space and is disabled by default on text fields.
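The memory check from the monitoring recommendation can be automated. The sketch below assumes the node stats response nests per-node figures under `nodes`, then `indices`, then `fielddata`, with `memory_size_in_bytes` and `evictions` keys; `fielddata_by_node` is a hypothetical helper, and the sample response and threshold are invented.

```python
def fielddata_by_node(stats_response, limit_bytes):
    """Summarize fielddata heap usage per node, largest first, flagging nodes above the limit."""
    report = []
    for node in stats_response.get("nodes", {}).values():
        fielddata = node.get("indices", {}).get("fielddata", {})
        used = fielddata.get("memory_size_in_bytes", 0)
        report.append({
            "node": node.get("name"),
            "bytes": used,
            "evictions": fielddata.get("evictions", 0),
            "over_limit": used > limit_bytes,
        })
    return sorted(report, key=lambda row: row["bytes"], reverse=True)


# Invented response for illustration only.
sample_stats = {
    "nodes": {
        "a1": {"name": "node-1", "indices": {
            "fielddata": {"memory_size_in_bytes": 3000, "evictions": 2}}},
        "b2": {"name": "node-2", "indices": {
            "fielddata": {"memory_size_in_bytes": 100, "evictions": 0}}},
    }
}
report = fielddata_by_node(sample_stats, limit_bytes=1024)
```

A non-zero eviction count alongside high usage suggests the cache is under pressure, which is a good signal to prioritize that field for migration.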
Conclusion
The core difference between doc_values and fielddata lies in storage location and memory management: doc_values is an efficient structure precomputed at index time and well suited to production environments, while fielddata is built on the heap at search time, carries a high memory risk, and is disabled by default. Developers should prioritize doc_values and optimize Elasticsearch performance through index mappings, monitoring, and migration strategies. Relying on doc_values not only speeds up sorting and aggregations but also avoids severe memory issues. In actual projects, configure field storage based on data scale and query patterns to improve system robustness.