What Are the Field Types in Elasticsearch, and How Do You Choose the Right Ones? - Interview Questions

1. Elasticsearch Field Types Overview

In Elasticsearch, a distributed search and analytics engine, field type choices directly affect indexing performance, query efficiency, and data accuracy. A poorly chosen type can lead to unexpected tokenization, failed aggregations, or wasted storage. This article walks through the core field types and gives practical selection guidelines based on real-world scenarios, to help developers build efficient and reliable search applications.

1.1 Common Field Types

Elasticsearch provides rich built-in types, categorized as follows:

Core text types: text (for full-text search) and keyword (for exact matching) are fundamental for handling text data.
Numeric types: integer, long, float, double for numerical calculations.
Boolean type: boolean for binary values.
Date-time type: date for time-series analysis.
Special types: ip (IP addresses), object (JSON objects), nested (arrays of objects whose elements are queried independently), etc.

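Note: When a string field is created through dynamic mapping, Elasticsearch maps it as text with a keyword subfield (ignore_above: 256). This is not specific to 8.0, and it does not apply to explicit mappings: a field declared as text has no keyword subfield unless you add one under fields. Declaring only the types you need avoids indexing every string twice.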
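To make the text-plus-keyword combination concrete, the sketch below builds the mapping that dynamic mapping typically generates for a new string field, next to a leaner explicit alternative. The field name title is only an example, and the code just prints JSON; it does not contact a cluster.

```python
import json

# Mapping that dynamic mapping typically generates for a previously unseen
# string field: a text field with a keyword sub-field capped at 256 characters.
dynamic_string_mapping = {
    "type": "text",
    "fields": {
        "keyword": {"type": "keyword", "ignore_above": 256}
    },
}

# Explicit alternative for a field that is only filtered or aggregated:
# a single keyword field, so no analyzed copy is indexed.
explicit_keyword_mapping = {"type": "keyword"}

print(json.dumps({"title": dynamic_string_mapping}, indent=2))
```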

1.2 Detailed Type Explanations

Text Type

Purpose: Full-text search, such as for article titles or content.
Characteristics: Analyzed into tokens at index time; supports full-text queries such as match, but is not suited to exact-value matching, sorting, or aggregations.
Example:

json
"title": {
  "type": "text",
  "analyzer": "standard"
}

Best practices: Use text only where full-text search is needed. Avoid term queries on text fields: a term query is not analyzed, so its value is compared against the analyzed tokens and often matches nothing.
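The difference between analyzed and non-analyzed lookups can be imitated in a few lines of Python. This is a deliberately simplified model, not Elasticsearch code: simple_analyze is a hypothetical stand-in for the standard analyzer that only splits on non-alphanumeric characters and lowercases.

```python
import re

def simple_analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

# Tokens that would end up in the inverted index for this title.
indexed_tokens = set(simple_analyze("Elasticsearch Field Types"))

def match_hits(query_text):
    # A match query analyzes the query text before looking up tokens.
    return any(token in indexed_tokens for token in simple_analyze(query_text))

def term_hits(query_value):
    # A term query looks up the value exactly as given, with no analysis.
    return query_value in indexed_tokens

print(sorted(indexed_tokens))       # ['elasticsearch', 'field', 'types']
print(match_hits("Elasticsearch"))  # True
print(term_hits("Elasticsearch"))   # False
```

The term lookup fails only because of the capital E; nothing goes wrong at tokenization time.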

Keyword Type

Purpose: Exact matching, such as filtering status or aggregating tags.
Characteristics: Not tokenized, preserves original values, supports term queries and aggregations.
Example:

json
"status": {
  "type": "keyword",
  "ignore_above": 256
}

Best practices: For fields requiring exact matching (e.g., status codes), use keyword. For example:

json
"user_id": {
  "type": "keyword"
}

Avoid mapping user_id as text: identifiers are matched whole, so analysis adds cost and can split the value into separate tokens.
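The ignore_above setting used in the status example is easy to overlook: strings longer than the limit are skipped at index time, so they cannot be found by term queries or counted in aggregations. The sketch below mimics that rule in plain Python and then builds a request body that filters and aggregates on the same keyword field. The helper is_indexed and the aggregation name by_status are made up for illustration.

```python
def is_indexed(value, ignore_above=256):
    """Mimic ignore_above: strings longer than the limit are skipped at index time."""
    return len(value) <= ignore_above

short_status = "published"
long_value = "x" * 300

print(is_indexed(short_status))  # True
print(is_indexed(long_value))    # False

# Request body combining an exact filter with a terms aggregation on a keyword field.
search_body = {
    "query": {"term": {"status": {"value": short_status}}},
    "aggs": {"by_status": {"terms": {"field": "status"}}},
}
```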

Numeric Types

integer/long: Integers, such as age.
float/double: Floating-point numbers, such as price.
Example:

json
"price": {
  "type": "float"
}

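Best practices: Choose the smallest numeric type that covers the value range (for example integer rather than long for age). Numeric mappings do not accept a format parameter. For currency, consider scaled_float with a scaling_factor such as 100 to avoid floating-point rounding. Avoid storing numbers in text fields.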
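Why precision matters for prices can be shown without a cluster. The following sketch round-trips a value through 32-bit precision, which is what a float field works with, and compares it with the integer that a scaled_float field with a scaling factor of 100 would hold. The helpers as_float32 and to_scaled are illustrative approximations, not Elasticsearch APIs.

```python
import struct

def as_float32(value):
    """Round-trip a Python float through 32-bit precision, as a float field indexes it."""
    return struct.unpack("f", struct.pack("f", value))[0]

def to_scaled(value, scaling_factor=100):
    """Approximate scaled_float: multiply by the factor and round to the nearest integer."""
    return round(value * scaling_factor)

price = 19.99
print(as_float32(price) == price)  # False: the 32-bit value is only close to 19.99
print(to_scaled(price))            # 1999

# Mapping fragment for a price stored as an integer number of cents.
price_mapping = {"type": "scaled_float", "scaling_factor": 100}
```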

Date Type

Purpose: Date-time values, such as log timestamps.
Characteristics: Automatically parses date strings, supports time-range queries.
Example:

json
"created_at": {
  "type": "date",
  "format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
}

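Best practices: Specify format when your input differs from the default (strict_date_optional_time||epoch_millis); values that match none of the declared formats are rejected at index time. The pattern above expects a numeric offset such as +0000 at the end of the value. Map created_at as date, not text, or range queries and date histograms will not work on it.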
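As a small sketch of how a date field is typically used, the code below declares a mapping that accepts two built-in formats and builds a time-range query with date math. The helper date_range_query is hypothetical; the bodies it produces follow the standard range query shape.

```python
date_mapping = {
    "type": "date",
    # Accept ISO 8601 strings or epoch milliseconds; "||" separates alternatives.
    "format": "strict_date_optional_time||epoch_millis",
}

def date_range_query(field, gte, lt):
    """Build a range query body; bounds may be date strings or date math such as now-7d/d."""
    return {"query": {"range": {field: {"gte": gte, "lt": lt}}}}

# Documents created during the seven full days before today.
last_week = date_range_query("created_at", "now-7d/d", "now/d")
print(last_week)
```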

IP Type

Purpose: IP addresses, such as user access sources.
Characteristics: Automatically parses IP addresses, supports network range queries.
Example:

json
"ip_address": {
  "type": "ip"
}

Best practices: Use ip for any field that holds IPv4 or IPv6 addresses. Avoid text or keyword for IP filtering: string fields cannot evaluate CIDR ranges such as 192.168.0.0/16.
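What a network range query means can be illustrated with Python's standard ipaddress module, which applies the same CIDR membership logic locally. The helper in_cidr is made up for this sketch; the last statement shows the query body form that an ip field accepts with CIDR notation.

```python
import ipaddress

def in_cidr(address, cidr):
    """Return True if the address belongs to the given CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)

print(in_cidr("192.168.1.10", "192.168.0.0/16"))  # True
print(in_cidr("10.0.0.5", "192.168.0.0/16"))      # False

# On an ip field, a term query accepts CIDR notation directly.
cidr_query = {"query": {"term": {"ip_address": "192.168.0.0/16"}}}
```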

Nested Type

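Purpose: Arrays of objects whose elements must be queried independently, such as product tags with several attributes each.
Characteristics: Each array element is indexed as a separate hidden document, so conditions in a nested query are evaluated per element instead of across the flattened array.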
Example:

json
"tags": {
  "type": "nested",
  "properties": {
    "name": { "type": "keyword" }
  }
}

Best practices: Use nested when each array element must be matched as a unit, and search it with a nested query. For example:

json
"tags": {
  "type": "nested",
  "properties": {
    "tag_name": { "type": "keyword" }
  }
}

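Use the plain object type only when sub-fields never need to be matched together: object flattens arrays of objects, so a query can match values taken from different elements. nested avoids this at the cost of extra hidden documents.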
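The flattening problem is easier to see in a toy model. The sketch below imitates how an object field stores an array of objects (one list of values per sub-field) and contrasts it with per-element matching as done for nested. The sub-field source and all function names are hypothetical and exist only for this illustration.

```python
def flatten(objects):
    """Mimic an object field: an array of objects becomes one value list per sub-field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat

def object_match(objects, conditions):
    # Each condition is checked against the flattened lists,
    # so the matching values may come from different elements.
    flat = flatten(objects)
    return all(value in flat.get(key, []) for key, value in conditions.items())

def nested_match(objects, conditions):
    # Every condition must hold within the same array element.
    return any(all(obj.get(k) == v for k, v in conditions.items()) for obj in objects)

tags = [
    {"name": "search", "source": "editor"},
    {"name": "database", "source": "user"},
]
wanted = {"name": "search", "source": "user"}

print(object_match(tags, wanted))  # True: a false positive across two elements
print(nested_match(tags, wanted))  # False: no single tag satisfies both conditions
```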

[Figure: Elasticsearch field type selection decision tree]

1.3 How to Choose the Right Field Types

Select field types based on these principles, considering real-world scenarios:

Query requirements first:
- Full-text search: Use text type (e.g., title field).
- Exact matching: Use keyword type (e.g., status field).
- Numeric ranges: Use numeric types (e.g., price field).
- Date filtering: Use date type (e.g., created_at field).
Analysis requirements:
- Aggregation operations: Use keyword, numeric, or date fields. text fields cannot be aggregated or sorted by default because fielddata is disabled on them; aggregating status, for example, requires keyword.
- Text analysis: Use text for tokenization; use keyword to preserve original values.
Storage efficiency:
- text fields store analyzed terms plus positions, which typically costs more for large bodies of text; keyword fields are compact for short, low-cardinality values.
- For fields that are only filtered on (IDs, codes, flags), prefer keyword: it skips analysis at index time and exact lookups on it are cheap.
Avoid common pitfalls:
- Incorrect example: Executing term query on text field:

json
"query": {
  "term": {
    "title": { "value": "Elasticsearch" }
  }
}

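This does not raise an error, but it usually returns no hits: the standard analyzer indexed the token as elasticsearch (lowercase), while the term query looks for Elasticsearch exactly as written.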

Correct approach: Use a match query on the text field, or add a keyword subfield (for example title.keyword) and run the term query against that subfield.
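A minimal sketch of that approach, written as Python dictionaries that serialize to the JSON request bodies: the title field gets a keyword subfield, full-text search goes to title, and exact matching goes to title.keyword. The subfield name keyword and the sample title are assumptions chosen for illustration.

```python
import json

# text field for full-text search, plus a keyword sub-field for exact matching.
title_mapping = {
    "type": "text",
    "analyzer": "standard",
    "fields": {
        "keyword": {"type": "keyword", "ignore_above": 256}
    },
}

# Analyzed query against the text field: case and word order do not matter.
full_text_query = {"query": {"match": {"title": "elasticsearch types"}}}

# Non-analyzed query against the sub-field: the whole original value must match.
exact_query = {
    "query": {"term": {"title.keyword": {"value": "Elasticsearch Field Types"}}}
}

print(json.dumps({"mappings": {"properties": {"title": title_mapping}}}, indent=2))
```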

Code Example: Index Mapping Design

Here is a practical index mapping example showing correct mixed-type field selection:

json
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "status": {
        "type": "keyword",
        "ignore_above": 256
      },
      "price": {
        "type": "float"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
      },
      "ip_address": {
        "type": "ip"
      },
      "tags": {
        "type": "nested",
        "properties": {
          "name": { "type": "keyword" }
        }
      }
    }
  }
}

Best practices summary:

Declare field types explicitly rather than relying on dynamic mapping, which guesses each type from the first value it sees.
Match type to use case: text for search, keyword for exact matches.
Optimize for performance: Use keyword subfields for text fields when needed.
Validate with real data: Test queries to ensure correct type handling.
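The selection principles from section 1.3 can be condensed into a small helper. This is only one possible encoding of the guidelines, with hypothetical names and categories, and it is not an exhaustive rule set; real mappings should still be checked against actual queries.

```python
def suggest_field_type(value_kind, full_text=False, correlated_subfields=False):
    """Suggest a mapping type from the kind of value and how it will be queried."""
    if value_kind == "string":
        # Full-text search needs analysis; everything else is matched exactly.
        return "text" if full_text else "keyword"
    if value_kind == "object_array":
        # nested only pays off when sub-fields must match within one element.
        return "nested" if correlated_subfields else "object"
    simple = {
        "integer": "integer",
        "decimal": "float",
        "boolean": "boolean",
        "date": "date",
        "ip": "ip",
    }
    if value_kind not in simple:
        raise ValueError(f"unknown value kind: {value_kind}")
    return simple[value_kind]

print(suggest_field_type("string", full_text=True))                    # text
print(suggest_field_type("string"))                                    # keyword
print(suggest_field_type("object_array", correlated_subfields=True))   # nested
```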