Elasticsearch: How Indexing and Mapping Work

In modern IT architectures, Elasticsearch is widely used for log analysis, full-text search, and real-time data processing. Indexing and mapping form the foundation of its data model. An index is roughly comparable to a table in a traditional database, but it is implemented as distributed storage through shards and replicas. A mapping is analogous to a database schema: it describes how each field is stored and indexed. A misconfigured mapping can lead to degraded query performance, rejected documents, or searches that fail to find data that was indexed. This article is based on Elasticsearch 8.x and combines the official documentation with practical examples.

Basic Concepts of Indexing

An index is a data container in Elasticsearch, composed of one or more primary shards (Shard), each of which is an independent Lucene index. Shards enable horizontal scaling, while replicas (Replica) provide high availability. When a document is written, Elasticsearch:

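Routes the document to a primary shard by hashing its routing value (the document ID by default) modulo the number of primary shards; the shards themselves are spread across the cluster's nodes.
Adds the document to that shard's inverted index (Inverted Index) to facilitate rapid retrieval.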
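The routing step can be illustrated with a small sketch. It is a simplified stand-in rather than Elasticsearch's implementation: `zlib.crc32` replaces the Murmur3 hash that Elasticsearch uses internally, and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value (the document ID by default) to a primary shard number."""
    # Assumption: crc32 is only a stand-in for Elasticsearch's Murmur3 hash.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % num_primary_shards


doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6"]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}

for doc_id, shard in placement.items():
    print(f"{doc_id} -> shard {shard}")
```

Because the result depends on the number of primary shards, that number cannot simply be changed on an existing index; doing so would send existing document IDs to different shards.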

Key Features: An index name is a logical namespace (e.g., products), but the index may physically span multiple nodes. For example, an index with 5 primary shards can be distributed across 5 nodes, with each primary shard configured to have 2 replicas.

Role of Shards: Scale storage and query load horizontally. For instance, on a 100GB dataset, the shard count determines how many shards can work on a query in parallel.
Role of Replicas: Provide data redundancy and boost read throughput. With 3 nodes and a replica count of 1, read requests can be served by either primary or replica shards.
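The arithmetic behind primaries and replicas is easy to get wrong, so here is a minimal sketch. The function names are hypothetical; the only rule assumed is that Elasticsearch does not place two copies of the same shard on one node.

```python
def total_shard_copies(primaries: int, replicas_per_primary: int) -> int:
    """Each primary shard plus all of its replicas."""
    return primaries * (1 + replicas_per_primary)


def min_nodes_for_full_allocation(replicas_per_primary: int) -> int:
    """Copies of the same shard must live on different nodes."""
    return 1 + replicas_per_primary


# The example from the text: 5 primary shards, 2 replicas each.
copies = total_shard_copies(primaries=5, replicas_per_primary=2)
nodes_needed = min_nodes_for_full_allocation(replicas_per_primary=2)
print(f"{copies} shard copies, at least {nodes_needed} nodes to allocate them all")
```

With fewer nodes than `nodes_needed`, some replicas stay unassigned and the index cannot be fully allocated.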

Elasticsearch allocates shards and replicas automatically when an index is created (since 7.0 the defaults are 1 primary shard and 1 replica). For large datasets, plan shard size carefully (official guidance suggests roughly 10-50GB per shard) to avoid the overhead of too many small shards.

Basic Concepts of Mapping

Mapping defines the metadata of fields in an index, including data types, analyzers, nested structures, etc. It comes in two modes:

Dynamic Mapping: Elasticsearch automatically infers field types (e.g., text or date), suitable for rapid prototyping.
Explicit Mapping: Manually define field rules to avoid dynamic inference errors.
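To make the inference rules concrete, the sketch below mimics a simplified subset of the default dynamic mapping behavior. It is an approximation under stated assumptions: real date detection accepts more formats than the single `yyyy-MM-dd` pattern checked here, arrays and nulls are not handled, and `infer_field_type` is a hypothetical helper.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_field_type(value) -> str:
    """Return a simplified guess of the type dynamic mapping would assign."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        if ISO_DATE.match(value):
            try:
                datetime.strptime(value, "%Y-%m-%d")
                return "date"
            except ValueError:
                pass  # looks like a date but is not a valid one
        return "text (+ keyword sub-field)"
    return "unknown"


doc = {
    "name": "Laptop Computer",
    "price": "19.99",  # a number sent as a string
    "stock": 12,
    "created_at": "2024-01-15",
}
inferred = {field: infer_field_type(value) for field, value in doc.items()}
for field, field_type in inferred.items():
    print(f"{field}: {field_type}")
```

Note how `price` ends up as text because it arrived as a string; this is the kind of inference error that an explicit mapping prevents.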

Core Elements:

Data Types: text (for full-text search), keyword (for exact matching), date (for timestamps), etc.
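Analyzers: Determine how text is tokenized. For example, the standard analyzer (the default) splits text on word boundaries and lowercases it, while language analyzers such as english also apply stemming (8.x provides Snowball stemming as a token filter rather than as a built-in analyzer).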
Nested Objects: Handle complex structures, such as product lists in orders.
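The effect of analyzer choice can be sketched with rough Python approximations. These functions are not the real analyzers: `standard_like` only imitates the word-boundary splitting and lowercasing of the standard analyzer with a regular expression, and all three names are hypothetical.

```python
import re


def standard_like(text: str) -> list[str]:
    """Approximation of the standard analyzer: split on non-word characters, lowercase."""
    return re.findall(r"\w+", text.lower())


def whitespace_like(text: str) -> list[str]:
    """Approximation of the whitespace analyzer: split on whitespace only, keep case."""
    return text.split()


def keyword_like(text: str) -> list[str]:
    """A keyword field is not analyzed: the whole value is a single term."""
    return [text]


sample = "Laptop Computer, 15-inch"
for name, analyzer in [
    ("standard", standard_like),
    ("whitespace", whitespace_like),
    ("keyword", keyword_like),
]:
    print(f"{name}: {analyzer(sample)}")
```

The same input yields three different sets of terms, which is why the analyzer chosen in the mapping decides what a query can match.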

Mapping configuration directly affects query efficiency. Incorrect configuration may lead to:

Text fields mapped as keyword, so that full-text search only matches the exact, complete value.
Date values that do not match the mapped format, causing documents to be rejected at index time or queries to fail.

For example, an explicit mapping can be defined as:

json
{
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "standard" },
      "price": { "type": "float" },
      "created_at": { "type": "date", "format": "yyyy-MM-dd" }
    }
  }
}
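The `format` on `created_at` means that date values in other layouts are rejected. A minimal client-side check can catch this before indexing. The sketch assumes that the mapping's `yyyy-MM-dd` pattern corresponds to Python's `%Y-%m-%d`; `matches_date_format` is a hypothetical helper and is not as strict as Elasticsearch's own parser.

```python
from datetime import datetime


def matches_date_format(value: str, strptime_pattern: str = "%Y-%m-%d") -> bool:
    """Return True if the value parses with the given pattern."""
    # Assumption: "%Y-%m-%d" stands in for the mapping's "yyyy-MM-dd" format.
    try:
        datetime.strptime(value, strptime_pattern)
        return True
    except ValueError:
        return False


candidates = ["2024-01-15", "15/01/2024", "2024-13-01"]
results = {value: matches_date_format(value) for value in candidates}
for value, ok in results.items():
    print(f"{value!r}: {'accepted' if ok else 'would be rejected'}")
```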

How Indexing and Mapping Work Together

Indexing and mapping collaborate closely: when documents are indexed, Elasticsearch parses fields based on mapping to build an inverted index. The process includes:

Data Ingestion: Documents are sent to the cluster as JSON, typically via POST /<index>/_doc, PUT /<index>/_doc/<id>, or the _bulk API.
Mapping Application: Elasticsearch processes each field according to its mapping rules:
- Text fields are tokenized by analyzers (e.g., a name value of "Laptop Computer" is split into the terms laptop and computer).
- Numeric fields are indexed as numbers, without analysis.
Index Construction: Each shard writes the resulting terms into its Lucene index, forming an inverted index structure (term → document ID list).
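The term-to-document structure described above can be sketched in a few lines. This is a toy model, not Lucene: `analyze` is a rough regular-expression stand-in for the standard analyzer, and the sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Rough stand-in for the standard analyzer: split on non-word characters, lowercase."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map every term to the sorted list of document IDs that contain it."""
    index: dict[str, list[int]] = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(analyze(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


docs = {1: "Laptop Computer", 2: "Gaming Laptop", 3: "Desktop Computer"}
inverted = build_inverted_index(docs)

for term in sorted(inverted):
    print(f"{term} -> {inverted[term]}")
```

Looking up a term is then a single dictionary access, which is the reason an inverted index makes full-text retrieval fast.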

Key Mechanisms:

Dynamic Mapping Risks: Dynamic mapping infers a field's type from the first value it sees. A number sent as a JSON string (e.g., "19.99") is mapped as text with a keyword sub-field rather than as a numeric type, which wastes index space and breaks numeric queries. An explicit mapping enforces the intended type.
Index Lifecycle: Mapping defines how documents are processed, while the index manages storage and queries. For example, a match query sent to GET /products/_search analyzes the query text with the field's search analyzer, which defaults to the analyzer defined in the mapping.

Here is a simplified workflow diagram:

[Figure: Elasticsearch Indexing and Mapping Workflow]

Practical Examples

Creating Index and Mapping

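Send the following request to define the mapping explicitly. It is shown in Kibana Dev Tools Console syntax; with curl, send the same JSON body using -X PUT and a Content-Type: application/json header: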

bash
# Create index with specified mapping
PUT /products
{
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "standard" },
      "price": { "type": "float" },
      "tags": { "type": "keyword" }
    }
  }
}

Output Verification:

json
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "products"
}
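The same create-index request can also be prepared from a script. The sketch below only builds the HTTP request with the Python standard library and does not send it. The base URL `http://localhost:9200` and the helper name `build_create_index_request` are assumptions; an 8.x cluster with default security settings would additionally need HTTPS and credentials, which are left out here.

```python
import json
import urllib.request


def build_create_index_request(
    base_url: str, index: str, mappings: dict
) -> urllib.request.Request:
    """Build (but do not send) a PUT request that creates an index with a mapping."""
    body = json.dumps({"mappings": mappings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


mappings = {
    "properties": {
        "name": {"type": "text", "analyzer": "standard"},
        "price": {"type": "float"},
        "tags": {"type": "keyword"},
    }
}

# Assumption: a cluster reachable at this address; adjust for your environment.
request = build_create_index_request("http://localhost:9200", "products", mappings)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it; that call is omitted so the sketch runs without a cluster.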

Query Example

Execute full-text search:

bash
GET /products/_search
{
  "query": {
    "match": {
      "name": "laptop"
    }
  }
}

Result Analysis:

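Because the name field uses the standard analyzer in the mapping, both the indexed text and the query text are tokenized and lowercased, so the query matches documents whose name contains the term laptop.
If the mapping is wrong (e.g., name mapped as keyword), the query only matches documents whose entire name value is exactly laptop, so full-text search fails.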
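The difference between the two outcomes can be simulated with a toy matcher. This is a sketch of the principle only: `analyze_standard` approximates the standard analyzer with a regular expression, and `match_hits` is a hypothetical helper that models a match query with its default OR behavior.

```python
import re


def analyze_standard(text: str) -> list[str]:
    """Rough stand-in for the standard analyzer."""
    return re.findall(r"\w+", text.lower())


def match_hits(field_type: str, stored_value: str, query: str) -> bool:
    """Simulate whether a match query finds a document, depending on the field type."""
    if field_type == "text":
        # Both sides are analyzed; any shared term counts as a hit.
        indexed_terms = set(analyze_standard(stored_value))
        return any(term in indexed_terms for term in analyze_standard(query))
    if field_type == "keyword":
        # No analysis: the whole stored value must equal the query exactly.
        return stored_value == query
    raise ValueError(f"unsupported field type: {field_type}")


stored = "Laptop Computer"
print("text field, query 'laptop':   ", match_hits("text", stored, "laptop"))
print("keyword field, query 'laptop':", match_hits("keyword", stored, "laptop"))
print("keyword field, exact value:   ", match_hits("keyword", stored, "Laptop Computer"))
```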

Optimization Practices

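Avoid Dynamic Mapping: Define fields explicitly when creating the index. After creation, PUT /products/_mapping can add new fields, but the type of an existing field cannot be changed without reindexing. Setting "dynamic": "strict" in the mapping rejects documents that contain unmapped fields, preventing unintended type inference.
Type Optimization:
- Text fields: Use the text type with a suitable analyzer (e.g., whitespace for space-separated tokens).
- Numeric fields: Map them as numeric types rather than text so that range queries and aggregations behave as expected.
Shard Strategy: Choose the shard count based on data volume. For example, 3-5 primary shards suit a 100GB dataset, keeping each shard at roughly 20-35GB and avoiding oversized shards.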
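The shard-count rule of thumb reduces to a single division. The sketch below is a back-of-the-envelope helper, not an official sizing formula; `primary_shard_count` and the target sizes passed to it are assumptions to be tuned per workload.

```python
import math


def primary_shard_count(dataset_gb: float, target_shard_gb: float) -> int:
    """Smallest number of primary shards that keeps each shard at or below the target size."""
    return max(1, math.ceil(dataset_gb / target_shard_gb))


# A 100GB dataset with different target shard sizes.
for target in (20, 25, 35):
    count = primary_shard_count(100, target)
    print(f"target {target}GB per shard -> {count} primary shards")
```

For the 100GB example, targets between 20GB and 35GB give the 3-5 shards mentioned above.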

Common Issues and Best Practices

Common Pitfalls

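Mapping Conflicts: Dynamic mapping may produce unintended field types. For example, if price is first sent as a string, it is mapped as text, and range queries then compare values as strings rather than as numbers. Conversely, once a field is mapped as a number, documents with non-numeric values for it are rejected.
Inappropriate Analyzer Selection: The standard analyzer splits Chinese text into single characters, which gives poor search relevance. Chinese text is better served by a dedicated analyzer such as ik_max_word, which is provided by the third-party IK analysis plugin and must be installed separately.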
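Why a range query misbehaves on a price field mapped as text can be shown with plain string comparison. This sketch only illustrates the ordering problem; it does not reproduce how Elasticsearch executes range queries, and the sample prices are made up.

```python
prices_as_text = ["9", "20", "100", "250"]
prices_as_numbers = [9, 20, 100, 250]

# "Range 10 to 50" evaluated as strings (lexicographic order) ...
in_range_text = [p for p in prices_as_text if "10" <= p <= "50"]
# ... and as numbers (numeric order).
in_range_numbers = [p for p in prices_as_numbers if 10 <= p <= 50]

print("string comparison: ", in_range_text)
print("numeric comparison:", in_range_numbers)
```

String ordering compares character by character, so "100" and "250" sort before "50" and end up inside the range.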

Best Practices

Explicitly Define Mapping: Specify all fields during index creation to avoid dynamic inference. Refer to Elasticsearch Official Documentation.
Use Field Aliases: Create aliases for fields with the alias field type (e.g., title aliased as post_title) to simplify queries.
Monitor Mapping: Inspect the current mapping of an index via the _mapping API:

bash
GET /products/_mapping

Performance Tuning:
- For fields used in exact-match filters, sorting, or aggregations, use the keyword type instead of text.
- Base the shard count on data volume and node count (for example, 3-5 primary shards on a 3-5 node cluster) so that shards spread evenly across nodes.

Performance Recommendations

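Index Optimization: Avoid indexing large bodies of text (e.g., description) that are never searched, since analysis cost and index size grow with text length. Setting "index": false on such a field keeps it in _source without indexing it.
Error Handling: If a mapping request or a document conflicts with the mapping, Elasticsearch returns 400 Bad Request; check the error object in the response (its type and reason fields) for details.
Production Environment: Test the mapping configuration with small datasets before deployment, and predefine mappings with composable index templates (PUT /_index_template), which supersede the deprecated legacy PUT /_template API.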

Conclusion

Elasticsearch indexing and mapping are the foundation for building efficient search systems. An index organizes data into shards, while a mapping defines the field rules; together they determine query performance. With explicit mappings, sensible sharding, and proper analyzer selection, developers can avoid common pitfalls and improve application reliability. Prioritize explicit mapping and use monitoring tools such as Kibana for continuous optimization. A solid understanding of these mechanisms supports log analysis, real-time search, and large-scale data processing. Remember: mapping configuration is the critical starting point, not the endpoint.