Elasticsearch supports multilingual text analysis through four main mechanisms:
1. Built-in Analyzers
Elasticsearch ships with preconfigured analyzers for many languages, including English, French, and Spanish, that handle language-specific tokenization, stop-word removal, and stemming. Every analyzer is assembled from the same three kinds of components: zero or more character filters, exactly one tokenizer, and zero or more token filters.
Example:
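To analyze Chinese content, set the smartcn analyzer as the index default. Note that smartcn is not bundled with Elasticsearch: it is provided by the analysis-smartcn plugin, which must be installed on every node first. For a plugin-free alternative, the built-in cjk analyzer indexes Chinese, Japanese, and Korean text as character bigrams.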
```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "smartcn"
        }
      }
    }
  }
}
```
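A quick way to see what a language analyzer does is the _analyze API, which runs text through an analyzer without indexing anything. The request below is a minimal sketch using the built-in english analyzer; the sample sentence is an assumption chosen for illustration.

```json
POST /_analyze
{
  "analyzer": "english",
  "text": "The quick foxes jumped over the lazy dogs"
}
```

The response lists each emitted token with its position and offsets, which makes it easy to confirm that stop words are dropped and the remaining words are reduced to their stems.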
2. Plugin Support
Elasticsearch's language analysis can be extended with plugins. For Chinese, Japanese, and Korean, install the corresponding official analysis plugin: analysis-smartcn (Chinese), analysis-kuromoji (Japanese), or analysis-nori (Korean). The analysis-icu plugin adds Unicode-aware analysis components based on the ICU libraries.
Example:
Install the Japanese analysis plugin, analysis-kuromoji, from the Elasticsearch home directory on every node, then restart each node:
```bash
./bin/elasticsearch-plugin install analysis-kuromoji
```
Then configure it in index settings:
```json
PUT /japanese_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "kuromoji"
        }
      }
    }
  }
}
```
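To confirm that the plugin is loaded before relying on it, the cat plugins API lists the plugins installed on each node, and _analyze can exercise the plugin's analyzer directly. Both requests are a sketch; the sample sentence (roughly "I like sushi") is an assumption.

```json
GET /_cat/plugins?v=true

POST /_analyze
{
  "analyzer": "kuromoji",
  "text": "寿司が好きです"
}
```

If the second request fails with an error about an unknown analyzer, the plugin is most likely missing on the node that handled the request.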
3. Custom Analyzers
If built-in analyzers and plugins do not meet specific requirements, you can define a custom analyzer in the index settings. Choosing the character filters, tokenizer, and token filters yourself gives precise control over how text is processed.
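Example: Create a custom analyzer with language-specific stopword handling. The stop filter compares each entry against whole tokens, so the list must match what the tokenizer emits. The standard tokenizer splits Chinese text into single characters, which means multi-character entries such as the ones below would never match; this example therefore uses the word-level smartcn_tokenizer, which requires the analysis-smartcn plugin.
```json
PUT /custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer",
          "filter": ["lowercase", "my_stopwords"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["例子", "因此"]
        }
      }
    }
  }
}
```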
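To check which tokens a custom analyzer emits, call _analyze against the index that defines it. This sketch assumes custom_index has been created with the settings above; the sample sentence (roughly "this is an example") is an assumption.

```json
POST /custom_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "这是一个例子"
}
```

Any token that exactly matches an entry in my_stopwords should be absent from the response, so this is a direct way to verify that the stopword list lines up with the tokenizer's output.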
4. Multi-field Support
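A single field can be indexed several ways through multi-fields: the same source text is analyzed by a different analyzer in each sub-field. In the mapping below, text uses the default standard analyzer, while text.english and text.french apply the English and French analyzers. A query can then target the sub-field that matches the language of the search, or several sub-fields at once.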
Example:
```json
PUT /multi_language_index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          },
          "french": {
            "type": "text",
            "analyzer": "french"
          }
        }
      }
    }
  }
}
```
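One way to search such a mapping is a multi_match query over the parent field and its sub-fields. The sketch below indexes one document and then searches it; the document content and query string are assumptions for illustration, and most_fields is one reasonable scoring choice rather than the only one.

```json
POST /multi_language_index/_doc?refresh=true
{
  "text": "The children play in the gardens"
}

GET /multi_language_index/_search
{
  "query": {
    "multi_match": {
      "query": "garden",
      "type": "most_fields",
      "fields": ["text", "text.english", "text.french"]
    }
  }
}
```

Here the query term garden differs from the indexed word gardens, so the unstemmed parent field does not match it, while text.english does because the english analyzer stems both forms to the same term.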
In summary, Elasticsearch supports multilingual text analysis and search through built-in language analyzers, analysis plugins, custom analyzers, and multi-fields, all of which can be combined within a single index.