
What is the difference between Lucene and Elasticsearch?

3 answers


Lucene and Elasticsearch differ primarily in where they sit in the search technology stack. Lucene is an open-source full-text search library used for building search engines. Elasticsearch is an open-source search and analytics engine built on top of Lucene; it provides a distributed, multitenant-capable full-text search solution with an HTTP web interface and schema-free JSON documents.

Below are the key differences between Lucene and Elasticsearch:

Lucene:

  1. Core Search Library: Lucene is a Java library offering low-level APIs for full-text search functionality. It is not a complete search engine but rather a tool for developers to construct search engines.

  2. Core Technologies: It handles fundamental operations such as index creation, query parsing, and search execution.

  3. Development Complexity: Using Lucene requires deep expertise in indexing structures and search algorithms, as developers must write extensive code to manage indexing, querying, and ranking of search results.

  4. Distributed Capabilities: Lucene does not natively support distributed search; developers must implement this functionality themselves.

  5. APIs: Lucene is exposed primarily through Java APIs, so non-Java environments need a wrapper, port, or bridging layer.

Elasticsearch:

  1. Complete Search Engine: Elasticsearch is a near-real-time distributed search and analytics engine that is ready for production deployment.

  2. Built on Lucene: Elasticsearch leverages Lucene at the low level for indexing and searching but provides a user-friendly RESTful API, enabling developers to index and query data using JSON.

  3. Simplified Operations: Elasticsearch streamlines the complex process of building search engines by offering ready-to-use solutions, including cluster management, data analysis, and monitoring.

  4. Distributed Architecture: Elasticsearch natively supports distributed, horizontally scalable deployments, and a cluster can grow to hold very large datasets, up to the petabyte range.

  5. Multi-language Clients: Elasticsearch provides clients in multiple languages, facilitating seamless integration and usage across diverse development environments.

Practical Application:

Suppose we are developing a search feature for a website:

  • If using Lucene, we must define the data model, build the indexes, handle search queries, implement ranking, and manage highlighting ourselves, and then integrate all of this into the website. This demands considerable expertise, because developers need deep Lucene knowledge and must handle the low-level details.

  • If using Elasticsearch, we can directly index article content via HTTP requests. When a user enters a query in the search box, we send an HTTP request to Elasticsearch, which processes the query and returns well-formatted JSON results, including top-ranked documents and highlighted search terms. This significantly simplifies the development and maintenance of the search system.
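The Elasticsearch side of this example can be sketched in Python using only the standard library. This is a minimal illustration, not a recommended client setup: the cluster address `http://localhost:9200`, the index name `articles`, and the field names `title` and `content` are all assumptions made up for the sketch. The functions only build the HTTP requests; `send` is the one that would actually contact a running cluster.

```python
import json
import urllib.request

# Assumptions for illustration only: a local single-node cluster and a
# hypothetical index named "articles" with a text field called "content".
ES_URL = "http://localhost:9200"
INDEX = "articles"


def build_index_request(doc_id, title, content):
    """Build (but do not send) a request that indexes one article."""
    body = json.dumps({"title": title, "content": content}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_search_request(user_query):
    """Build (but do not send) a full-text search with highlighting."""
    body = json.dumps({
        "query": {"match": {"content": user_query}},
        "highlight": {"fields": {"content": {}}},
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a cluster running, `send(build_search_request("cake"))` would return a JSON object whose matching documents sit under `hits.hits`, each carrying its `_source` and, when highlighting applies, a `highlight` section. In practice one of the official language clients would usually replace this hand-built HTTP layer.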

June 29, 2024, 12:07

Elasticsearch Index vs Lucene Index.

An Elasticsearch index is a collection of documents, similar to how a relational database is composed of tables.

To achieve scalability, we distribute the Elasticsearch index across multiple physical nodes/servers.

For this purpose, we split the Elasticsearch index into smaller units called shards.

Question: How is it related to the Lucene index?

If we want to search for a specific term (e.g., 'Cake' or 'Cookie'), we must check every shard for it (let's not consider the placement and replication of shards across nodes for now).

This operation can be time-consuming, so we need an efficient data structure for the search—this is where the Lucene index comes into play.

Each Elasticsearch shard is based on the Lucene index structure and stores statistical information about terms to make term-based searches more efficient.
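To make the shard idea concrete, here is a toy sketch in Python. It is only an illustration under stated assumptions: the shard count of 3 is arbitrary, `zlib.crc32` is a stand-in hash chosen for the sketch (it is not the function Elasticsearch itself uses), and each "shard" is just a dictionary from term to document ids rather than a real Lucene index. What it does show is the shape of the work: documents are spread over shards, and a term search asks every shard and merges the answers.

```python
import zlib

NUM_SHARDS = 3  # hypothetical shard count for this toy index


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the document id.

    crc32 is a stand-in chosen for this sketch; it is not the hash
    function Elasticsearch itself uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_documents(docs, num_shards=NUM_SHARDS):
    """Distribute {doc_id: text} across shards, each with its own term lookup."""
    shards = [dict() for _ in range(num_shards)]  # term -> set of doc ids
    for doc_id, text in docs.items():
        shard = shards[shard_for(doc_id, num_shards)]
        for term in text.lower().split():
            shard.setdefault(term, set()).add(doc_id)
    return shards


def search(shards, term):
    """Ask every shard for the term and merge the per-shard results."""
    hits = set()
    for shard in shards:
        hits |= shard.get(term.lower(), set())
    return sorted(hits)


docs = {
    "doc_id_1": "cake cookie cake cookie",
    "doc_id_6": "cookie",
    "doc_id_8": "cake cake",
    "doc_id_12": "spaghetti",
}
shards = index_documents(docs)
print(search(shards, "Cake"))  # ['doc_id_1', 'doc_id_8']
```

Because each shard keeps a term lookup of its own, the per-shard check is a dictionary access rather than a scan over every document, which is the role the Lucene index plays inside a real shard.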

(!) This can be confusing because the term 'index' is used at two levels: Elasticsearch shards are parts of an Elasticsearch index, yet each shard is itself based on the Lucene index structure.

Bonus - Lucene Index as an Inverted Index

As shown in the example below, the Lucene index stores the original document content along with additional information, such as a term dictionary and term frequencies, which improve search efficiency:

Term      | Document           | Frequency
Cake      | doc_id_1, doc_id_8 | 4 (2 in doc_id_1, 2 in doc_id_8)
Cookie    | doc_id_1, doc_id_6 | 3 (2 in doc_id_1, 1 in doc_id_6)
Spaghetti | doc_id_12          | 1 (1 in doc_id_12)

The Lucene index belongs to the inverted index family. This is because it can list the documents containing a specific term. This is the opposite of the natural relationship, where documents list terms.
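The inversion itself is easy to sketch in Python. This is a toy model, not Lucene's actual on-disk format: the sample texts are made up so that the counts match the Cake, Cookie, and Spaghetti example, and tokenization is a plain whitespace split.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            index[term][doc_id] = count
    return index


def postings(index, term):
    """Return (documents containing term, total frequency, per-document counts)."""
    per_doc = index.get(term, {})
    return sorted(per_doc), sum(per_doc.values()), per_doc


# Natural direction: each document lists its terms.
docs = {
    "doc_id_1": "Cake Cookie Cake Cookie",
    "doc_id_6": "Cookie",
    "doc_id_8": "Cake Cake",
    "doc_id_12": "Spaghetti",
}

# Inverted direction: each term lists its documents.
index = build_inverted_index(docs)
print(postings(index, "Cake"))
# (['doc_id_1', 'doc_id_8'], 4, {'doc_id_1': 2, 'doc_id_8': 2})
```

Looking up a term is now a single dictionary access, instead of a scan over every document's text.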

(Reminder) How do we go from shards to terms?

(1) A shard is a directory containing document files. (2) A document is a sequence of fields. (3) A field is a named sequence of terms.
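The same hierarchy can be written down as plain data types. The class and function names below (`Shard`, `Document`, `Field`, `iter_terms`) are hypothetical names for this sketch and do not correspond to Lucene's Java classes; the point is only to show how one walks from a shard down to individual terms.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Field:
    """A named sequence of terms."""
    name: str
    terms: List[str]


@dataclass
class Document:
    """A sequence of fields."""
    doc_id: str
    fields: List[Field]


@dataclass
class Shard:
    """A container of documents (on disk, a directory of files)."""
    documents: List[Document] = field(default_factory=list)


def iter_terms(shard: Shard) -> Iterator[Tuple[str, str, str]]:
    """Walk shard -> document -> field -> term, yielding (doc_id, field name, term)."""
    for document in shard.documents:
        for fld in document.fields:
            for term in fld.terms:
                yield document.doc_id, fld.name, term


shard = Shard(documents=[
    Document("doc_id_6", [Field("title", ["Cookie"])]),
    Document("doc_id_12", [Field("title", ["Spaghetti"]), Field("body", ["Spaghetti"])]),
])
for doc_id, field_name, term in iter_terms(shard):
    print(doc_id, field_name, term)
```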

June 29, 2024, 12:07

Lucene is a Java library. You can include it in your project and call its APIs directly from your own code.

Elasticsearch is a JSON-based, distributed web server built on top of Lucene. Lucene does the underlying work, and Elasticsearch adds a convenient layer on top of it. Each shard in Elasticsearch is a separate Lucene instance. To summarize:

  1. Elasticsearch is built on Lucene and provides a JSON-based REST API to access Lucene features.
  2. Elasticsearch provides a distributed system built on Lucene. Lucene itself neither knows about nor manages distribution; Elasticsearch provides the abstraction for this distributed structure.
  3. Elasticsearch also provides other supporting features, such as thread pools, queues, node/cluster monitoring APIs, data monitoring APIs, and cluster management.
June 29, 2024, 12:07
