How many of you have ever had to solve one of these problems?

“Given a free text field, implement the search for products in a catalogue by name or category, showing first the results corresponding to the name and providing an autocompletion to the user.”

“Implement a search field for articles with which it is possible to search for them by name tolerating any typing errors.”

“Given a database of products sold, identify the 10 best-selling categories of the current calendar year and return the average price for each.”

If you are one of them or have faced similar problems, Elasticsearch is the tool for you!

Introduction

Elasticsearch is a distributed search engine, used to perform full-text search and data analysis. For example, Wikipedia uses Elasticsearch to perform an instant search for articles and to provide related suggestions, while GitHub uses it to search for code and Stack Overflow to search for and to show related questions.

Technically speaking, Elasticsearch is software written in Java and based on Apache Lucene, a Java library that implements full-text search very efficiently. But Lucene is just a library: to use it you need to write your own Java application and integrate with it directly. Even worse, Lucene is complex to use, and writing an application on top of it requires a thorough knowledge of full-text search theory. Elasticsearch uses Lucene internally but exposes a simple RESTful API that hides this complexity and that can be used directly or through one of the clients written in numerous languages (JavaScript, Python, Ruby, Java, .NET, Perl, etc.).
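To make the idea of "using the REST API directly" concrete, here is a minimal sketch in Python that prepares the HTTP request for storing one JSON document. The index name `catalogue`, the type `product`, the document fields and the local URL are all assumptions made for illustration; the `/index/type/id` URL layout matches the 1.x series, and later versions changed it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that stores one JSON document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send a prepared request and decode the JSON reply (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical document, used only to show the shape of the call.
request = build_index_request(
    "catalogue", "product", 1, {"name": "Espresso machine", "category": "kitchen"}
)
print(request.get_method(), request.full_url)
```

Calling `send(request)` against a running node would perform the actual indexing; the sketch stops at building the request so that it can be read and run without a server.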

[Figure: Elasticsearch and Lucene]

But Elasticsearch is also a document-oriented database: it stores complex documents serialised as JSON, makes them searchable, and has a distributed structure that lets it scale easily to hundreds of servers and petabytes of data.

The structure of a cluster

The classic Elasticsearch configuration is a cluster of one or more nodes, where each node is a single instance of Elasticsearch installed on a server or on a virtual machine.

An orthogonal concept is that of the index, a collection of documents of one or more types (comparable, therefore, to a database in relational terms). Each index consists of one or more shards, each corresponding to a single instance of Lucene, and which can be of two types: primary shards, which hold the indexed documents, and replica shards, which are copies of the primaries kept for redundancy and to serve search requests.

The documents that are part of an index are distributed among the primary shards that compose it, so that each shard holds roughly the same share.
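The distribution of documents over primary shards can be sketched as a hash of a routing value (by default the document ID) taken modulo the number of primary shards. The snippet below is only an illustration: `pick_shard` is a hypothetical helper, and CRC32 stands in for the hash function Elasticsearch uses internally.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the document ID) to a primary shard number."""
    # Assumption: CRC32 is a stand-in for Elasticsearch's internal hash function.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


# Route 9000 made-up document IDs over 3 primary shards and count the results.
counts = [0] * 3
for doc_id in range(9000):
    counts[pick_shard(str(doc_id), 3)] += 1
print(counts)
```

Because the shard number depends on the count of primary shards, changing that count would send existing documents to different shards, which is why it is fixed when the index is created.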

Elasticsearch automatically distributes the configured shards across the cluster in order to optimise robustness and performance. When a client queries a node of the cluster, that node acts as coordinator: it queries all the shards involved in the search, including those on other nodes, and combines their answers into a single response, applying any aggregation logic requested.
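The coordinator's "combine the answers" step can be pictured as a merge of several ranked lists. The following is a simplified sketch, not Elasticsearch's actual implementation: it assumes each shard returns its hits already sorted by descending score, and the hit dictionaries are invented for the example.

```python
import heapq
from itertools import islice


def merge_shard_hits(per_shard_hits, size):
    """Combine per-shard hit lists (each sorted by descending score) into a global top `size`."""
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit["score"], reverse=True)
    return list(islice(merged, size))


# Made-up partial results, one list per shard.
shard_0 = [{"id": "a", "score": 2.4}, {"id": "b", "score": 0.7}]
shard_1 = [{"id": "c", "score": 3.1}, {"id": "d", "score": 1.9}]
shard_2 = [{"id": "e", "score": 1.2}]

top = merge_shard_hits([shard_0, shard_1, shard_2], size=3)
print([hit["id"] for hit in top])  # ['c', 'a', 'd']
```

Each shard only needs to return its own best hits, and the coordinator never has to look at the full contents of any shard.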

[Figure: Elasticsearch node]

Thanks to this distributed and flexible structure, Elasticsearch can efficiently manage a small number of documents, yet easily scale up to manage millions of them.

Searching

While it is possible to search for documents in Elasticsearch with little effort, using a simple query string that supports the most common operators, the true power of Elasticsearch emerges when using its Query DSL. With this second variant it is possible not only to search for the documents that meet the requirements, but also to influence the scoring algorithm used by Elasticsearch, rewarding certain documents over others, or to filter the search results according to fairly complex criteria.
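As an illustration of the Query DSL, the sketch below builds two request bodies that echo the first two problems from the opening: a search over name and category that weights name matches higher, and a search that tolerates typing errors. The field names `name` and `category` and the boost factor of 3 are assumptions; the bodies would be sent to an index's `_search` endpoint.

```python
import json


# Hypothetical field names ("name", "category"); adapt them to your own mapping.
def name_first_query(text):
    """Search name and category, weighting matches on the name three times higher."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["name^3", "category"],
            }
        }
    }


def typo_tolerant_query(text):
    """Match on the name while tolerating small typing errors."""
    return {
        "query": {
            "match": {
                "name": {"query": text, "fuzziness": "AUTO"}
            }
        }
    }


print(json.dumps(name_first_query("coffee"), indent=2))
print(json.dumps(typo_tolerant_query("cofee"), indent=2))
```

The `^3` suffix is what rewards documents matching on the name, while `fuzziness` lets a misspelt term such as "cofee" still match "coffee".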

In addition to this, Elasticsearch makes it possible to highlight search results, to aggregate the results obtained and to implement advanced features such as “more like this” or “did you mean?”.
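The third problem from the opening (the 10 best-selling categories of the current year, with the average price of each) maps naturally onto an aggregation. This is a sketch under stated assumptions: every document represents one sale, and the field names `sold_at`, `category` and `price` are hypothetical.

```python
import json

# Hypothetical fields: "sold_at" (date), "category" (not analysed), "price" (number).
best_selling_categories = {
    "size": 0,  # return only the aggregation, not the matching documents
    "query": {
        # Date math: "now/y" rounds down to the start of the current year.
        "range": {"sold_at": {"gte": "now/y"}}
    },
    "aggs": {
        "top_categories": {
            # The 10 categories with the most matching documents (sales).
            "terms": {"field": "category", "size": 10},
            # For each of those categories, the average of the price field.
            "aggs": {"average_price": {"avg": {"field": "price"}}},
        }
    },
}

print(json.dumps(best_selling_categories, indent=2))
```

Nesting the `avg` aggregation inside the `terms` aggregation is what yields one average per category rather than a single overall figure.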

The first execution

Installing and launching Elasticsearch is as easy as drinking a glass of water. Being written in Java, Elasticsearch requires a Java virtual machine (Oracle or OpenJDK, in versions 7 or 8) and is capable of running smoothly on a large number of Linux distributions, on Windows Server and on Solaris.

After downloading Elasticsearch (from the official website, or from the repositories via apt or yum) and extracting the archive, it is sufficient to run the following from the extracted directory on *nix systems:

$ bin/elasticsearch

while on Windows systems:

> bin\elasticsearch.bat

and then to start experimenting!

$ curl -X GET http://localhost:9200/

{
  "status" : 200,
  "name" : "Xemu",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.3",
    "build_hash" : "05d4530971ef0ea46d0f4fa6ee64dbc8df659682",
    "build_timestamp" : "2015-10-15T09:14:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}
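The same check can be scripted. The sketch below, which assumes a node listening on the default local address, separates fetching from parsing so that the parsing step can be tried on a saved response; the helper names are hypothetical and the sample is a trimmed version of the output above.

```python
import json
import urllib.request


def parse_node_info(body):
    """Extract a few fields from the JSON returned by the root endpoint."""
    info = json.loads(body)
    return {
        "node": info["name"],
        "cluster": info["cluster_name"],
        "version": info["version"]["number"],
        "lucene": info["version"]["lucene_version"],
    }


def fetch_node_info(url="http://localhost:9200/"):
    """Query a running node; raises URLError if nothing is listening."""
    with urllib.request.urlopen(url) as response:
        return parse_node_info(response.read().decode("utf-8"))


# Trimmed copy of the response shown above, so the parser runs without a node.
sample = """{
  "status": 200,
  "name": "Xemu",
  "cluster_name": "elasticsearch",
  "version": {"number": "1.7.3", "lucene_version": "4.10.4"},
  "tagline": "You Know, for Search"
}"""
print(parse_node_info(sample))
```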

For further information: