Beginner's Guide to Elasticsearch

For the uninitiated, Elasticsearch is a schema-free, JSON document-based search server built on top of the indexing library Lucene. While it does provide a powerful full-text search system, Elasticsearch also provides many other features that make it great for things like aggregating statistics. In this post, I am going to walk through how to set up Elasticsearch and cover the basics of storing and querying data.

Setup

For this setup, we'll be using Ubuntu 12.04 LTS as our base operating system.

Because Elasticsearch is written in Java, we need to install the JVM.

sudo aptitude install openjdk-7-jre-headless

We won't be needing any of the UI-related features of Java, so we can save some space and time by installing the headless version.

Next, we need to download Elasticsearch. We're going to just download the standalone tarball, but there is also a deb package available if you wish to install it with dpkg.

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.9.tar.gz
tar -xvf elasticsearch-0.90.9.tar.gz

Now that we have it downloaded, running it is as simple as executing the included binary.

cd elasticsearch-0.90.9
./bin/elasticsearch

By default, this version of Elasticsearch will run as a background daemon process. For this post, we're going to run it in the foreground so that we can watch the logs. This can be accomplished by providing the -f flag.

./bin/elasticsearch -f

Terminology

Before we start, there is some vocabulary that we need to familiarize ourselves with:

Node

A node is a Java process running Elasticsearch. Typically each node will run on its own dedicated machine.

Cluster

A cluster is one or more nodes with the same cluster name.

Shard

A shard is a single Lucene index. Every Elasticsearch index is made up of one or more shards, and each shard is either a primary or a replica of a primary.

Index

An index is the rough equivalent of a database in relational database land. The index is the top-most level and can be found at http://yourdomain.com:9200/<your index>

Types

Types are objects that are contained within indexes. Think of them like tables. Being a child of the index, they can be found at http://yourdomain.com:9200/<your index>/<some type>

Documents

Documents are found within types. These are basically JSON blobs that can be of any structure. Think of them like rows found in a table.
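
To make the index / type / document hierarchy concrete, here is a minimal Python sketch that builds the URL for each level. The resource_url helper and the BASE_URL constant are hypothetical names made up for illustration, and the index and type names are the ones used later in this post. The last level relies on the fact that a single document lives at <index>/<type>/<id>.

```python
from urllib.parse import quote

# Assumption: a single local node listening on the default HTTP port.
BASE_URL = "http://localhost:9200"


def resource_url(index, doc_type=None, doc_id=None, base_url=BASE_URL):
    """Build the URL for an index, a type inside it, or one document (hypothetical helper)."""
    if doc_id is not None and doc_type is None:
        raise ValueError("a document ID only makes sense together with a type")
    parts = [index]
    if doc_type is not None:
        parts.append(doc_type)
    if doc_id is not None:
        parts.append(doc_id)
    # Percent-encode each path segment so unusual names cannot break the URL.
    return base_url + "/" + "/".join(quote(part, safe="") for part in parts)


print(resource_url("testindex"))
print(resource_url("testindex", "mySuperCoolType"))
print(resource_url("testindex", "mySuperCoolType", "N_c9-SQ8RRSrRwPIBqG6Ow"))
```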

Querying Elasticsearch

Out of the box, Elasticsearch comes with a RESTful API that we'll be using to make our queries. I'm running Elasticsearch locally, so all of the examples use localhost; if your node is running elsewhere, simply replace localhost with your fully qualified domain. By default, this means the URL we'll be using is http://localhost:9200/

Creating an index

The first thing we need to do is create an index. We're going to call our index "testindex". To do this, we simply need to make a POST request to http://localhost:9200/testindex

curl -X POST http://localhost:9200/testindex -d '{}'

When you create an index, there are a number of options that you can pass along, such as mapping definitions and settings for the number of shards and replicas. For now, we're just going to post an empty object. We'll revisit mappings later on in a more advanced post.
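
For the curious, here is a rough Python sketch of what passing shard and replica settings at creation time could look like. The create_index_request helper is a hypothetical name, and the values in the usage lines are arbitrary examples. The sketch only builds the request; handing it to urllib.request.urlopen would actually send it to a running node.

```python
import json
import urllib.request


def create_index_request(index, shards=5, replicas=1, base_url="http://localhost:9200"):
    """Build (but do not send) a create-index request with explicit settings.

    Hypothetical helper for illustration; the default values are assumptions.
    """
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
    )


# Example values only: one primary shard and no replicas suits a single test node.
req = create_index_request("testindex", shards=1, replicas=0)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```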

Inserting a document

To insert our first document, we need a type. For this example, we'll be using mySuperCoolType, and we'll be inserting a document with a full name field and a field for a Twitter handle.

curl -X POST http://localhost:9200/testindex/mySuperCoolType -d '
{
    "fullName": "Sean McGary",
    "twitterHandle": "@seanmcgary"
}'

// response
{"ok":true,"_index":"testindex","_type":"mySuperCoolType","_id":"N_c9-SQ8RRSrRwPIBqG6Ow","_version":1}
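
If you would rather script this than use curl, the same insert can be sketched in Python with only the standard library. This is a minimal sketch rather than a full client: it builds the request without sending it, then parses the sample response shown above to pull out the auto-assigned ID. The variable names are my own.

```python
import json
import urllib.request

document = {"fullName": "Sean McGary", "twitterHandle": "@seanmcgary"}

# Build (but do not send) the same POST that the curl command performs.
req = urllib.request.Request(
    url="http://localhost:9200/testindex/mySuperCoolType",
    data=json.dumps(document).encode("utf-8"),
    method="POST",
)

# The sample response from this post, used here in place of a live server.
sample_response = (
    '{"ok":true,"_index":"testindex","_type":"mySuperCoolType",'
    '"_id":"N_c9-SQ8RRSrRwPIBqG6Ow","_version":1}'
)
result = json.loads(sample_response)

# The auto-assigned ID identifies the new document underneath its type.
doc_url = f"{req.full_url}/{result['_id']}"
print(doc_url)
```

Against a live node, urllib.request.urlopen(req) would send the request, and the body it returns could be parsed the same way.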

Querying

Now that we have a document, we can start to query our data. Since we didn't provide any specifics around field mappings, Elasticsearch will try to determine the type of each field (string, number, object, etc.) and run the default analyzers and indexers on it.

To test this, we'll query our type and try to match the full name field.

curl -X GET http://localhost:9200/testindex/mySuperCoolType/_search -d '
{
    "query": {
        "match": {
            "fullName": "Sean"
        }
    }
}'

// result
{
   "took":2,
   "timed_out":false,
   "_shards":{
      "total":5,
      "successful":5,
      "failed":0
   },
   "hits":{
      "total":1,
      "max_score":0.19178301,
      "hits":[
         {
            "_index":"testindex",
            "_type":"mySuperCoolType",
            "_id":"N_c9-SQ8RRSrRwPIBqG6Ow",
            "_score":0.19178301,
            "_source":{
               "fullName":"Sean McGary",
               "twitterHandle":"@seanmcgary"
            }
         }
      ]
   }
}

When you make a query, Elasticsearch will spit back a bunch of metadata: how long the query took, whether it timed out, and how many shards it queried along with how many succeeded or failed. The last field that it returns is the "hits" object. This is where all of your results will appear. The root hits object tells you the number of matches found, the max score of all the hits, and of course, the array of hits. Each hit includes meta info (prefixed with an underscore) such as the auto-assigned ID and the score, along with the source, which is the original document data you inserted.

Full-text searching is one of the strong features of Elasticsearch, so when it performs the search, it will rank all matches based on their score. The score measures how closely each document matches the original query. The score can be modified if you wish to add additional weight based on certain parameters. We'll cover that in a later, more advanced post.
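
To show how those pieces fit together in code, here is a small Python sketch that walks the sample response from this post. The summarize function is a hypothetical helper, and the JSON is the result shown above, only reformatted.

```python
import json

# The sample search response from this post (reformatted, same values).
response_text = """
{
  "took": 2,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "testindex",
        "_type": "mySuperCoolType",
        "_id": "N_c9-SQ8RRSrRwPIBqG6Ow",
        "_score": 0.19178301,
        "_source": {"fullName": "Sean McGary", "twitterHandle": "@seanmcgary"}
      }
    ]
  }
}
"""


def summarize(response):
    """Return (total matches, [(id, score, source), ...]) with the best score first."""
    hits = response["hits"]
    # Sorting here is purely defensive, in case the input is not already ordered.
    ranked = sorted(hits["hits"], key=lambda hit: hit["_score"], reverse=True)
    return hits["total"], [(h["_id"], h["_score"], h["_source"]) for h in ranked]


response = json.loads(response_text)

# The top-level metadata says whether the result set can be trusted as complete.
if response["timed_out"] or response["_shards"]["failed"] > 0:
    print("warning: results may be incomplete")

total, rows = summarize(response)
for doc_id, score, source in rows:
    print(doc_id, score, source["fullName"])
```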

As you can see here in our results, we got one match by querying for "Sean" in the fullName field. Because we didn't specify a mapping, Elasticsearch applied the "standard analyzer" to the fullName field. The standard analyzer takes the contents of the field (in this case a string), splits it into individual terms on word boundaries such as spaces, lowercases all letters, and removes common stopwords (words like "and" and "the"). The text of a match query goes through the same analysis, so "Sean" becomes "sean", which is one of the terms indexed for "Sean McGary". This is why when we query "Sean" it matches "Sean McGary".
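
The idea can be sketched with a toy version of the analysis step in Python. This is a deliberate simplification, not the actual Lucene implementation: the analyze and matches functions are hypothetical, the stopword set is a small illustrative subset, and the regular expression is only a rough stand-in for real word-boundary detection.

```python
import re

# Illustrative subset of English stopwords; the real list is longer.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def analyze(text):
    """Toy analyzer: lowercase, split into word tokens, drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def matches(query, field_value):
    """A field matches when at least one analyzed query term is among its terms."""
    field_terms = set(analyze(field_value))
    return any(term in field_terms for term in analyze(query))


print(analyze("Sean McGary"))
print(matches("Sean", "Sean McGary"))
print(matches("Sea", "Sean McGary"))
```

Note that a partial term like "Sea" does not match in this sketch, because matching happens on whole terms rather than on substrings.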

Let's take a look at another query. This time though, we're going to apply a filter to the results.

curl -X GET http://localhost:9200/testindex/mySuperCoolType/_search -d '
{
    "query": {
        "match": {
            "fullName": "Sean"
        }
    },
    "filter": {
        "query": {
            "match": {
                "twitterHandle": "seanmcgary"
            }
        }
    }
}'

This particular request returns exactly what we had before, but let's break it down a little bit. To start, it's important to understand the difference between queries and filters. The query is performed first on the entire dataset, which here is the "mySuperCoolType" type, and scores each match. Elasticsearch will then apply the filter to the result set of the query before returning the data. Unlike queries though, filters don't affect the score, and their results can be cached, which can improve performance.
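
Hand-written JSON bodies are easy to get subtly wrong, so one option is to build the body as a data structure and let a JSON library serialize it. Here is a minimal Python sketch of that approach for the query-plus-filter request above. The filtered_search_body function is a hypothetical helper, not part of any Elasticsearch client.

```python
import json


def filtered_search_body(query_field, query_text, filter_field, filter_text):
    """Build a search body with a match query plus a top-level query filter.

    Hypothetical helper; it mirrors the request structure used in this post.
    """
    return {
        "query": {"match": {query_field: query_text}},
        "filter": {"query": {"match": {filter_field: filter_text}}},
    }


body = filtered_search_body("fullName", "Sean", "twitterHandle", "seanmcgary")

# json.dumps quotes every key, so the payload is always valid JSON.
payload = json.dumps(body, indent=4)
print(payload)
```

The printed payload can be passed to curl with -d, just like the hand-written body.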

Conclusion

This concludes our introduction to Elasticsearch. In a follow-up post, I'll introduce some more advanced features such as setting up mappings, custom analyzers and indexers, and get into how to use facets for things such as analytics and creating histograms.