
Elasticsearch date histogram aggregation - filling in the empty buckets

When it comes to segmenting data to be visualized, Elasticsearch has become my go-to database as it will basically do all the work for me. Need to find how many times a specific search term shows up in a data field? It can do that for you. Need to sum the totals of a collection of placed orders over a time period? It can do that too. Today, though, I'm going to talk about generating a date histogram, but this one is a little special because it uses Elasticsearch's new aggregations feature (basically facets on steroids), which will let us fill in some of the empty holes in the data.

First came facets

Back before v1.0, Elasticsearch started with this cool feature called facets. A facet was a built-in way to query and aggregate your data in a statistical fashion. Like I said in my introduction, you could analyze the number of times a term showed up in a field, or sum a field together to get a total, mean, median, etc. You could even have Elasticsearch generate a histogram or even a date histogram (a histogram over time) for you. The date histogram was particularly interesting, as you could give it an interval to bucket the data into. This could be anything from a second to a minute to two weeks, etc. That was about as far as you could go with it, though.

Aggregations - facets on steroids

With the release of Elasticsearch v1.0 came aggregations. If you look at the aggregation syntax, it looks pretty similar to the facet syntax, and a lot of the facet types are also available as aggregations. The general structure for aggregations looks something like this:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

Let's take a quick look at a basic date histogram facet and aggregation:

// this is a facet
{
    "query": {
        "match_all": {}
    },
    "facets": {
        "some_date_facet": {
            "date_histogram": {
                "key_field": "timestamp",
                "value_field": "widgets_sold",
                "interval": "day"
            }
        }
    }
}

// this is an aggregation
{
    "query": {
        "match_all": {}
    },
    "aggregations": {
        "some_date_facet": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "day"
            }
        }
    }
}

They look pretty much the same, though they return fairly different data. The facet date histogram will return stats for each date bucket, whereas the aggregation will return a bucket with the number of matching documents for each. The reason for this is that aggregations can be combined and nested together, so if you wanted data similar to the facet's, you could then run a stats aggregation on each bucket:

{
    "query": {
        "match_all": {}
    },
    "aggregations": {
        "some_date_facet": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "day"
            },
            "aggregations": {
                "bucket_stats": {
                    "stats": {
                        "field": "widgets_sold"
                    }
                }
            }
        }
    }
}
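
For reference, each bucket in the response then carries the stats result alongside its doc_count. The shape looks roughly like this (placeholders instead of real values):

"aggregations": {
    "some_date_facet": {
        "buckets": [
            {
                "key_as_string": "<date>",
                "key": <timestamp_in_ms>,
                "doc_count": <count>,
                "bucket_stats": {
                    "count": <count>,
                    "min": <min>,
                    "max": <max>,
                    "avg": <avg>,
                    "sum": <sum>
                }
            }
        ]
    }
}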

Filling in the missing holes

One of the issues that I've run into before with the date histogram facet is that it will only return buckets based on the applicable data. This means that if you are trying to get stats over a date range and nothing matches, it will return nothing. If I'm trying to draw a graph, that isn't very helpful. One of the new features in the date histogram aggregation is the ability to fill in those holes in the data. I'll walk you through an example of how it works.

Let's first get some data into our Elasticsearch database. We're going to create an index called dates and a type called entry.
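
Dynamic mapping will happily detect the ISO date strings below as a date field, but if you'd rather be explicit, you could create the index with a mapping up front. Here's a minimal sketch using the 1.x mapping API (the integer type for value is just an assumption about the data):

curl -XPUT http://elasticsearch.local:9200/dates -d '
{
  "mappings": {
    "entry": {
      "properties": {
        "date": { "type": "date" },
        "value": { "type": "integer" }
      }
    }
  }
}
'

Either way, let's index a handful of documents: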

curl -XPOST http://elasticsearch.local:9200/dates/entry -d '{ "date": "2014-05-21T00:00:00.000Z", "value": 10 }'
curl -XPOST http://elasticsearch.local:9200/dates/entry -d '{ "date": "2014-05-22T00:00:00.000Z", "value": 10 }'
curl -XPOST http://elasticsearch.local:9200/dates/entry -d '{ "date": "2014-05-23T00:00:00.000Z", "value": 10 }'
curl -XPOST http://elasticsearch.local:9200/dates/entry -d '{ "date": "2014-05-26T00:00:00.000Z", "value": 10 }'
curl -XPOST http://elasticsearch.local:9200/dates/entry -d '{ "date": "2014-05-30T00:00:00.000Z", "value": 10 }'
curl -XPOST http://elasticsearch.local:9200/dates/entry -d '{ "date": "2014-06-10T00:00:00.000Z", "value": 10 }'
curl -XPOST http://elasticsearch.local:9200/dates/entry -d '{ "date": "2014-05-11T00:00:00.000Z", "value": 10 }'
curl -XPOST http://elasticsearch.local:9200/dates/entry -d '{ "date": "2014-05-12T00:00:00.000Z", "value": 10 }'

Run that and it'll insert some dates that have some gaps in between. Let's now create an aggregation that calculates the number of documents per day:

curl -XGET http://elasticsearch.local:9200/dates/entry/_search -d '
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "dates_with_holes": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      }
    }
  }
}
'

If we run that, we'll get a result with an aggregations object that looks like this:

"aggregations":{
  "dates_with_holes":{
     "buckets":[
        {
           "key_as_string":"2014-05-11T00:00:00.000Z",
           "key":1399766400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-12T00:00:00.000Z",
           "key":1399852800000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-21T00:00:00.000Z",
           "key":1400630400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-22T00:00:00.000Z",
           "key":1400716800000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-23T00:00:00.000Z",
           "key":1400803200000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-26T00:00:00.000Z",
           "key":1401062400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-30T00:00:00.000Z",
           "key":1401408000000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-06-10T00:00:00.000Z",
           "key":1402358400000,
           "doc_count":1
        }
     ]
  }
}

As you can see, it returned a bucket for each date that was matched. In this case, since each date we inserted was unique, it returned one bucket per document. That's cool, but what if we want the gaps between dates filled in with a zero value? It turns out there is an option to do exactly that: min_doc_count. In this case we'll specify min_doc_count: 0. Our new query then looks like this:

curl -XGET http://elasticsearch.local:9200/dates/entry/_search -d '
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "dates_with_holes": {
      "date_histogram": {
        "field": "date",
        "interval": "day",
        "min_doc_count": 0
      }
    }
  }
}
'

Alright, now we have some zeros:

"aggregations":{
  "dates_with_holes":{
     "buckets":[
        {
           "key_as_string":"2014-05-11T00:00:00.000Z",
           "key":1399766400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-12T00:00:00.000Z",
           "key":1399852800000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-13T00:00:00.000Z",
           "key":1399939200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-14T00:00:00.000Z",
           "key":1400025600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-15T00:00:00.000Z",
           "key":1400112000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-16T00:00:00.000Z",
           "key":1400198400000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-17T00:00:00.000Z",
           "key":1400284800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-18T00:00:00.000Z",
           "key":1400371200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-19T00:00:00.000Z",
           "key":1400457600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-20T00:00:00.000Z",
           "key":1400544000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-21T00:00:00.000Z",
           "key":1400630400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-22T00:00:00.000Z",
           "key":1400716800000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-23T00:00:00.000Z",
           "key":1400803200000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-24T00:00:00.000Z",
           "key":1400889600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-25T00:00:00.000Z",
           "key":1400976000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-26T00:00:00.000Z",
           "key":1401062400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-27T00:00:00.000Z",
           "key":1401148800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-28T00:00:00.000Z",
           "key":1401235200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-29T00:00:00.000Z",
           "key":1401321600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-30T00:00:00.000Z",
           "key":1401408000000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-31T00:00:00.000Z",
           "key":1401494400000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-01T00:00:00.000Z",
           "key":1401580800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-02T00:00:00.000Z",
           "key":1401667200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-03T00:00:00.000Z",
           "key":1401753600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-04T00:00:00.000Z",
           "key":1401840000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-05T00:00:00.000Z",
           "key":1401926400000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-06T00:00:00.000Z",
           "key":1402012800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-07T00:00:00.000Z",
           "key":1402099200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-08T00:00:00.000Z",
           "key":1402185600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-09T00:00:00.000Z",
           "key":1402272000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-10T00:00:00.000Z",
           "key":1402358400000,
           "doc_count":1
        }
     ]
  }
}

All of the gaps are now filled in with zeroes. Here comes our next use case: say I want to aggregate documents by day for dates between 5/1/2014 and 5/30/2014. Our earliest document is on 5/11/2014, and min_doc_count only fills in gaps between buckets that actually contain data, so nothing would be returned for 5/1/2014 through 5/10/2014. It turns out we can tell Elasticsearch to populate those buckets as well by passing an extended_bounds object, which takes a min and max value. This way we can generate buckets that fall outside of the existing data points, not just between them. Our query now becomes:

curl -XGET http://elasticsearch.local:9200/dates/entry/_search -d '
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "dates_with_holes": {
      "date_histogram": {
        "field": "date",
        "interval": "day",
        "min_doc_count": 0,
        "extended_bounds": {
            "min": 1398927600000,
            "max": 1401433200000
        }
      }
    }
  }
}
'

The weird caveat to this is that the min and max values have to be numerical timestamps, not date strings.
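
If you need to produce those, here's one quick way to get millisecond timestamps from the command line (a sketch assuming GNU date; BSD/macOS date takes different flags). The values in the query above happen to be local-time midnights, but any timestamp that falls inside the desired day works, since the bounds get rounded to the same interval as the buckets:

date -u -d "2014-05-01" +%s000   # prints 1398902400000
date -u -d "2014-05-30" +%s000   # prints 1401408000000

Now our result set looks like this: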

"aggregations":{
  "dates_with_holes":{
     "buckets":[
        {
           "key_as_string":"2014-05-01T00:00:00.000Z",
           "key":1398902400000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-02T00:00:00.000Z",
           "key":1398988800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-03T00:00:00.000Z",
           "key":1399075200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-04T00:00:00.000Z",
           "key":1399161600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-05T00:00:00.000Z",
           "key":1399248000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-06T00:00:00.000Z",
           "key":1399334400000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-07T00:00:00.000Z",
           "key":1399420800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-08T00:00:00.000Z",
           "key":1399507200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-09T00:00:00.000Z",
           "key":1399593600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-10T00:00:00.000Z",
           "key":1399680000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-11T00:00:00.000Z",
           "key":1399766400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-12T00:00:00.000Z",
           "key":1399852800000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-13T00:00:00.000Z",
           "key":1399939200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-14T00:00:00.000Z",
           "key":1400025600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-15T00:00:00.000Z",
           "key":1400112000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-16T00:00:00.000Z",
           "key":1400198400000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-17T00:00:00.000Z",
           "key":1400284800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-18T00:00:00.000Z",
           "key":1400371200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-19T00:00:00.000Z",
           "key":1400457600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-20T00:00:00.000Z",
           "key":1400544000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-21T00:00:00.000Z",
           "key":1400630400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-22T00:00:00.000Z",
           "key":1400716800000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-23T00:00:00.000Z",
           "key":1400803200000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-24T00:00:00.000Z",
           "key":1400889600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-25T00:00:00.000Z",
           "key":1400976000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-26T00:00:00.000Z",
           "key":1401062400000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-27T00:00:00.000Z",
           "key":1401148800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-28T00:00:00.000Z",
           "key":1401235200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-29T00:00:00.000Z",
           "key":1401321600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-05-30T00:00:00.000Z",
           "key":1401408000000,
           "doc_count":1
        },
        {
           "key_as_string":"2014-05-31T00:00:00.000Z",
           "key":1401494400000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-01T00:00:00.000Z",
           "key":1401580800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-02T00:00:00.000Z",
           "key":1401667200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-03T00:00:00.000Z",
           "key":1401753600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-04T00:00:00.000Z",
           "key":1401840000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-05T00:00:00.000Z",
           "key":1401926400000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-06T00:00:00.000Z",
           "key":1402012800000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-07T00:00:00.000Z",
           "key":1402099200000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-08T00:00:00.000Z",
           "key":1402185600000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-09T00:00:00.000Z",
           "key":1402272000000,
           "doc_count":0
        },
        {
           "key_as_string":"2014-06-10T00:00:00.000Z",
           "key":1402358400000,
           "doc_count":1
        }
     ]
  }
}

Elasticsearch now returns a bucket for every day in our min/max range.

That about does it for this particular feature. From here, we could take the returned data and drop it straight into a graph, or go on to run a nested aggregation on the data in each bucket.
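
For example, here's one way to flatten the buckets into date/count pairs for charting, assuming you have jq installed (the CSV output and the field choices are just one option):

curl -s -XGET http://elasticsearch.local:9200/dates/entry/_search -d '<query from above>' \
  | jq -r '.aggregations.dates_with_holes.buckets[] | [.key_as_string, .doc_count] | @csv'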