Elastic, data aggregation
In my previous articles I covered using the search API to query data. Now I am going to explain how to use the search API to aggregate data. First I will look at metrics aggregations, which produce a numerical output. Then I will look at bucket aggregations, which are how we categorize, or bucketize, our data. After that I will combine the two into sub-aggregations: aggregations within aggregations. Lastly, I will take aggregations a step further with pipeline aggregations, which take the output of one aggregation and make it the input of another, and I will cover the different types of pipeline aggregations as well.
The first aggregation type is metrics aggregations. A metrics aggregation computes a numeric value from your documents, and it can be either a single-value or a multi-value aggregation. First set up your dataset and make sure the index is up and running with the _cat/indices?v API. You will then use the _search API, which is used for both queries and aggregations. Whenever we're doing an aggregation, we don't really care about the hits array. We don't care about the top 10 or however many documents match our query; all we care about is the aggregation output. So when we're not running a query and only want an aggregation, it makes sense to set the size of the hits array to 0. Next, when building the aggregation call, we define our aggregation. First give it a name, let's say total_sales for this example. It can be any name you want. Once the aggregation is named, we can define which aggregation type we're going to use.
For example, a metrics aggregation can use the sum type on the taxless_total_price field. This aggregation tells us the sum, or the total sales, from our sales dataset. There are many aggregation types we can apply to any data. An average, for example, is the avg aggregation type. To use it you must select the field to aggregate on, for example "sales_per_day", and that field has to be numeric: a text field, analyzed or not, cannot be averaged. Setting the size of the hits array to 0 only hides the matching documents; the avg is still calculated over every document that matches the query.
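As a rough sketch of the request described above, here is the body built as a Python dictionary and printed as JSON, ready to send to the _search endpoint of your own index. The names total_sales and taxless_total_price come from the text; average_sale and sales_stats are names I made up for illustration, and the stats aggregation is only there to show what a multi-value metric looks like.

```python
import json

# Sketch of a _search request body; send it to <your-index>/_search.
metrics_body = {
    "size": 0,  # return no hits, only the aggregation output
    "aggs": {
        # Single-value metrics: each one returns a single number.
        "total_sales": {"sum": {"field": "taxless_total_price"}},
        "average_sale": {"avg": {"field": "taxless_total_price"}},  # hypothetical name
        # Multi-value metric: stats returns count, min, max, avg and sum together.
        "sales_stats": {"stats": {"field": "taxless_total_price"}},  # hypothetical name
    },
}

print(json.dumps(metrics_body, indent=2))
```

In the response, each result should appear under the aggregations key using the name you chose, for example aggregations.total_sales.value for the sum.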
With bucket aggregations, we're creating buckets, or groups, of documents. We establish a criterion to categorize our documents, for example a date field. Take the order date from an ecommerce dataset: we create a bucket for every day and then put each document into the day bucket it belongs to. So we're just bucketizing our data. As with metrics aggregations, we only want to see the aggregation output. The next step is the actual aggregation you want to run. We can give the bucket aggregation any name we want, for example "orders_per_day", just to be descriptive and make it intuitive what we're doing. Then we define the actual aggregation we're performing, for example a date_histogram.
Another aggregation worth mentioning is terms. As covered in previous articles, terms are used as parts of queries, and they can also be used to define buckets: the terms aggregation creates one bucket per distinct value of a field. For a date_histogram, the interval is another setting to define. It controls how wide each bucket is, whether that is days, months or even seconds. There are many more options, and you can find them in the Elastic documentation.
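To make the two bucket aggregations above more concrete, here is a minimal sketch of a request body with both of them. I am assuming the order date lives in a date field called order_date and that there is a keyword field called category.keyword; both field names, and the orders_per_category name, are assumptions, so swap in whatever your index actually has.

```python
import json

# Assumptions: "order_date" is a date field, "category.keyword" is a keyword field.
bucket_body = {
    "size": 0,  # only the buckets matter, not the hits
    "aggs": {
        # One bucket per calendar day, each holding the orders from that day.
        "orders_per_day": {
            "date_histogram": {
                "field": "order_date",
                "calendar_interval": "day",  # older releases used "interval" instead
            }
        },
        # One bucket per distinct value of the field.
        "orders_per_category": {  # hypothetical name
            "terms": {"field": "category.keyword", "size": 5}  # keep the top 5 values
        },
    },
}

print(json.dumps(bucket_body, indent=2))
```

Each bucket in the response carries a key and a doc_count, which is the number of documents that landed in that bucket.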
I have covered metrics and bucket aggregations, and now we can combine them using sub-aggregations. Sub-aggregations allow us to aggregate something per bucket, or per output of the previous aggregation. For example, say the parent aggregation is called total_sales_per_day. Beneath it we can create more aggregations that compute output related to each of its buckets. Sounds complex? Think of it this way: I want to view all my sales together with the sales before VAT/tax. I can bucket the documents per month or per day and add a sub-aggregation that sums the pre-tax sales in each bucket. So each bucket is going to have its own metric calculation output.
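Here is a hedged sketch of that idea: a sum nested inside a date_histogram, followed by a tiny pure-Python loop that mimics what "one metric per bucket" means. The field name order_date, the aggregation name sales_before_tax and the three sample orders are all made up for illustration, and the loop only imitates the result; it is not how Elasticsearch computes it.

```python
import json
from collections import defaultdict

# Request body: a sum sub-aggregation nested inside a date_histogram.
sub_agg_body = {
    "size": 0,
    "aggs": {
        "total_sales_per_day": {
            "date_histogram": {"field": "order_date", "calendar_interval": "day"},
            # The nested "aggs" runs once per day bucket.
            "aggs": {
                "sales_before_tax": {"sum": {"field": "taxless_total_price"}}
            },
        }
    },
}
print(json.dumps(sub_agg_body, indent=2))

# Local illustration only, with made-up orders: one sum per day bucket.
orders = [
    {"order_date": "2024-01-01", "taxless_total_price": 40.0},
    {"order_date": "2024-01-01", "taxless_total_price": 60.0},
    {"order_date": "2024-01-02", "taxless_total_price": 25.0},
]
sales_per_day = defaultdict(float)
for order in orders:
    sales_per_day[order["order_date"]] += order["taxless_total_price"]

print(dict(sales_per_day))  # {'2024-01-01': 100.0, '2024-01-02': 25.0}
```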
There are two types of pipeline aggregations. The first is the parent pipeline aggregation. What makes a pipeline aggregation a parent pipeline aggregation is that its input comes from a parent aggregation. If that sounds confusing, it's probably because it is. In this example, the parent aggregation is our date_histogram aggregation. Inside it we do a sub-aggregation, which is the sum, and then we add a parent pipeline aggregation: we take the output of that sum aggregation and pipe it into a cumulative sum aggregation. That's why it's called a pipeline: it takes as its input the output of something else. The other type is the sibling pipeline aggregation. It takes the output of a sibling aggregation as its input. This is very similar to parent pipeline aggregations, except that in this case our pipeline agg is a sibling of the top-most aggregation, not a sub-aggregation of it. A pipeline aggregation is either a parent or a sibling, depending on where its input comes from. If its input is the output of a parent, it's a parent pipeline agg. If its input is the output of a sibling, it's a sibling pipeline agg.
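The parent and sibling cases are easier to tell apart side by side, so here is a minimal sketch with one of each. The cumulative_sum sits inside the date_histogram (parent pipeline), while max_bucket, which I picked as an example of a sibling pipeline, sits next to it at the top level. All aggregation names and the order_date field are assumptions, and the last few lines only imitate the results locally with made-up numbers.

```python
import json
from itertools import accumulate

pipeline_body = {
    "size": 0,
    "aggs": {
        "sales_per_day": {
            "date_histogram": {"field": "order_date", "calendar_interval": "day"},
            "aggs": {
                # Regular metric sub-aggregation: one sum per day bucket.
                "daily_sales": {"sum": {"field": "taxless_total_price"}},
                # Parent pipeline: lives inside the histogram and reads daily_sales.
                "cumulative_sales": {
                    "cumulative_sum": {"buckets_path": "daily_sales"}
                },
            },
        },
        # Sibling pipeline: lives next to sales_per_day and reads into its buckets.
        "best_day": {
            "max_bucket": {"buckets_path": "sales_per_day>daily_sales"}
        },
    },
}
print(json.dumps(pipeline_body, indent=2))

# Local illustration only, with made-up daily sums.
daily_sales = [100.0, 25.0, 50.0]
print(list(accumulate(daily_sales)))  # running total per day: [100.0, 125.0, 175.0]
print(max(daily_sales))               # value of the best day: 100.0
```

The buckets_path setting is what wires the pipeline together: a bare name points at a metric in the same bucket, and the > separator walks from a sibling aggregation down into one of its metrics.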
To recap what I covered in this article: first, metrics aggregations, which create a single- or multi-value numerical output from our data. Then bucket aggregations, which break data down, like a date histogram, or a terms aggregation that shows how many documents have each value of a field. Then I combined these into sub-aggregations: a metrics aggregation within a bucket agg, or multiple layers of bucket aggregations followed by a metrics aggregation. If that wasn't complicated enough, I made it even worse with pipeline aggregations, which are aggregations whose input is the output of another aggregation. Hopefully this high-level article helped you understand the different types of aggregations in Elasticsearch.