Managing data with Elastic
In this article I'm going to focus on how to manage data in Elasticsearch and how to manage indices. First we need to understand how to define indices in Elasticsearch, then how to manage them using index templates, so that we don't have to manually create every single index we're ever going to need. Then I'm going to cover how to upload data into Elasticsearch using the Data Visualizer in Kibana. This is a great way to either do some ad hoc data analysis or to preview what our data is going to look like before we integrate it into our data pipeline. Next I'm going to look at how to manage the lifecycle of an index. Elasticsearch clusters don't have an infinite amount of capacity; at some point we have to roll our data over, maybe move it to slower nodes, and eventually delete it. So we'll talk about how to create policies that manage that automatically in Elasticsearch. Then, lastly, I'm going to talk about how to integrate time series data into Elasticsearch using data streams.
The basic anatomy of an index comprises three main things. First is aliases. Aliases simplify how we reference one or more indices. Let's say you have an index for app logs: you have app logs 1, 2, 3, 4, or maybe it's app logs plus some date because you're creating a daily index of that log data. You can assign every one of those indices an alias of app_log. That way you can just reference app_log and search across all of those indices with a single reference. The next index component is mappings. Mappings map your fields to data types, sometimes even more than one data type per field, and determine, for example, whether your text fields are analyzed or whether your numerical values are integers or floats. Mappings can be either explicit or dynamic. The third main component of indices is settings. Settings let us customize the behavior, the allocation, and the overall configuration of an index: how many shards does it have? How many replica shards? Does it have an ILM policy? Is it part of a data stream? All of these things we can define in settings.
Index templates allow us to manage one or more indices, and we do this by defining an index pattern. This works much like a wildcard expression: any newly created index whose name matches the pattern is automatically managed according to the template. Index templates also give us access to some more advanced features. We can create data streams, we can assign index lifecycle management policies and snapshot policies, and much more. Finally, a newer Elasticsearch feature is component templates. Not only can we create a template to manage a bunch of indices, we can create component templates to manage parts of a template. As mentioned above, indices are made up of aliases, mappings, and settings. We can create a component template containing, say, a certain mapping or a certain setting, and then reuse it across index templates. That way all of our index templates share the same setting, but we don't have to define it individually in each one; we define it once in a component template and then compose our index templates from those, as shown in the sketch below.
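As a rough sketch (the template and field names are made up for illustration), two component templates composed into an index template might look like this:

```
// reusable settings, defined once
PUT _component_template/shared-settings
{
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}

// reusable mappings, defined once
PUT _component_template/shared-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" }
      }
    }
  }
}

// the index template composes the two pieces above
PUT _index_template/app-logs-template
{
  "index_patterns": ["app_logs-*"],
  "composed_of": ["shared-settings", "shared-mappings"]
}
```

Any index created with a name like app_logs-2023.05.01 would then automatically pick up both the shared settings and the shared mappings.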
Sometimes you have new data that you want to integrate into your data pipeline, but you want to see what it looks like first. You're not sure how to configure the pipeline, your processors, your index, or your templates; you just want a way to quickly ingest the data and see what it looks like. Another good use case is ad hoc data analysis: you've got a flat file's worth of data and you want to explore it in some sort of analysis tool. The file formats we can use here are CSV and TSV, which are delimited text files; newline-delimited JSON, or NDJSON; and log files. The one thing with log files is that you need to make sure they use a consistent format, both for the log lines and for the timestamps. If you have different timestamp formats or log formats in the same file, you're going to run into issues with the Data Visualizer.
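For instance, a well-formed NDJSON file is just one JSON object per line, with every record using the same timestamp format (the field names and values below are made up):

```
{"@timestamp": "2023-05-01T12:00:00Z", "level": "INFO", "message": "service started"}
{"@timestamp": "2023-05-01T12:00:05Z", "level": "ERROR", "message": "connection refused"}
{"@timestamp": "2023-05-01T12:00:09Z", "level": "INFO", "message": "retrying connection"}
```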
Index lifecycle management (ILM) policies
Index lifecycle management policies, or ILM policies, allow us to manage the lifecycle of our indices, because Elasticsearch does not have an infinite amount of capacity. We have to move data through a series of phases: the hot phase, warm phase, cold phase, and delete phase. There's no rule saying you have to use all of these phases (you do have to use at least one, typically the hot phase), but ILM lets us define multiple phases and set different criteria for each one. We can determine how our indices transition into these phases. Typically it's based on either the age of the index or the size of the index, in terms of actual disk space, primary shard size, or even document count. When we do transition into a phase, we can perform some actions. We can roll over an index. We can force merge an index, which optimizes an index for read performance rather than indexing. We can migrate the index to different nodes. We can shrink it in terms of the number of shards it has. We can freeze it, which dramatically reduces the amount of resources it takes up in the cluster: frozen indices are read-only, and the cluster keeps just barely enough metadata in memory to keep track of them. And ultimately we can delete data once it's retired and no longer useful to us. When your data is first indexed, particularly time series data, it's what we call hot data.
These are indices that are actively being written to and frequently queried.
Typically you're going to have these on your fastest nodes. Then we can move into the warm phase. Warm nodes can be more read optimized than write optimized, because the index is no longer being written to but is still being queried, probably not as heavily as hot data. These nodes don't have to be quite as performant, and their storage can be tuned for reads. Then we move into the cold phase. This calls for storage-optimized hardware, because at this point we care more about capacity: we want to store data really densely for a long time, since the index is no longer being written to and is queried very infrequently. This would be a good time to freeze the data, so it uses even fewer resources in your cluster while still being accessible to queries. And then lastly, we have our delete phase. This is where we retire our data because it's no longer useful to us and we just don't have the space for it.
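Putting those phases together, a sketch of an ILM policy might look like the following. The policy name, thresholds, and ages are purely illustrative, not recommendations:

```
PUT _ilm/policy/app-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",  // roll over on primary shard size...
            "max_age": "7d"                    // ...or on age, whichever comes first
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },  // optimize segments for reads
          "shrink": { "number_of_shards": 1 }       // reduce the shard count
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {}                          // data sits on dense, cheaper storage
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}                         // retire the data entirely
        }
      }
    }
  }
}
```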
Data streams are a really great way of taking time series data, spreading it across multiple indices with automated rollover and lifecycle management, and then referencing it in both indexing and search requests as a single resource. You don't have to reference all of the individual indices; the data stream abstracts them away. In this context they're called backing indices, and they're named with a .ds- prefix (for data stream), then the data stream name (here just data-stream), and then a generation number. Whenever we want to read from the data stream, we can just query the data stream name instead of all of the indices, and the data stream will automatically route that read request to all of the backing indices. Similarly for indexing, we don't have to reference a specific index name. We just reference the data stream, and it automatically writes to whatever the hot index is, whatever index is actively being indexed into. It's just a really intuitive way of indexing into and reading from a continuous time series stream of data. For all of this to work, we need an ILM policy and an index template, and then we can actually create the data stream.
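To make that concrete, here's a rough end-to-end sketch in Kibana Dev Tools. It assumes the app-logs-policy from the earlier ILM example already exists, and the data stream name data-stream is just for illustration:

```
// an index template that marks matching names as data streams
PUT _index_template/data-stream-template
{
  "index_patterns": ["data-stream*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "app-logs-policy"  // ties the backing indices to ILM
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" }         // data streams require a @timestamp field
      }
    }
  }
}

// the first document auto-creates the data stream;
// writes always land in the current hot backing index
POST data-stream/_doc
{
  "@timestamp": "2023-05-01T12:00:00Z",
  "message": "service started"
}

// reads against the data stream name fan out across all backing indices
GET data-stream/_search
{
  "query": { "match": { "message": "started" } }
}
```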