First step with Elastic
If you are making your first move in data, you are probably looking for a quick introduction to the basic terminology and concepts of Elasticsearch. For those who are new to Elasticsearch and the Elastic Stack in general, this is going to be a quick lesson describing the core components of the Elastic Stack and how Elasticsearch, which is what I am going to focus on, fits into the overall stack.
First, we have Beats. Beats are lightweight data shippers. These are clients that you install on end-user machines or wherever the data you want to collect lives. There is a Beat for just about everything: logs, audit data, network packets and more. Once you’ve deployed the Beats and configured them to collect the data you want, you can send it either directly to Elasticsearch or, if you want to parse and process that data further, to Logstash.
Logstash is a data processing pipeline, and it is quite a powerful one on its own. Even outside of the Elastic Stack there are a lot of use cases for Logstash, because it can take inputs from just about anywhere and, once it’s done filtering, send that data to just about anywhere. In the Elastic Stack, you’re typically going to take inputs from Beats and output to Elasticsearch. But again, you can input from anywhere, output to anywhere, and filter and mutate your data in between as much as you like.
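To make the input, filter, output idea concrete, here is a minimal sketch in plain Python. It is not Logstash and does not use any Logstash API: the function names, the "LEVEL text" log format and the rule that drops DEBUG lines are all assumptions made up for illustration.

```python
import json
from typing import Iterable, Iterator, List


def read_input(lines: Iterable[str]) -> Iterator[dict]:
    """Input stage: wrap each raw line in an event (stands in for a Beats input)."""
    for line in lines:
        yield {"message": line.rstrip("\n")}


def filter_events(events: Iterable[dict]) -> Iterator[dict]:
    """Filter stage: parse hypothetical 'LEVEL text' lines and drop DEBUG noise."""
    for event in events:
        level, _, text = event["message"].partition(" ")
        if level == "DEBUG":
            continue  # drop events we do not want to index
        # Mutate the event: add structured fields next to the raw message.
        yield {**event, "level": level.lower(), "text": text}


def write_output(events: Iterable[dict]) -> List[str]:
    """Output stage: serialize events as JSON lines (stands in for an Elasticsearch output)."""
    return [json.dumps(event, sort_keys=True) for event in events]


raw_lines = ["INFO service started", "DEBUG cache warm", "ERROR disk full"]
shipped = write_output(filter_events(read_input(raw_lines)))
for line in shipped:
    print(line)
```

Each stage only knows about events, not about where they came from or where they go, which is why the input and the output can be swapped freely.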
Next is Elasticsearch. This is the search and analytics engine. It does both storage and analysis of data. A lot of other data tools do only storage or only analysis; Elasticsearch does both.
Lastly, we have Kibana, the visualization interface and also a management console. Kibana has an ever-growing number of applications, or plugins, inside of it. Every time Elastic buys a company, it incorporates that company’s functionality into Kibana. They’ve added all kinds of things like machine learning, APM, and more advanced visualization tools. But the core concept of Kibana is to discover, visualize, and dashboard. Over the years, Kibana has also become a bit of a management console for the entire Elastic Stack. There are a lot of really good tools built into Kibana to monitor and troubleshoot your cluster and to interact with the Elasticsearch APIs.
The data architecture
Let’s start with some basic Elasticsearch terms. The first element is the Elasticsearch cluster. A cluster has a master node and data nodes, and there should be more than 2 nodes so the data can be distributed. Master-eligible nodes are nodes that are able to be elected as master. The master node is what coordinates the entire cluster: it keeps track of where the data is, makes sure it’s replicated, and keeps track of all the nodes to make sure that we have a healthy cluster. Then we have the data nodes. These are the heavy lifters, the ones that actually store the data and perform searches on it. This is where all the heavy read-write operations are done and where our data is ultimately stored. But how is the data stored?
For that we have indexes (3 in this example). Data is stored in indexes, and indexes are made up of shards. That way, we can spread the data across multiple nodes to parallelize computation and storage, which increases performance, search throughput and the overall storage capacity of the index. This is how we’re able to horizontally scale our distributed storage.
We also have 2 different types of shards: primary shards and replica shards. A primary shard and its replica shard are absolutely identical. They have the same data in them; it is a document-for-document copy. The replica serves the purpose of redundancy and fault tolerance, and also helps a little with search throughput. If anything were to happen to our first data node, the primary shard on it would be lost. The master node would immediately recognize that, go to the replica shard, promote it to primary, and then start to re-replicate that shard somewhere else. Using replicas is how we get fault tolerance through redundancy.
Replicas also help with search throughput. For example, say I have a bunch of clients searching an index, and their search requests are hitting the shards on one node really hard. Because I have copies of this data elsewhere, I can perform search operations on them in parallel. That’s going to increase my search throughput, not my search speed. Speed depends on hardware: your actual hard drives and the overall sizing and distribution of your cluster. Search throughput, meaning how much searching can happen on the same data at the same time, can be improved with replication.
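The failover described above can be sketched as a small simulation. This is only an illustration of the idea, not how Elasticsearch implements it: the `handle_node_failure` function, the node names and the dictionary layout are all hypothetical.

```python
from typing import Dict

# Hypothetical layout: node name -> {shard id: "primary" or "replica"}
Cluster = Dict[str, Dict[int, str]]


def handle_node_failure(cluster: Cluster, failed: str) -> Cluster:
    """Return the layout after `failed` leaves, repaired the way the master would."""
    lost = cluster[failed]
    # Copy the surviving nodes so the original layout is left untouched.
    survivors = {node: dict(shards) for node, shards in cluster.items() if node != failed}
    for shard_id, role in lost.items():
        holders = [node for node, shards in survivors.items() if shard_id in shards]
        if not holders:
            continue  # no copy survived: this shard is lost
        if role == "primary":
            # Promote a surviving replica to primary.
            survivors[holders[0]][shard_id] = "primary"
        # Re-replicate: rebuild the missing copy on a node that has none yet.
        spare = next((n for n in sorted(survivors) if shard_id not in survivors[n]), None)
        if spare is not None:
            survivors[spare][shard_id] = "replica"
    return survivors


cluster = {
    "node-1": {1: "primary"},
    "node-2": {1: "replica"},
    "node-3": {},
}
after = handle_node_failure(cluster, "node-1")
print(after)
```

In this run the replica on node-2 becomes the new primary and a fresh replica is rebuilt on node-3, so the shard is again stored twice.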
There are different degrees of replication. If you have primary shard 1, replica shard 1, and then another replica shard 1, that is 2 degrees of replication. You can have as many replicas as you like, n replicas plus the 1 primary, as long as you have enough data nodes to support them. It’s worth noting that replica shard 1 and primary shard 1 can never exist on the same node. The master node will not allow it. So if you only have one data node, you can’t have any replicas. You’ll never be able to allocate them, because there’s no point in redundancy if it sits on a single point of failure.
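Here is a minimal sketch of that allocation rule, assuming a simple round-robin placement. The `allocate` function and the `P0`/`R0` labels are made up for illustration; Elasticsearch’s real allocator weighs many more factors.

```python
from typing import Dict, List, Tuple


def allocate(nodes: List[str], shards: int, replicas: int) -> Tuple[Dict[str, List[str]], List[str]]:
    """Place 1 primary plus `replicas` copies per shard, never two copies of a shard on one node."""
    placement: Dict[str, List[str]] = {node: [] for node in nodes}
    unassigned: List[str] = []
    for shard in range(shards):
        copies = [f"P{shard}"] + [f"R{shard}"] * replicas
        for offset, copy in enumerate(copies):
            if offset < len(nodes):
                # Each copy of this shard goes to a different node; the starting
                # node rotates per shard so primaries spread across the cluster.
                node = nodes[(shard + offset) % len(nodes)]
                placement[node].append(copy)
            else:
                # Every node already holds a copy of this shard.
                unassigned.append(copy)
    return placement, unassigned


single_placement, single_unassigned = allocate(["node-1"], shards=1, replicas=1)
multi_placement, multi_unassigned = allocate(["node-1", "node-2", "node-3"], shards=2, replicas=1)
print(single_placement, single_unassigned)
print(multi_placement, multi_unassigned)
```

With a single node the replica stays unassigned, while with three nodes every copy finds a home.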
All replicas of a shard have to exist on a different node from the primary shard.
The state (health) of an index is how the cluster reports whether the shards of that index, including its replicas, are allocated. It uses colors: an index can be green, yellow, or red. Green means that all of the primary and desired replica shards are allocated, so the replicas will be available in case of a failure. Yellow means that one or more replica shards are not allocated, but all of my primaries are still allocated, so we’re not losing any data. It’s just that some of the replication is missing. And then red means that we are missing primary shards and there aren’t any replicas to promote. That’s a very, very bad state that usually coincides with data loss.
Let’s assume a third data node failed for some reason. Maybe it’s in a data center that had a power outage or a network failure, and it has suddenly been removed from the cluster. Index 1 is going to enter a yellow state: we still have all of our primary shards, but we don’t have 2 degrees of replication anymore.
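The color rules boil down to a few comparisons. The sketch below only restates the rules as described here; the `index_health` function and its parameters are hypothetical, and in practice Elasticsearch computes and reports health for you.

```python
def index_health(primaries_total: int, primaries_assigned: int,
                 replicas_total: int, replicas_assigned: int) -> str:
    """Return the health color of an index from its shard allocation counts."""
    if primaries_assigned < primaries_total:
        return "red"     # at least one primary is missing
    if replicas_assigned < replicas_total:
        return "yellow"  # all primaries allocated, some replicas are not
    return "green"       # every primary and every desired replica is allocated


# One primary shard with 2 degrees of replication, everything allocated.
healthy = index_health(1, 1, 2, 2)
# A node holding one of the replicas has left the cluster.
degraded = index_health(1, 1, 2, 1)
# The primary is gone and there is no replica to promote.
broken = index_health(1, 0, 2, 0)
print(healthy, degraded, broken)
```

The second call matches the node failure above: the primary is still there, but one of the two replicas is gone, so the index turns yellow.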
That was just the tip of the iceberg with Elasticsearch.