First steps, searching for data with Elastic
To understand the concepts and tools of data search with Elastic, we first need to understand some terms and concepts relating to the Elasticsearch Query DSL.
That’s the first thing I am going to cover in this post. Then I am going to cover term-level search queries, followed by full-text search queries. These are non-analyzed and analyzed search queries, respectively. I will cover what that means as well.
I am going to take it a step further by presenting compound queries: combinations of multiple queries, all in one, using some Boolean logic. For a very long-running query that you don’t want to run synchronously while you wait for the response, I am going to show how to create and execute asynchronous search queries. This is basically a fire-and-forget query that you can come back to and check the results of later. Last, I will explain how to execute cross-cluster search queries: when you have more than one cluster, you configure a remote cluster on another cluster and search across both of them from a single cluster. Trust me, it’s a lot simpler than it sounds. Let’s go ahead and jump in and start searching some data in Elasticsearch.
Let’s first understand the Elasticsearch Query DSL, which is a domain-specific language. We need to understand some terms and concepts before we start getting our hands dirty. The first thing we need to cover is the difference between analyzed and non-analyzed fields, and ultimately analyzed and non-analyzed searches, in Elasticsearch. So let’s compare and contrast these 2.
We have the original text here, which is the same for both:
“The students learned a NEW concept.”
On the analyzed side, the first thing we want to do is break this text up into individual tokens, because we want to be able to search this field with search clauses that are only partial matches. We might tokenize this into something like “students, learned, NEW, concept.” Depending on the analyzer being used, the tokenization step could look a little bit different. There are domain-specific analyzers. The default standard analyzer keeps every word and lowercases it, so it would break this up into “the, students, learned, a, new, concept.”
In this case, I’m showcasing the English analyzer, which is language specific. It has a little more context about the English language, so it knows to get rid of the words “the” and “a,” because they’re just not very important.
On the non-analyzed side, we’re not technically tokenizing anything at all. Or, if you want to think about it another way, we are tokenizing it, but it’s just one big token. The text is not being morphed into anything else. We’re just storing it literally.
To take the analysis another step further, we want to normalize each of our tokens. Currently we have the token “students,” but I want to be able to search for the word “student.” We also have “learned,” which is a verb, and I want to be able to search for “learns” or “learning.” So we can normalize these tokens into their base words. Now we have “student, learn, NEW, concept.” The other thing we can do is change the case of everything. Typically you’re going to lowercase everything. Now we have case-insensitive base words of the original text: “student, learn, new, concept.”
On the non-analyzed side, we do have the option to normalize our original text. It’s not typical to do this with non-analyzed fields, but it is an option. When you’re specifying the field mappings for an index, you can add a normalizer. In this case, we’re showing a lowercase normalizer for our original text.
So what does this actually mean? How does storing our data like this impact the way we search it? Let’s go ahead and take a look at that.
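Before moving on, here is a minimal Python sketch of the analysis steps described above. It is a toy and not how Elasticsearch implements its analyzers: the stop word list, the suffix-stripping rule, and the function names (toy_stem, toy_analyze, toy_keyword) are all assumptions made up for illustration.

```python
import re

# Tiny illustrative stop word list (an assumption, not the real English list).
STOP_WORDS = {"the", "a"}


def toy_stem(token):
    """Strip a few common suffixes. Real stemmers are far more careful."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_analyze(text):
    """Analyzed: tokenize, lowercase, drop stop words, reduce to base words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)           # split into word tokens
    tokens = [t.lower() for t in tokens]                  # make case-insensitive
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop "the" and "a"
    return [toy_stem(t) for t in tokens]


def toy_keyword(text, lowercase=False):
    """Non-analyzed: the whole text is one token, optionally lowercased."""
    return [text.lower() if lowercase else text]


original = "The students learned a NEW concept."
print(toy_analyze(original))                  # ['student', 'learn', 'new', 'concept']
print(toy_keyword(original))                  # ['The students learned a NEW concept.']
print(toy_keyword(original, lowercase=True))  # ['the students learned a new concept.']
```

The analyzed side ends up with four small tokens, while the non-analyzed side keeps one big token, with or without the lowercase step.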
Our original text is being stored as “student learn new concept” for our analyzed field. Our non-analyzed field, on the other side, holds the whole text as one token: “the students learned a new concept.” For an analyzed search, all of these search clauses would match a document that had our stored field: “Students,” “new concepts,” “how to learn,” “never stop learning.”
We have different forms of the base words, with different casing and punctuation. But once we analyze our search clauses the same way we analyzed our stored field, we can match up the tokens that are the same and create a match. What about a search clause that uses entirely different words? One thing we can do when we analyze our fields is configure synonyms. “Learn” and “understand” are synonyms, “new” and “fresh” are synonyms, and “concept” and “idea” are synonyms. Not only can we match different forms of the words based on how they’re tokenized and normalized, but we can also match tokens that are structurally completely different but semantically the same. On the non-analyzed side, the only way I can ever match a document with this stored field is if I search for the exact same text: word for word, letter for letter, case for case, and with the same punctuation.
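To make the contrast concrete, here is a small Python sketch of both kinds of matching. It is only an illustration under stated assumptions: the analyze function is a toy (naive stemming, a two-word stop list, a hand-written synonym table), and “a match means at least one shared token” is a simplification of what a full-text query does.

```python
import re

STOP_WORDS = {"the", "a"}  # illustrative only
SYNONYMS = {"understand": "learn", "fresh": "new", "idea": "concept"}


def analyze(text):
    """Toy analysis: tokenize, lowercase, drop stop words, stem, map synonyms."""
    tokens = set()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):  # naive stemming
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        tokens.add(SYNONYMS.get(token, token))
    return tokens


def analyzed_match(stored_text, search_clause):
    """Match if the analyzed clause shares at least one token with the field."""
    return bool(analyze(stored_text) & analyze(search_clause))


def exact_match(stored_text, search_clause):
    """Non-analyzed comparison: the two strings must be identical."""
    return stored_text == search_clause


stored = "The students learned a NEW concept."
clauses = ["Students", "new concepts", "how to learn",
           "never stop learning", "I understand fresh ideas"]
for clause in clauses:
    print(clause, analyzed_match(stored, clause), exact_match(stored, clause))
```

Every clause in the list matches on the analyzed side and none of them matches on the non-analyzed side. Only the full original string, character for character, passes exact_match.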
Non-analyzed search is very much just true or false. It matches, or it doesn’t. Analyzed search is a bit more fluid. We’re figuring out how well something matches our search, which brings up an interesting question. If we’re doing an analyzed search on an analyzed field, how do we know which matches are more relevant than others? We solve it with relevancy scoring. Relevancy scoring tells us how well a search clause matches a document. This is a very complicated process in Elasticsearch, and it is kind of its own science. I could do a whole post on how relevancy scoring works, but for the purpose of this one I am just going to scratch the surface.
The first thing we look at when determining the relevancy of a match is term frequency. How often does the search term appear in the field that we matched? The more times it appears, the more relevant the match is. Next, we look at inverse document frequency. How often does the search term appear across the entire index we’re searching against? The more documents in the index that contain the search term, the less weight that term carries, so a rare term counts for more than a common one. The last aspect is field length normalization. How many other terms are in the field? The more terms there are in the field along with my search term, the less relevant that match is going to be. The fewer the terms in the field along with my search term, the more relevant that match is going to be.
The last thing I want to cover in this section of the post is the difference between the query and filter contexts. As we’re creating more complex search queries, oftentimes we want to filter down the data set that we’re searching against before we actually do any queries. Sometimes we want to query the data without affecting relevancy scoring. Sometimes we want to query the data while generating some kind of relevancy score. We can do this by leveraging the query and filter contexts. The query context basically asks the question, how well does the search clause match the document? The answer is a relevancy score, so searching within the query context will generate relevancy scores. Searching within the filter context, on the other hand, basically asks the question, does the search clause match the document? Yes or no, true or false?
This is very much a boolean output, but more importantly, queries within the filter context do not affect the relevancy score in any way.
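Going back to the three relevancy factors from earlier in this section, here is a rough Python sketch of how they can be combined. It uses the textbook form of BM25, which is the default similarity in Elasticsearch, with the usual defaults k1 = 1.2 and b = 0.75. The function names and the sample numbers are made up for illustration, and the exact constants inside Elasticsearch may differ, so treat the absolute values as meaningless and only compare them to each other.

```python
import math


def idf(total_docs, docs_with_term):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term_score(term_freq, field_len, avg_field_len,
                    total_docs, docs_with_term, k1=1.2, b=0.75):
    """Textbook BM25 score of one search term in one field."""
    # Field length normalization: longer-than-average fields are penalized.
    length_norm = 1 - b + b * (field_len / avg_field_len)
    # Term frequency: more occurrences help, with diminishing returns.
    tf_part = term_freq * (k1 + 1) / (term_freq + k1 * length_norm)
    return idf(total_docs, docs_with_term) * tf_part


# Hypothetical index of 1000 documents with an average field length of 10 terms.
base = bm25_term_score(1, 10, 10, 1000, 50)
print("term appears 3 times :", bm25_term_score(3, 10, 10, 1000, 50) > base)
print("term is rarer        :", bm25_term_score(1, 10, 10, 1000, 5) > base)
print("field is longer      :", bm25_term_score(1, 50, 10, 1000, 50) < base)
```

Each line prints True: a higher term frequency and a rarer term raise the score, and a longer field lowers it.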
Term-level search queries
In the previous section, I covered the difference between something being analyzed and non-analyzed. Term-level queries are non-analyzed search queries. This means that the search clause itself is not going to be analyzed, so we are searching for documents that are exact matches. For example, if we search for the word “yes,” all lowercase, we may get zero results. But if we search for the word “Yes,” we may get some results. If this was an analyzed search, both of these search terms would get the same results. Because we’re not doing an analyzed search, we have to have an exact match.
I also mentioned in the previous section that you can, in fact, normalize search terms for term-level queries. However, you need to do this at the data mapping phase, when you’re first defining how your data is stored in Elasticsearch. I haven’t covered data mappings yet, but I will cover them in upcoming posts. You have to configure a normalizer on the keyword fields that your term-level queries target. Note that you want to avoid using term-level queries, that is, non-analyzed queries, on analyzed fields. If you do, you’re going to end up with some really weird and unexpected results, because the unanalyzed search term gets compared against analyzed tokens.
Elasticsearch will let you do it, but it’s generally not a good idea.
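As a preview of the normalizer mentioned above, here is a rough sketch of what the index settings and mapping could look like, built as a Python dictionary and printed as JSON. The normalizer name (lowercase_normalizer) and the field name (answer) are hypothetical. The body would be sent when creating the index, which is left out here.

```python
import json

# Hypothetical index definition: a keyword field with a lowercase normalizer.
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                # Custom normalizer that only lowercases the value.
                "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            # Still one big token, but stored and searched in lowercase.
            "answer": {"type": "keyword", "normalizer": "lowercase_normalizer"}
        }
    },
}

print(json.dumps(index_body, indent=2))
```

With a mapping like this, a term-level search for “Yes” and one for “yes” against the answer field should behave the same, because the normalizer is applied to the stored value and to the search term.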
Now, Elasticsearch has dozens and dozens of queries available to you. I am not going to cover them all, just a few. The reason is that once you understand how a few queries work and the syntax of them, you can pretty much write a query for anything. It’s very important that you look at the documentation and familiarize yourself with the Query DSL portion of it, so that you know all of the queries that are available to you.
The most basic term-level query, which I am going to cover first, is just called the term query. Basically, it searches for one value in one field. Super simple.
For example, you can search the DestCountry field for one of the countries that appear in your index. DestCountry goes in the “FIELD” position of the query. The field here is a keyword data type, and keyword data types are non-analyzed strings.
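Here is a minimal sketch of a term query body for that example, built in Python and printed as JSON. The helper name term_query and the value “US” are assumptions for illustration. Use a value that actually exists in your own index, with the exact same casing.

```python
import json


def term_query(field, value):
    """Build a search body with a single term query: one value in one field."""
    return {"query": {"term": {field: {"value": value}}}}


# "US" is a hypothetical value; a term query needs the exact stored value.
body = term_query("DestCountry", "US")
print(json.dumps(body, indent=2))
```

The printed JSON is what you would send as the body of a request to the _search API of your index.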
Full-text search queries
In contrast to the previous section, full-text search queries are analyzed searches: we are searching for documents based on analyzed values. Take the same 2 searches we did in the previous section. Both queries look for “yes.” One is lowercase, the other is capitalized with a period, and they both get the exact same number of results, because the search clause is tokenized and normalized and then compared to the tokens produced by that same process in the documents.
Now, whenever you’re doing a full-text or analyzed search, the analyzer used against your search clause is, by default, the same analyzer that was used on the field when it was stored in Elasticsearch. It is possible to create your own custom analyzers to have complete and granular control over how your data is analyzed when it’s stored, and then when you go to search your data, it’s going to use that same custom analyzer at search time as well. We want to avoid using term-level queries against analyzed fields. The reverse is also true: we want to avoid using full-text queries on non-analyzed fields. You want both the search and the field to be analyzed, or both to be non-analyzed. You don’t want to intermix those. But keep in mind Elasticsearch will let you. It will give you some results, but those results are oftentimes going to be unexpected, and I can’t really think of a scenario in which you would want that kind of behavior.
Another query to know is the match query. Very simple: search a field with a text entry. Text data types are analyzed. A match_phrase query is another option for matching text, but it looks for the exact phrase you provide: the words have to appear together and in the same order. You can run it against a text field with the very same search clause, but remember you are only going to get hits for entries that contain that exact phrase. For example, if you want to scan your Linux OS access logs for root access, you can use the text “root login.”
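A short sketch of both query bodies for the “root login” example follows, again as Python dictionaries printed as JSON. The field name message and the helper names are assumptions. Your log index may keep this text in a different field.

```python
import json


def match_query(field, text):
    """Analyzed search: documents containing any of the analyzed terms match."""
    return {"query": {"match": {field: {"query": text}}}}


def match_phrase_query(field, text):
    """Analyzed search where the terms must appear together, in this order."""
    return {"query": {"match_phrase": {field: {"query": text}}}}


# "message" is a hypothetical text field holding the log line.
loose = match_query("message", "root login")
strict = match_phrase_query("message", "root login")

print(json.dumps(loose, indent=2))
print(json.dumps(strict, indent=2))
```

The match body can also hit log lines that contain only “root” or only “login,” while the match_phrase body only hits lines where the two words appear next to each other in that order.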
Compound search queries
Compound search queries are a combination, typically a boolean combination, of other queries. The bool query is the most common compound query. It’s broken up into different clauses. The first clause is “must.” These are search terms that must appear in a matching document (a clause can hold more than one query), and they are scored.
The other clause is “should.” These are search terms that should appear, and you can configure the minimum number of should queries that have to match. You may have 5 queries in the should section and require only 3 of them, so at least 3 of the 5 have to match. These queries are also scored.
Then we have queries within the “must_not” clause. These are search terms that cannot appear, and anything in this clause is not relevancy scored.
Lastly, we have the “filter” clause. The filter clause lives within the filter context, so all search terms in here must appear, but they’re not scored.
This is basically the same as the must clause, but there is no relevancy scoring.
The final relevancy score for the bool query is going to be a combination of the scores from the must and should clauses.
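Putting the four clauses together, a bool query body could look something like the sketch below. Every field name and value in it (message, user, host_os, status) is hypothetical and only there to show where each clause goes. The minimum number of should queries is set with the minimum_should_match parameter.

```python
import json

bool_query = {
    "query": {
        "bool": {
            # Must appear, contributes to the relevancy score.
            "must": [{"match": {"message": "login"}}],
            # Should appear, contributes to the relevancy score.
            "should": [
                {"match": {"message": "root"}},
                {"match": {"message": "failed"}},
                {"match": {"message": "ssh"}},
            ],
            # At least 2 of the 3 should queries have to match.
            "minimum_should_match": 2,
            # Cannot appear, not scored.
            "must_not": [{"term": {"user": "backup"}}],
            # Must appear, but runs in the filter context, so not scored.
            "filter": [{"term": {"host_os": "linux"}}],
        }
    }
}

print(json.dumps(bool_query, indent=2))
```

Only the must and should parts of this body affect the score. The must_not and filter parts just decide which documents are in or out.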
Asynchronous search queries
So far I have covered examples of what are called synchronous searches. These are searches that we fire off, and then we sit there and wait until the results come back. But what happens if you have a very big query, or you just have a ton of data to search, and it takes a really long time for that search to execute? That’s where asynchronous searches come in. This is a way to fire off a search and then just come back later when it’s done. We don’t have to sit there and wait for it to complete.
First, we need to write the query, and instead of doing a GET request to the _search API, we’re going to do a POST request to the _async_search API. That, honestly, is pretty much the only difference. The query portion itself is exactly the same as for the _search API. Once we fire off that query, we get a response back that includes the ID of the asynchronous search. Because we fire this thing off without waiting for the full response, we need a way of checking on the progress of that search. By sending a request to the _async_search status API with the ID of our search, we can see its status. We can also get the results of our asynchronous search using the ID. It’s worth noting that you can get the results of an asynchronous search while it’s still executing. You can get partial results, which is really useful for particularly long searches, but also if you’re using this as the backend search engine for a website or an application. If the user fires off a search, you want to be able to show live updates as that search is running. Lastly, when we’re all done with the search, we can delete the asynchronous search using the same ID.
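The whole lifecycle comes down to four requests. The sketch below only builds them as (method, path, body) tuples and does not talk to a cluster. The index name my-logs, the ID EXAMPLE_ID, and the helper names are placeholders I made up. A real ID comes back in the response to the first request.

```python
def submit_request(index, query_body):
    """Start the search: same body as _search, but POSTed to _async_search."""
    return ("POST", f"/{index}/_async_search", query_body)


def status_request(search_id):
    """Check progress without fetching any results."""
    return ("GET", f"/_async_search/status/{search_id}", None)


def results_request(search_id):
    """Fetch results; may be partial while the search is still running."""
    return ("GET", f"/_async_search/{search_id}", None)


def delete_request(search_id):
    """Clean up the search once we are done with it."""
    return ("DELETE", f"/_async_search/{search_id}", None)


query = {"query": {"match": {"message": "root login"}}}  # hypothetical query
steps = [
    submit_request("my-logs", query),
    status_request("EXAMPLE_ID"),
    results_request("EXAMPLE_ID"),
    delete_request("EXAMPLE_ID"),
]
for method, path, body in steps:
    print(method, path, body if body is not None else "")
```

In an application, the status and results requests would typically be repeated until the search reports that it has finished.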
Cross-cluster search queries
How do we execute searches across multiple Elasticsearch clusters? Cross-cluster search allows us to configure a cluster to be, essentially, a coordinating cluster that we can search remote clusters through. We can do this against one or more remote clusters. There’s really not a limit to it.
Let’s assume we have 3 clusters in total, and the client wants to search across all 3 just by using Cluster 1. The client fires off its search request. Cluster 1 goes ahead and searches its local data. It also proxies that search request to the other 2 remote clusters. Each of those clusters returns its results, Cluster 1 combines the search results from Clusters 1, 2, and 3, and then it sends all of that data back to the client. So it’s really quite straightforward. It’s actually a lot easier to set up and to use than it probably sounds.
How do we set it up? In order to configure a remote cluster, we need to do a PUT request to the _cluster/settings API on the cluster we will search from. This API allows us to set all kinds of cluster settings, and we can set them in 2 different ways. We can either do a transient setting, which resets back to its default after a restart, or a persistent setting, which does not reset back to its default value after a restart. If you choose persistent, add a cluster remote setting and give the remote cluster a name. You can call it whatever you want. Then define the seeds. Seeds is an array of nodes of the other cluster that this cluster needs to reach out to. Each seed is the private IP address of a remote node followed by a port.
Now, there are 2 main ports that Elasticsearch uses. 9200 is the port for the HTTP interface, that is, the client interface. That’s how we interact with the Elasticsearch APIs. Elasticsearch nodes, however, talk to each other over the transport network, which uses port 9300 by default. This is the internode communication port, and it is the port you want to use when configuring remote clusters. Some people put 9200 because that’s the default port they always use when interacting with Elasticsearch, but it’s important to remember that you need to use the internode communication port, 9300, when setting up remote cluster search.
This is how we do cross-cluster search in Elasticsearch.
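To close, here is a sketch of the settings body described above, plus the path for a search that spans the local and the remote cluster. The cluster name cluster_two, the IP address 10.0.0.2, the index name my-logs, and the helper names are all made up for illustration. The cluster_name:index prefix in the search path is how a remote index is addressed in a search request, which this post has not walked through, so treat that line as a pointer to the documentation.

```python
import json

TRANSPORT_PORT = 9300  # internode port, not the 9200 HTTP port


def seed(private_ip, port=TRANSPORT_PORT):
    """Format one seed node as 'ip:port'."""
    return f"{private_ip}:{port}"


def remote_cluster_settings(name, seeds):
    """Body for PUT /_cluster/settings that registers a remote cluster."""
    return {"persistent": {"cluster": {"remote": {name: {"seeds": seeds}}}}}


# Hypothetical remote cluster with a single seed node.
settings = remote_cluster_settings("cluster_two", [seed("10.0.0.2")])
print("PUT /_cluster/settings")
print(json.dumps(settings, indent=2))

# Search the local index and the same index on the remote cluster in one request.
search_path = "/my-logs,cluster_two:my-logs/_search"
print("GET", search_path)
```

The settings request goes to the cluster you will search from (Cluster 1 in the example above), and the search request is then sent to that same cluster.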