Elastic, Data Processing
Let's take a minute to quickly summarize everything I covered in my previous Elastic posts. First, how to highlight search terms, and how to change not only what we highlight but the way we highlight it. Then we discussed how to sort our search results: we can define a hierarchy of sort criteria, change the order, and even change the aggregation mode for multi-value fields. Next, I paginated the search results by setting a size and a from offset, which lets us page through results by incrementing the from offset by a multiple of the size value. We then used the aliases API, along with some component and index template settings, to create and remove aliases for various indices. Lastly, we built reusable, parameterized search templates so that search applications don't have to construct these long queries every single time they want a new page, a different sorting method, or whatever it might be. We can just pass a list of parameters to Elasticsearch and let Elasticsearch do the rest. I can even configure the search templates with default values, so I don't have to pass all the parameters if I don't want to. So there are a lot of great features built into Elasticsearch that make it easy to develop search applications.
Explicit Mappings
In Elasticsearch we can explicitly, or manually, define the fields and their data types. We might want to do this to control exactly which string fields are analyzed. We can also decide which numbers should be integers, floats, percentages, IP addresses, and so on, and we can customize the date format for date or timestamp fields. To write a mapping, first look at the field names and the data types you want assigned to them. For example, a geo_point should be assembled from latitude and longitude, but those values might arrive as two separate fields that never get combined into a geo_point. That is a perfect example of why you want an explicit mapping: mapping latitude and longitude into a single geo_point field simplifies searching. Let's take another example, a database of product reviews. When building the mapping for that index we would probably find a product_id, which could be an integer, and a first_name for the reviewer, which doesn't need to be analyzed, so we can map it as a keyword. The review itself will be a text type, an analyzed string field that uses the standard analyzer, or an analyzer for whichever language the text is in. The result is an explicit mapping that can be reused in templates whenever we need it. There are tons of possibilities in Elasticsearch; definitely check out the documentation to see what data types are available, so that you know what to use and when.
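Here's a rough sketch of what an explicit mapping for that product reviews index could look like; the index name and field names are just assumptions for illustration:

```
PUT /product_reviews
{
  "mappings": {
    "properties": {
      "product_id": { "type": "integer" },
      "first_name": { "type": "keyword" },
      "review":     { "type": "text", "analyzer": "english" },
      "location":   { "type": "geo_point" }
    }
  }
}
```

With this in place, product_id is stored as a number, first_name is searchable only as an exact keyword, the review text is analyzed for full-text search, and location behaves as a single geo_point instead of two loose coordinate fields.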
Dynamic Mappings
The opposite of explicit mapping is dynamic mapping, where Elasticsearch automatically adds new fields and assigns their data types. Dynamic mapping is enabled by default: when you index a document containing a field that doesn't already exist in the index, the field is added automatically. Elasticsearch determines the mapping for that field with data detectors: it will try to figure out whether the value is a date, a string, or a number, and if it's a number, whether it's a float or an integer. Elasticsearch has some really good built-in logic for working out what the data type should be.

All of that is well and good, but a lot of the time you'll want to customize this behavior, which is possible with dynamic templates. First we create a component template that hosts the dynamic mapping definition. For example, we can create a dynamic mapping that converts integers to floats. Say we index a whole number: the data detector will see that and index it as an integer (in a type that can hold larger numbers). We may want it to be a float instead, because maybe we're doing averages or some other aggregation on that data and want it to be able to hold decimal values. To map integers as floats, we define mappings under the template, but instead of defining properties on the mappings we define dynamic_templates, and you can have as many templates as you like. In each one we specify the mapping type we want to catch with match_mapping_type, and then define the mapping we actually want, in this case the type double, which is a floating-point type. Now, if this component template composes an index, whenever we ingest an integer for a field that doesn't already exist, it will be mapped as a double. It's worth mentioning that explicit mappings override dynamic mappings.
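A minimal sketch of that idea, assuming whole numbers are detected as the long type (the template names here are illustrative):

```
PUT _component_template/integers_as_doubles
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "integers_as_doubles": {
            "match_mapping_type": "long",
            "mapping": { "type": "double" }
          }
        }
      ]
    }
  }
}
```

Any index composed from this component template will map newly detected whole-number fields as double instead of long.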
Custom Analyzer
To understand custom analyzers in Elasticsearch, we first need to understand the anatomy of an analyzer. A character filter takes the string of text and adds, removes, or changes individual characters in that string. There are a bunch of different character filters available in Elasticsearch, way too many to ever go over here, so definitely check out the documentation for all of the available character filters. The result is then passed to a tokenizer. This is where we take the string of text, which already has the character replacements applied, and convert it into an array of individual tokens. There are different tokenizers with different characteristics: the classic tokenizer is really good for the English language, the standard tokenizer is generally good for just about anything, the whitespace tokenizer only splits on whitespace, and so on. Once we have our array of tokens, the last thing we can do is apply a token filter, or multiple token filters if we like. A token filter transforms tokens by adding, removing, or changing them, the same idea as a character filter, except that character filters operate on individual characters in the string, while token filters run after tokenization and operate on entire tokens.
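Here's a sketch of how those three pieces might be wired together in index settings; the names my_char_filter, english_stop, and my_analyzer are illustrative assumptions:

```
PUT /my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop"]
        }
      }
    }
  }
}
```

The character filter rewrites characters first, the standard tokenizer splits the result into tokens, and the lowercase and English stop word filters then transform or drop individual tokens.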
Multi Fields
What are multi-fields? Multi-fields are simply a way of indexing the same field in multiple different ways, so a single field can be mapped, or indexed, in Elasticsearch in a variety of ways. A common use case is to map a string both as an analyzed text field for full-text searching and as a non-analyzed keyword field for doing aggregations. Another common use case is analyzing the same text field in different ways: maybe with the standard analyzer, maybe the whitespace analyzer, maybe the english analyzer. You can do all of that with multi-fields.
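A minimal sketch covering both use cases; the field name title and the sub-field names are assumptions:

```
PUT /my-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" },
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}
```

You would then query title for standard full-text search, title.keyword for exact matches and aggregations, and title.english for English-stemmed search, all backed by the same source value.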
Reindexing Documents
The main things we need to know about reindexing documents in Elasticsearch: first, we need a source, which can be a local index or even a remote one. Not only can we pull from a local or remote source, we can also filter it down with a query, so we only get the subset of the index we're after. Once we have our source figured out, we obviously need a destination, and here we must specify a local index to reindex into. That means in order to do a remote reindex, you have to run the operation on the cluster where you want the index to reside: only the source can be remote, not the destination. You can also run your data through an ingest node pipeline before it reaches the destination, so in between source and destination you can parse, mutate, and enrich your data.
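A hedged sketch of a remote reindex with a query filter and an ingest pipeline; the host, index names, field, and pipeline name are all assumptions:

```
POST _reindex
{
  "source": {
    "remote": { "host": "http://old-cluster:9200" },
    "index": "product_reviews_v1",
    "query": { "range": { "rating": { "gte": 4 } } }
  },
  "dest": {
    "index": "product_reviews_v2",
    "pipeline": "cleanup_pipeline"
  }
}
```

The request runs on the destination cluster, pulls only the matching documents from the remote source, passes them through the pipeline, and writes the results into the local index.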
Updating Documents
There are two different ways we can update a document in Elasticsearch. We can either do it explicitly for a specific document, which requires the document ID, or we can do it based on the results of a query, updating all documents that match. A common use case for updating documents is picking up mapping changes: you can change the mappings of an index even after data has been indexed, but the existing documents won't actually pick up those changes until they are updated. You can run an update_by_query with no query at all, which means it updates all documents, and without performing any specific change; running it by itself bumps each document to a new version and picks up the new mapping changes. Another use case is updating documents with a script to change their source values. This is a really flexible way to update documents because you can apply any logic you want in the script. Scripts here use Painless, Elasticsearch's scripting language, which is really easy to use and understand and has really good documentation as well. Another use case is to update documents with an ingest pipeline.
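Two hedged examples of those patterns; the index name, document ID, and field names are assumptions:

```
// Re-save every document in place so it picks up new mapping changes
POST /product_reviews/_update_by_query

// Update a single document's source with a Painless script
POST /product_reviews/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.helpful_votes += params.count",
    "params": { "count": 1 }
  }
}
```

The first call takes no body at all, which is exactly the "update with no query and no specific change" case described above.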
Ingest Pipelines
Ingest pipelines in Elasticsearch allow you to process and enrich your data. You do this with processors. There are a ton of different processors in Elasticsearch, and there's no way I can cover them all. You can have one or many processors, and they are all executed in order, so one processor can set something up for the next. And as described above, you can also execute ingest pipelines from the update_by_query and reindex APIs. For example, let's go back to the location field: if we have separate latitude and longitude fields, but we also have a location field holding the lat and long as a geo_point, that geo_point may be the only field I really care about in a location sense, because it can be used in a map visualization. I don't need the two separate fields, so they can be removed, simple as that. The same goes for a timestamp field that can hold all of the date and time information instead of keeping it in separate fields; I want one field that tells me the date and time.
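A sketch of such a pipeline; the field names latitude, longitude, location, date, time, and timestamp are assumptions about the incoming documents, and location would need to be mapped as a geo_point for the combined value to be treated as one:

```
PUT _ingest/pipeline/location_and_timestamp
{
  "processors": [
    {
      "set": {
        "field": "location",
        "value": "{{latitude}},{{longitude}}"
      }
    },
    {
      "remove": {
        "field": ["latitude", "longitude"]
      }
    },
    {
      "set": {
        "field": "timestamp",
        "value": "{{date}}T{{time}}"
      }
    },
    {
      "remove": {
        "field": ["date", "time"]
      }
    }
  ]
}
```

Because processors run in order, each set processor builds the combined field before the following remove processor drops the originals.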
Nested Arrays Of Objects
In Elasticsearch, whenever you have a nested array of objects, the objects end up getting flattened, which means the relationships between values, which value belongs to which object, are lost. For example, say we have two students, Amit Cohen and Tomer Oz. If I search for Amit Oz without handling this nested array of objects, that search is going to match, and I don't want it to, because there is no Amit Oz in the student array. In order to preserve the individuality of each of these objects, we need to use the nested data type. If we properly use the nested data type, then when we go to search we need to use a nested query, which would not allow Amit Oz to be a match; it would only match Amit Cohen or Tomer Oz.
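A sketch of both the nested mapping and the nested query; the index name school and the field names students.first and students.last are assumptions:

```
PUT /school
{
  "mappings": {
    "properties": {
      "students": {
        "type": "nested",
        "properties": {
          "first": { "type": "text" },
          "last":  { "type": "text" }
        }
      }
    }
  }
}

GET /school/_search
{
  "query": {
    "nested": {
      "path": "students",
      "query": {
        "bool": {
          "must": [
            { "match": { "students.first": "Amit" } },
            { "match": { "students.last": "Oz" } }
          ]
        }
      }
    }
  }
}
```

Because each student is indexed as its own nested document, both match clauses have to be satisfied by the same object, so Amit Cohen and Tomer Oz can match individually but Amit Oz cannot.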
Summary
Let's recap everything covered in this post: how to explicitly map fields in Elasticsearch by defining the mapping properties within an index or index template; how to dynamically map fields, not just how Elasticsearch automatically maps fields with data detection, but also how to override that behavior with dynamic templates; how to define a custom analyzer, with a custom character filter and an English stop words filter; how to define multi-fields and how Elasticsearch automatically indexes strings; how to reindex documents; and ingest pipelines, a very powerful processing capability built into Elasticsearch.