ElasticSearch: fast and reliable search on your AEM platform

02 February 2016
Marek Krokosinski

These days every website needs to perform well. It has to provide fast access to various kinds of data, queried by different criteria.

We all know that one of the main roles of Adobe AEM is to deliver content to users. There is usually a lot of content, and users want to find what they are looking for easily, whether by page title, article body or geolocation.

Adobe AEM is not a search engine, and it probably never will be. Performance improves with each release, but if your website holds a few hundred thousand articles, images or other items, then without the right knowledge and some time spent customising the out-of-the-box Lucene indexes, your users may wait several seconds for their results. And even if they could accept that search on your website is a bit slow, you won't get features such as geolocation or fuzzy search, because AEM doesn't offer them out of the box.

So I'd like to share with you what I think is one of the best search engines currently available: ElasticSearch. There are others, such as Solr, but in my opinion ElasticSearch is the best because it's easy to learn, the documentation quality is very good, it has an excellent community and new features are developed and released all the time.

About ElasticSearch

ElasticSearch was first released in 2010, so compared to Solr it is a young product, but it is growing in popularity. ElasticSearch claims to be used by companies like Facebook, Twitter and Wikipedia. You have to admit that those companies are big fish, and every day more and more companies are choosing ElasticSearch for various reasons. I'm sure that one of the reasons behind its popularity is that ElasticSearch is an open-source project which uses Apache Lucene under the hood. And if it's open source, it's free (well, almost: some related products are licensed).

ElasticSearch boasts functionalities such as:

  • Simple, REST API – for developers and administrators
  • Easy to scale – you can add new nodes to your cluster very easily, and then start tuning the number of primary shards and replica shards.
  • Schema free – there is no fixed schema, so you can keep whatever you like in the documents in your index, including nested documents (see the sketch after this list)!
  • Real-time analytics – add Kibana and you get analytics on top of your ElasticSearch instance
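
Here is a minimal sketch of the "schema free" point: indexing a document that contains a nested object without defining any mapping first. The index name "articles", the type name "article" and the field names are made up for illustration, and "client" is assumed to be an already connected Java client (client setup is covered later in this post).

import java.io.IOException;

import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentBuilder;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class SchemaFreeExample {

    public IndexResponse indexArticle(Client client) throws IOException {
        XContentBuilder document = jsonBuilder()
            .startObject()
                .field("title", "ElasticSearch and AEM")
                .startObject("author")                 // nested object, no schema needed
                    .field("name", "Marek")
                    .field("location", "Poznan")
                .endObject()
            .endObject();

        // ElasticSearch creates the index and derives the field mappings on the fly.
        return client.prepareIndex("articles", "article", "1")
            .setSource(document)
            .get();
    }
}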

Featuring the Search API

ElasticSearch has many features that might become useful in your daily work including a well-developed Search API.

This API boasts Search Aggregations, Suggesters, Highlighting, Search Templates, Percolate API and much, much more that you need to discover!
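
As a taste of the Search API, here is a small, non-authoritative sketch of a query through the Java client. It combines a fuzzy match query with highlighting, two of the features mentioned above; the "articles" index and the "title" field are the hypothetical names used throughout this post.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.Fuzziness;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class SearchExample {

    public void search(Client client, String phrase) {
        SearchResponse response = client.prepareSearch("articles")   // hypothetical index name
            .setQuery(QueryBuilders.matchQuery("title", phrase)
                .fuzziness(Fuzziness.AUTO))                          // tolerate typos in the phrase
            .addHighlightedField("title")                            // return highlighted fragments
            .setSize(10)
            .get();

        for (SearchHit hit : response.getHits().getHits()) {
            System.out.println(hit.getSourceAsString());
        }
    }
}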

It is, in my opinion, the most important API from a developer's point of view, but it is only one of many APIs which ElasticSearch has to offer. The ElasticSearch creators have also provided:

  • Indices API – which can be used to manage individual indices, index settings, aliases, mappings, index templates and warmers.
  • Document API – which covers the basic CRUD operations on your index
  • cat API – which makes ElasticSearch responses more readable to the human eye (and I'm not talking about a formatted JSON response)
  • Cluster API – can you guess what this API does? With this API you can check the status of your cluster and nodes and get information about them (see the sketch after this list).
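
Below is a small sketch of the Document and Cluster APIs through the Java client, again assuming a connected client and the hypothetical "articles" index and "article" type used above.

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;

public class DocumentAndClusterExample {

    public void run(Client client) {
        // Document API: fetch a single document by id...
        GetResponse doc = client.prepareGet("articles", "article", "1").get();
        System.out.println(doc.getSourceAsString());

        // ...and delete it again.
        client.prepareDelete("articles", "article", "1").get();

        // Cluster API: check the overall health of the cluster.
        ClusterHealthResponse health = client.admin().cluster().prepareHealth().get();
        System.out.println(health.getStatus());
    }
}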

Other standout features

ElasticSearch lets you specify mappings and analysers for the fields in your index and lets you create custom analysers and tokenizers (although it already provides quite a long list of built-in ones). There are many more features (a fantastic query caching mechanism, for example) that I could describe here, but you can check them all out by visiting the ElasticSearch product page or the ElasticSearch Reference documentation.
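
To illustrate custom analysers, here is a hedged sketch of creating an index whose settings define a simple custom analyser and whose mapping applies it to a title field. The analyser definition, index name and field names are examples only.

import org.elasticsearch.client.Client;

public class CustomAnalyserExample {

    public void createIndex(Client client) {
        // Index settings defining a custom analyser built from existing tokenizer and filters.
        String settings = "{"
            + "  \"analysis\": {"
            + "    \"analyzer\": {"
            + "      \"folded\": {"
            + "        \"type\": \"custom\","
            + "        \"tokenizer\": \"standard\","
            + "        \"filter\": [\"lowercase\", \"asciifolding\"]"
            + "      }"
            + "    }"
            + "  }"
            + "}";

        // Mapping telling ElasticSearch to analyse the title field with that analyser.
        String mapping = "{"
            + "  \"article\": {"
            + "    \"properties\": {"
            + "      \"title\": { \"type\": \"string\", \"analyzer\": \"folded\" }"
            + "    }"
            + "  }"
            + "}";

        client.admin().indices().prepareCreate("articles")
            .setSettings(settings)
            .addMapping("article", mapping)
            .get();
    }
}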

The project is still actively developed, and each release brings a pack of improvements and new features. The newest stable version is 2.1.1, and the product just keeps getting better.

It’s also worth mentioning that ElasticSearch is available as a Service in the Amazon cloud, which makes management of your ElasticSearch instances much easier.

ElasticSearch and AEM

Integrating ElasticSearch with AEM is fairly simple. The ElasticSearch team offers a Java API which, as with all Elastic products, is very well documented and updated with each release.

Integration

To start using ElasticSearch you need to add the Maven dependency to your project:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>2.1.1</version>
</dependency>

It is also worth getting an OSGI Bundle from Apache:

<dependency>
    <groupId>org.apache.servicemix.bundles</groupId>
    <artifactId>org.apache.servicemix.bundles.elasticsearch</artifactId>
    <version>2.1.1_1</version>
</dependency>

Thanks to those dependencies you can start developing the business logic related to ElasticSearch. You can, for example, create a search box Sightly component which sends queries to ElasticSearch and displays the results in whatever format you like (a rough sketch of such a component's back-end follows below).
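
This is a rough, non-authoritative sketch of the back-end of such a Sightly component, using Sling Models. SearchService and SearchResult are hypothetical classes of your own: a thin OSGi service wrapping the ElasticSearch Java client and a simple result object; they are not part of any AEM or ElasticSearch API.

import java.util.Collections;
import java.util.List;

import javax.annotation.PostConstruct;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.OSGiService;
import org.apache.sling.models.annotations.injectorspecific.Self;

@Model(adaptables = SlingHttpServletRequest.class)
public class SearchBoxModel {

    @Self
    private SlingHttpServletRequest request;

    @OSGiService
    private SearchService searchService;   // hypothetical wrapper around the ES Java client

    private List<SearchResult> results = Collections.emptyList();

    @PostConstruct
    protected void init() {
        String query = request.getParameter("q");
        if (query != null && !query.isEmpty()) {
            results = searchService.search(query);   // hypothetical method
        }
    }

    public List<SearchResult> getResults() {
        return results;
    }
}

The Sightly template of the component would then simply iterate over getResults().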

It's good to note that ElasticSearch data can come not only from AEM, but also from third parties. How you organise your indexes and your documents is up to you.

Exporting data from AEM to ElasticSearch

There are two good ways to export your data from AEM to an ElasticSearch instance. The first is based on OSGi events and Sling Jobs, and the second is based on replication agents.

Prerequisite

To connect to your ElasticSearch instance you will need a client. The ElasticSearch Java API documentation describes how to create one: Java Client.

There are two types of clients available out of the box: the Node client and the Transport client. Both have their downsides, so please choose wisely! A minimal Transport client setup is sketched below.
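
Here is a minimal sketch of a Transport client for ElasticSearch 2.1.1. The cluster name, host and port are placeholders; adjust them to your own environment.

import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class ClientFactory {

    public Client createClient() throws UnknownHostException {
        Settings settings = Settings.settingsBuilder()
            .put("cluster.name", "my-cluster")            // placeholder cluster name
            .build();

        return TransportClient.builder()
            .settings(settings)
            .build()
            .addTransportAddress(
                new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));
    }
}

In an AEM project this would typically live inside an OSGi service, so that the client is created once and shared.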

OSGI Events & Sling Jobs

In this approach, content coming from AEM can be indexed on demand, for example when one of our OSGi events is fired. It's worth noting that the EventAdmin will add our listener to its blacklist if its execution time exceeds the configured timeout (5000 milliseconds by default). To avoid that, the listener itself should do nothing more than add a Sling Job; the actual indexing logic lives in the job, and Sling guarantees that the job will eventually be executed. Of course, adding a separate queue for those jobs may also be a good idea. A sketch of such a listener and its job consumer follows below.
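
The sketch below shows this pattern with two separate classes, using Felix SCR annotations (common in AEM projects of that era). The job topic "com/example/elasticsearch/index" and the resource-added event topic are examples only; the actual ElasticSearch call is left out and would reuse a client like the one shown earlier.

import java.util.HashMap;
import java.util.Map;

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.SlingConstants;
import org.apache.sling.event.jobs.JobManager;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventConstants;
import org.osgi.service.event.EventHandler;

@Component(immediate = true)
@Service(EventHandler.class)
@Property(name = EventConstants.EVENT_TOPIC, value = SlingConstants.TOPIC_RESOURCE_ADDED)
public class IndexingEventListener implements EventHandler {

    @Reference
    private JobManager jobManager;

    @Override
    public void handleEvent(Event event) {
        // Keep this method fast: only queue a Sling Job, never call ElasticSearch here.
        Map<String, Object> properties = new HashMap<>();
        properties.put("path", event.getProperty(SlingConstants.PROPERTY_PATH));
        jobManager.addJob("com/example/elasticsearch/index", properties);
    }
}

The matching Sling Job consumer is where the indexing logic would live:

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.consumer.JobConsumer;

@Component(immediate = true)
@Service(JobConsumer.class)
@Property(name = JobConsumer.PROPERTY_TOPICS, value = "com/example/elasticsearch/index")
public class ElasticsearchIndexingJob implements JobConsumer {

    @Override
    public JobResult process(Job job) {
        String path = (String) job.getProperty("path");
        // Resolve the resource at "path", build a JSON document and send it to ElasticSearch.
        return JobResult.OK;
    }
}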

This approach is a good fit if you need data to be indexed on demand, or, for example, if you want to index your data on the author instance (without activating your content).

Replication Agent

Another way of putting your content into the ElasticSearch index is to create a new replication agent which does more or less the same thing as the Sling Job from the first approach; we could even reuse the same Sling Job. In this scenario our content is pushed to the index during activation. A sketch of a custom transport handler backing such an agent follows below.
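
This is a hedged sketch based on the standard AEM replication API. The "elastic://" URI scheme is an arbitrary convention used to bind this handler to the agent's transport URI, and the indexing call itself is omitted.

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Service;

import com.day.cq.replication.AgentConfig;
import com.day.cq.replication.ReplicationException;
import com.day.cq.replication.ReplicationResult;
import com.day.cq.replication.ReplicationTransaction;
import com.day.cq.replication.TransportContext;
import com.day.cq.replication.TransportHandler;

@Component
@Service(TransportHandler.class)
public class ElasticsearchTransportHandler implements TransportHandler {

    @Override
    public boolean canHandle(AgentConfig config) {
        // Only handle agents whose transport URI points at our ElasticSearch endpoint.
        String uri = config == null ? null : config.getTransportURI();
        return uri != null && uri.startsWith("elastic://");
    }

    @Override
    public ReplicationResult deliver(TransportContext ctx, ReplicationTransaction tx)
            throws ReplicationException {
        String path = tx.getAction().getPath();
        // On activation: load the content at "path", convert it to JSON and index it.
        // On deactivation: remove the corresponding document from the index.
        return ReplicationResult.OK;
    }
}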

Maintenance

ElasticSearch is quite easy to maintain, thanks to the API and tools that are available. It's easy (which doesn't mean "cheap") to achieve a "zero downtime" approach, thanks to simple functionalities like index aliases.

For example, whenever we want to add a new field to the index, change mappings or even re-index the entire content, we can add an alias to our old index so it remains available to users, and build a completely new index with the new settings or content. When the indexing process is done, we simply point the alias at the new index (see the sketch below). Users won't even notice that anything has changed.
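
Here is a minimal sketch of that alias switch through the Java API; "articles" is the alias users query, while "articles_v1" and "articles_v2" are hypothetical old and new index names.

import org.elasticsearch.client.Client;

public class AliasSwitchExample {

    public void switchAlias(Client client) {
        // Atomically move the alias from the old index to the freshly built one.
        client.admin().indices().prepareAliases()
            .removeAlias("articles_v1", "articles")
            .addAlias("articles_v2", "articles")
            .get();
    }
}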

I'll tell you why it doesn't mean "cheap". There is one simple reason: if your index is very big, say 1 TB, you will need twice as much disk space, because 1 TB is taken by the old index and another 1 TB has to be reserved for the new one. Of course you can delete the old index later, but while the data is being re-indexed you need to have enough disk space and memory.

Maintenance is also very easy because of the API which ElasticSearch provides. With a few simple requests you can create and configure your index, and later on you can get up-to-date information about your cluster and nodes. If this is not enough, you can always use external tools like ElasticHQ, and if you are using AWS, Amazon offers its own panel to manage the ElasticSearch service.

How and where to start?

Here are some links which may help you to start (some of them were already mentioned in this blog post):

I recommend you check out other Elastic products like Watcher or Kibana, because they integrate very well with ElasticSearch.