Autocomplete with Elasticsearch - Part 2: Index-Time Search-as-You-Type

June 5, 2019

Elastic Stack

In the previous article, we looked into using prefix queries to create suggestions from existing data and enhance the search experience. We saw how fast and straightforward that approach is to set up. We also learned that it has drawbacks, such as latency and duplicates, once the data set grows larger over time. In this article, we overcome these problems with the Edge NGram Tokenizer.

Stats for Nerds

The most played song during writing: Los Angeles by The Midnight
Time spent writing: 54 minutes
Estimated reading time: 5 minutes
Photo by Émile Perron on Unsplash
We use Elasticsearch v7.1.1

Edge NGram Tokenizer

This explanation is going to be dry :scream:.

The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.

Source: Official reference

The default behaviour is:

With the default settings, the edge_ngram tokenizer treats the original text as a single token and produces N-grams with minimum length 1 and maximum length 2:

GET _analyze
{
  "tokenizer": "edge_ngram",
  "text": ["Los Angeles","Love","Paris","Pain"]
}

These examples create the distinct terms:

[L, Lo, P, Pa]
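The default behaviour can be imitated in a few lines of Python. This is only a rough sketch for illustration: the helper name edge_ngrams is made up here, and it assumes the whole input is treated as one token, as the quoted default describes.

```python
def edge_ngrams(text, min_gram=1, max_gram=2):
    """Return the prefixes of `text` with lengths min_gram..max_gram.

    Hypothetical helper that mimics the default edge_ngram settings
    (whole text as a single token, min length 1, max length 2).
    """
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


inputs = ["Los Angeles", "Love", "Paris", "Pain"]

# Collect the distinct terms in first-seen order.
terms = []
for text in inputs:
    for gram in edge_ngrams(text):
        if gram not in terms:
            terms.append(gram)

print(terms)  # ['L', 'Lo', 'P', 'Pa']
```

"Los Angeles" and "Love" collapse into the same two terms, which is why the defaults are too coarse for suggestions.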

For autocompletion, the defaults need adjustment. When you type Lo, do you mean Love or Los Angeles? With a maximum length of 2, the index cannot tell them apart, so we have to increase the maximum length. The terms are also case-sensitive, and most users don't type capital letters, so we need to lowercase the terms.

Define Autocomplete Analyzer

Usually, Elasticsearch recommends using the same analyzer at index time and at search time.

In the case of the edge_ngram tokenizer, the advice is different. It only makes sense to use the edge_ngram tokenizer at index time, to ensure that partial words are available for matching in the index.

That's why Elasticsearch refers to it as the Index-Time Search-as-You-Type method.
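To see why the search side should not produce N-grams as well, consider this small Python sketch. It models matching as plain term overlap, which is a simplification of what Elasticsearch does, and the helper edge_ngrams and the query word "lettuce" are assumptions chosen for illustration.

```python
def edge_ngrams(word, min_gram=2, max_gram=20):
    """Hypothetical helper: set of prefixes of `word`, lengths min_gram..max_gram."""
    return {word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)}


# Terms stored at index time for the text "love letter".
indexed = edge_ngrams("love") | edge_ngrams("letter")

query = "lettuce"

# Search analyzer that only lowercases: the query stays one whole term.
plain_query_terms = {query.lower()}

# Search analyzer that also emits edge N-grams: the query explodes into prefixes.
ngram_query_terms = edge_ngrams(query.lower())

print(sorted(indexed & plain_query_terms))  # []
print(sorted(indexed & ngram_query_terms))  # ['le', 'let', 'lett']
```

With N-grams on both sides, "lettuce" would share the prefixes "le", "let" and "lett" with "letter" and could produce a misleading hit. Keeping the query as a whole term avoids that.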

We define the index wisdom to store quotes. The field quote has an index analyzer and a search analyzer.

PUT wisdom
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

Let us analyze a quote by Damian Conway.

POST wisdom/_analyze
{
  "analyzer": "autocomplete",
  "text": "Documentation is a love letter that you write to your future self."
}

This results in these terms (the one-letter word "a" is dropped because it is shorter than min_gram):

"do","doc","docu","docum","docume","documen","document",
"documenta","documentat","documentati","documentatio","documentation",
"is","lo","lov","love","le","let","lett","lette","letter",
"th","tha","that","yo","you","wr","wri","writ","write",
"to","yo","you","your","fu","fut","futu","futur","future","se","sel","self"
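The term list can be reproduced outside Elasticsearch with a short Python sketch. It is an approximation under stated assumptions: the function name autocomplete_analyze is made up, and a regular expression stands in for the "letter" token_chars setting.

```python
import re


def autocomplete_analyze(text, min_gram=2, max_gram=20):
    """Hypothetical imitation of the autocomplete analyzer defined above."""
    tokens = []
    # token_chars ["letter"]: every run of letters is one word,
    # anything else (spaces, digits, punctuation) is a separator.
    for word in re.findall(r"[^\W\d_]+", text):
        word = word.lower()  # the lowercase filter
        # Emit every prefix between min_gram and max_gram characters.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


quote = "Documentation is a love letter that you write to your future self."
tokens = autocomplete_analyze(quote)

print(tokens[:3])     # ['do', 'doc', 'docu']
print(len(tokens))    # 42
print("a" in tokens)  # False
```

The 42 tokens match the list above, including the repeated "yo" and "you" that come from both "you" and "your".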

Now we index the quote.

PUT wisdom/_doc/112
{
  "quote": "Documentation is a love letter that you write to your future self."
}

Now we can use a simple match query with fuzziness, so partial and misspelled words such as let, luve and wrote still find the quote.

GET wisdom/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "let luve wrote",
        "operator": "and",
        "fuzziness": 2
      }
    }
  }
}

We get our quote back (response abbreviated):

{
   "hits" : [
      {
        "_index" : "wisdom",
        "_type" : "_doc",
        "_id" : "112",
        "_score" : 1.6925297,
        "_source" : {
          "quote" : "Documentation is a love letter that you write to your future self."
        }
      }
    ]
}
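Why do the misspelled words still match? Fuzziness allows a limited number of single-character edits between the query term and an indexed term. The sketch below computes a plain Levenshtein distance in Python. It is a simplification: by default Elasticsearch also counts a transposition of two adjacent characters as a single edit, which this helper does not model.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


# Typed query term versus an indexed term from the quote.
for typed, indexed in [("let", "let"), ("luve", "love"), ("wrote", "write")]:
    print(typed, indexed, levenshtein(typed, indexed))
# let let 0
# luve love 1
# wrote write 1
```

All three distances are within the allowed fuzziness of 2, so every query term finds a partner in the index and the "and" operator is satisfied.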

Conclusion

This approach keeps queries fast, even on large data sets, but may result in slower indexing and higher disk space consumption, because the inverted index needs to store more terms. We highly recommend reading the Definitive Guide, as it has additional examples, e.g. for zip codes. With the proper setup, this method might satisfy your autocomplete needs.
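To get a feeling for the extra index data, this rough Python sketch compares the number of distinct terms for our quote with and without edge N-grams. It only counts terms and says nothing about real disk usage. The regular expression and the min_gram/max_gram values of 2 and 20 mirror the analyzer from this article.

```python
import re

quote = "Documentation is a love letter that you write to your future self."
words = [w.lower() for w in re.findall(r"[^\W\d_]+", quote)]

# Whole-word indexing: one term per distinct word.
word_terms = set(words)

# Edge N-gram indexing (min_gram=2, max_gram=20): every prefix becomes a term.
ngram_terms = {w[:n] for w in words for n in range(2, min(20, len(w)) + 1)}

print(len(word_terms), len(ngram_terms))  # 12 40
```

A single sentence already produces more than three times as many distinct terms, and longer words widen the gap further.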

Elasticsearch offers a third alternative with completion suggesters, which provide top-notch performance but require more memory.

When you need search-as-you-type for text which has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when trying to autocomplete words that can appear in any order.

Source: Edge NGram Tokenizer Reference

We are going to look into suggesters in the next article.
