Autocomplete with Elasticsearch - Part 3: Completion Suggester

June 7, 2019

In the previous articles, we looked into Prefix Queries and the Edge NGram Tokenizer to generate search-as-you-type suggestions. Suggesters are a more advanced solution in Elasticsearch that returns similar-looking terms based on your text input. Movie, song or job titles have a widely known or popular order, which makes them good candidates for suggestions. In this article, we complete the picture with a hands-on example.

Stats for Nerds

Completion Suggester

An excellent explanation from the official reference:

The completion suggester provides auto-complete/search-as-you-type functionality. This is a navigational feature to guide users to relevant results as they are typing, improving search precision. It is not meant for spell correction or did-you-mean functionality like the term or phrase suggesters.

However, it can tolerate typos, and you can adjust how many with the fuzziness option.

Ideally, auto-complete functionality should be as fast as a user types to provide instant feedback relevant to what a user has already typed in. Hence, completion suggester is optimized for speed. The suggester uses data structures that enable fast lookups, but are costly to build and are stored in-memory.

These data structures are weighted Finite State Transducers, FSTs for short. If you have a hungry mind, look at the source code on GitHub in org.elasticsearch.index.mapper.CompletionFieldMapper. There is also a blog post from Elastic that describes the inner workings of FSTs.

This is a significant change. The previous methods searched the text already stored in text and keyword fields. Now we additionally store suggestions in the document, and hence we can tweak the rank of the document.

Hands-On

We simulate a career network that provides job opportunities. We need to define two fields in the job index.

  1. The title is a keyword field. It is only relevant for storing data.
  2. The suggest field is of type completion.

PUT jobs
{
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword"
      },
      "suggest": {
        "type": "completion"
      }
    }
  }
}

Example Data

We store the following suggestion document.

PUT jobs/_doc/1?refresh
{
  "suggest": [
    {
      "input": [
        "Software Engineer",
        "Software Architect"
      ],
      "weight": 3
    },
    {
      "input": [
        "Software Developer",
        "Software Programmer"
      ],
      "weight": 2
    },
    {
      "input": "Software Manager",
      "weight": 1
    }
  ]
}

A second document:

PUT jobs/_doc/2?refresh
{
  "suggest": [
    {
      "input": [
        "Solution Architect",
        "Solution Designer"
      ],
      "weight": 1
    }
  ]
}

A third document:

PUT jobs/_doc/3?refresh
{
  "suggest": [
    {
      "input": "Engineer",
      "weight": 2
    },
    {
      "input": "Software Engineer",
      "weight": 1
    }
  ]
}

Query for Engineers

Now we search for Engineers. The user types eng.

POST jobs/_search
{
  "suggest": {
    "job-suggest": {
      "prefix": "eng",
      "completion": {
        "field": "suggest"
      }
    }
  }
}

Elasticsearch returns:

{
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "job-suggest" : [
      {
        "text" : "eng",
        "offset" : 0,
        "length" : 3,
        "options" : [
          {
            "text" : "Engineer",
            "_index" : "jobs",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 3.0,
            "_source" : {
              "suggest" : [
                {
                  "input" : "Engineer",
                  "weight" : 2
                },
                {
                  "input" : "Software Engineer",
                  "weight" : 1
                }
              ]
            }
          }
        ]
      }
    ]
  }
}

Engineer takes the first rank. Since we do not know whether the user is really searching for Software Engineer, we gave that input the lower weight, which puts it on the second rank. It does not show up in this response at all, because eng is not a prefix of Software Engineer.

An input field can hold several canonical names or aliases for a single term. With the inputs starting with Engineer (doc 3) and Software (doc 1), we cover both ways to arrive at a decent suggestion for Software Engineer.

Weights can be defined for each suggestion entry to control the ranking. By typing eng, we do not know for sure that the user is searching for Software Engineer (weight 1), but it is a safe bet that they are looking for an Engineer (weight 2).
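
The interplay of prefix matching and weights can be modelled in a few lines of Python. This is only a simplified sketch, not how Elasticsearch implements it (the real suggester works on FSTs). It assumes case-insensitive matching, keeps the best matching input per document, and breaks ties alphabetically, which is an arbitrary choice made for this sketch. The names DOCS and suggest are made up for illustration.

```python
from typing import Dict, List, Tuple

# The three example documents, reduced to (input, weight) pairs per document id.
DOCS: Dict[str, List[Tuple[str, int]]] = {
    "1": [("Software Engineer", 3), ("Software Architect", 3),
          ("Software Developer", 2), ("Software Programmer", 2),
          ("Software Manager", 1)],
    "2": [("Solution Architect", 1), ("Solution Designer", 1)],
    "3": [("Engineer", 2), ("Software Engineer", 1)],
}


def suggest(prefix: str, docs: Dict[str, List[Tuple[str, int]]] = DOCS) -> List[Tuple[str, str, int]]:
    """Return (doc_id, text, weight) tuples, one per matching document, highest weight first."""
    needle = prefix.lower()
    options = []
    for doc_id, inputs in docs.items():
        # Matching always starts at the beginning of an input.
        matches = [(text, weight) for text, weight in inputs
                   if text.lower().startswith(needle)]
        if matches:
            # Highest weight wins; ties are broken alphabetically (sketch assumption).
            text, weight = min(matches, key=lambda m: (-m[1], m[0]))
            options.append((doc_id, text, weight))
    return sorted(options, key=lambda o: (-o[2], o[1]))


print(suggest("eng"))  # [('3', 'Engineer', 2)]
print(suggest("sol"))  # [('2', 'Solution Architect', 1)]
```

In this model, Software Engineer from document 3 only becomes a candidate once the user types a prefix of Software.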

Query for Solutions

Now we query for the prefix sol.

POST jobs/_search
{
  "suggest": {
    "job-suggest": {
      "prefix": "sol",
      "completion": {
        "field": "suggest"
      }
    }
  }
}

Elasticsearch returns:

{
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "job-suggest" : [
      {
        "text" : "sol",
        "offset" : 0,
        "length" : 3,
        "options" : [
          {
            "text" : "Solution Architect",
            "_index" : "jobs",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 1.0,
            "_source" : {
              "suggest" : [
                {
                  "input" : [
                    "Solution Architect",
                    "Solution Designer"
                  ],
                  "weight" : 1
                }
              ]
            }
          }
        ]
      }
    ]
  }
}
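
A client usually needs only the suggested text, the document id and the score from such a response. The following Python sketch shows one way to pull them out of the parsed JSON. The function name extract_options is hypothetical, and the response dictionary is an abbreviated copy of the output above.

```python
from typing import Any, Dict, List, Tuple


def extract_options(response: Dict[str, Any], name: str) -> List[Tuple[str, str, float]]:
    """Collect (text, _id, _score) for every option of the named suggestion."""
    options = []
    for entry in response.get("suggest", {}).get(name, []):
        for option in entry.get("options", []):
            options.append((option["text"], option["_id"], option["_score"]))
    return options


# Abbreviated copy of the response for the prefix "sol".
response = {
    "suggest": {
        "job-suggest": [
            {
                "text": "sol",
                "offset": 0,
                "length": 3,
                "options": [
                    {"text": "Solution Architect", "_id": "2", "_score": 1.0}
                ],
            }
        ]
    }
}

print(extract_options(response, "job-suggest"))  # [('Solution Architect', '2', 1.0)]
```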

Query with Fuzziness

Assume that sol, which matched Solution Architect, was a typo and you are actually searching for Software Developers. You add fuzziness to the query.

POST jobs/_search
{
  "suggest": {
    "job-suggest": {
      "prefix": "sol",
      "completion": {
        "field": "suggest",
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}

Elasticsearch returns the three suggestions:

{
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "job-suggest" : [
      {
        "text" : "sol",
        "offset" : 0,
        "length" : 3,
        "options" : [
          {
            "text" : "Software Architect",
            "_index" : "jobs",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 6.0,
            "_source" : {
              "suggest" : [
                {
                  "input" : [
                    "Software Engineer",
                    "Software Architect"
                  ],
                  "weight" : 3
                },
                {
                  "input" : [
                    "Software Developer",
                    "Software Programmer"
                  ],
                  "weight" : 2
                },
                {
                  "input" : "Software Manager",
                  "weight" : 1
                }
              ]
            }
          },
          {
            "text" : "Software Engineer",
            "_index" : "jobs",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 2.0,
            "_source" : {
              "suggest" : [
                {
                  "input" : "Engineer",
                  "weight" : 2
                },
                {
                  "input" : "Software Engineer",
                  "weight" : 1
                }
              ]
            }
          },
          {
            "text" : "Solution Architect",
            "_index" : "jobs",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 2.0,
            "_source" : {
              "suggest" : [
                {
                  "input" : [
                    "Solution Architect",
                    "Solution Designer"
                  ],
                  "weight" : 1
                }
              ]
            }
          }
        ]
      }
    ]
  }
}
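
To see why sol now also reaches the Software inputs, it helps to compute the edit distance between the typed text and the beginning of each input. The Python sketch below is a rough model under stated assumptions: it uses plain Levenshtein distance, matches case-insensitively, and ignores the further fuzzy options of the real suggester (for example transpositions or prefix_length) as well as its scoring. The function names are made up for illustration.

```python
def prefix_edit_distance(query: str, candidate: str) -> int:
    """Smallest Levenshtein distance between query and any prefix of candidate."""
    q, c = query.lower(), candidate.lower()
    # prev[j] holds the distance between q[:i] and c[:j] for the current row i.
    prev = list(range(len(c) + 1))
    for i in range(1, len(q) + 1):
        curr = [i] + [0] * len(c)
        for j in range(1, len(c) + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # drop a character of the query
                          curr[j - 1] + 1,     # skip a character of the candidate
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    # The last row compares the whole query with every prefix of the candidate.
    return min(prev)


def fuzzy_prefix_match(query: str, candidate: str, fuzziness: int = 1) -> bool:
    """True if some prefix of candidate is within `fuzziness` edits of query."""
    return prefix_edit_distance(query, candidate) <= fuzziness


print(prefix_edit_distance("sol", "Solution Architect"))  # 0
print(prefix_edit_distance("sol", "Software Architect"))  # 1
print(fuzzy_prefix_match("sol", "Engineer"))              # False
```

With a fuzziness of 1, both the Solution and the Software inputs qualify, while Engineer stays out.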

Disadvantages

Besides the necessarily increased memory usage, matching always starts at the beginning of the text.

A search for business does not return the job title Senior Business Developer. One way to overcome this is to tokenize the input text on spaces and keep all trailing phrases as alternative inputs.

Senior Business Developer then needs a suggestion document with these inputs:

Senior Business Developer
Business Developer
Developer
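
Generating these trailing phrases is easy to automate. A minimal Python sketch, assuming titles are split on whitespace only; the helper names trailing_phrases and suggest_entry are hypothetical:

```python
from typing import Any, Dict, List


def trailing_phrases(title: str) -> List[str]:
    """Split a title on whitespace and return every trailing phrase, longest first."""
    tokens = title.split()
    return [" ".join(tokens[i:]) for i in range(len(tokens))]


def suggest_entry(title: str, weight: int = 1) -> Dict[str, Any]:
    """Build one entry for the suggest field of a job document."""
    return {"input": trailing_phrases(title), "weight": weight}


print(trailing_phrases("Senior Business Developer"))
# ['Senior Business Developer', 'Business Developer', 'Developer']
```

The result of suggest_entry can be placed in the suggest array of a document like the ones indexed earlier.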

The reverse does not necessarily hold: the term developer may yield different results. Working with suggestions is no trivial task, but you can generate suggestions based on existing data.

For instance, you can collect all jobs that contain a term with the prefix dev, tokenize their titles, drop all terms before the matching one, and store the results as a new suggestion document.
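
As a rough illustration of that idea, the following Python sketch derives suggestion entries from a list of existing titles. It is a sketch under assumptions: the titles are held in memory instead of coming from an Elasticsearch aggregation, all titles except Senior Business Developer are invented sample data, and using the number of occurrences as the weight is just one possible choice.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def suggestions_for_prefix(titles: Iterable[str], prefix: str) -> List[Dict[str, Any]]:
    """Cut each title at its first term starting with prefix and count the results."""
    needle = prefix.lower()
    counts: Counter = Counter()
    for title in titles:
        tokens = title.split()
        for i, token in enumerate(tokens):
            if token.lower().startswith(needle):
                # Drop all terms before the matching one.
                counts[" ".join(tokens[i:])] += 1
                break
    # Assumption of this sketch: more frequent phrases get a higher weight.
    return [{"input": phrase, "weight": n} for phrase, n in counts.most_common()]


titles = [
    "Senior Business Developer",
    "Java Developer",       # hypothetical sample title
    "Developer Advocate",   # hypothetical sample title
    "DevOps Engineer",      # hypothetical sample title
]
for entry in suggestions_for_prefix(titles, "dev"):
    print(entry)
```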

Tending and curating proper suggestions is a challenging task. We should not underestimate the effort to enhance the user experience for our customers.

Conclusion

The Completion Suggester is a state-of-the-art provider of auto-complete/search-as-you-type functionality.

In the next article, I will demonstrate how to combine all auto-completion methods into a full monty example in the area of public transportation.

About the author: Vinh Nguyên

Loves to code, hike and mostly drink black coffee. Favors Apache Kafka, Elasticsearch, Java Development and 80's music.
