Mirror, Mirror on the Wall

November 11, 2019

Reading the title of this blog post, you will likely think of the fairy tale Snow White and the Seven Dwarfs. An association is a mental connection between two related terms. Making associations is a creative process that the human brain is very good at. Another creative process is using synonyms.

They say that beauty is in the eye of the beholder. Synonyms in Elasticsearch (with Apache Lucene under the hood) are an excellent technique to improve the search experience. If you are a Search Engineer or Data Scientist, this might be interesting for you. We will provide some examples of how to utilise this technique in an e-commerce scenario.

Synonym Explained

The Greek origins of the word are the prefix σύν (syn, “together”) and ὄνομα (ónoma, “name”). A synonym is a word having the same or nearly the same meaning as another word in the same language, domain or context. If you think of the word freedom, synonyms are liberty and independence. They are related in meaning.

Synonyms in Elasticsearch

How is this related to Elasticsearch? For a search engine, it is essential to know which terms in documents and queries should match, even though they look different. Since this is highly domain-specific, users need to provide the appropriate rules.

This can range over

Synonyms in Action

The following use case came up in a recent discussion between our software architects. Imagine you are selling computer components in an online shop. As the responsible Search Engineer, you want to ensure that potential customers find relevant products.

[Figure: storage synonyms]

If you search a German e-commerce shop for "Festplatte" (hard disk), you get relevant results thanks to synonyms. You might also just search for storage when you are actually looking for an SSD (Solid State Drive) rather than an HDD (Hard Disk Drive).

E-Commerce Example

Create the products index with synonyms. Synonyms can also be stored in synonym files. To keep our example simple, we define them directly in the index settings.

PUT /products?pretty
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "harddisk, hard-disk, hdd",
              "solid state disk,solid state drive, ssd"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "synonym"
      },
      "price": {
        "type": "integer"
      },
      "vendor": {
        "type": "keyword",
        "ignore_above": 64
      },
      "category": {
        "type": "keyword",
        "ignore_above": 128
      }
    }
  }
}
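To make the rule syntax concrete, here is a minimal Python sketch of how comma-separated equivalence rules behave. It is a simplified, phrase-level model and not Lucene's actual token-based implementation; the names parse_rules and expand are made up for illustration.

```python
# Simplified model of equivalent-synonym rules ("a, b, c").
# Assumption: phrases are compared as whole lowercase strings; the real
# synonym filter works on analyzed tokens instead.

RULES = [
    "harddisk, hard-disk, hdd",
    "solid state disk,solid state drive, ssd",
]


def parse_rules(rules):
    """Map every phrase to the full list of phrases in its rule."""
    groups = {}
    for rule in rules:
        terms = [t.strip().lower() for t in rule.split(",") if t.strip()]
        for term in terms:
            groups[term] = terms
    return groups


def expand(term, groups):
    """Return all synonyms of a term, or the term itself if none exist."""
    key = term.strip().lower()
    return groups.get(key, [key])


groups = parse_rules(RULES)
print(expand("HDD", groups))      # ['harddisk', 'hard-disk', 'hdd']
print(expand("unicorn", groups))  # ['unicorn']
```

Every phrase in a rule maps to all phrases of that rule, which is why a query for any one of them can reach documents containing the others.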

Example Data

Add some example data. We add an HDD and an SSD. The third product, our Flowable unicorn 🦄, is unrelated.

POST _bulk
{"index":{"_index":"products","_id":"1"}}
{"title":"Hard-Disk Seagate IronWolf (10TB, 3.5)","price":345,"vendor":"Seagate"}
{"index":{"_index":"products","_id":"2"}}
{"title":"Samsung 860 EVO Basic Solid State Drive (1000GB, 2.5)","price":145,"vendor":"Samsung"}
{"index":{"_index":"products","_id":"3"}}
{"title":"Unicorn BPMN","price":42,"vendor":"Flowable"}
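If the products come from application code rather than a hand-written request, the bulk body can be generated. The following is a sketch only: to_bulk_body is a hypothetical helper, and it assumes sequential numeric IDs as in the request above.

```python
import json


def to_bulk_body(index, products):
    """Build a newline-delimited JSON body for the _bulk endpoint.

    Hypothetical helper; IDs are assigned sequentially starting at 1.
    """
    lines = []
    for doc_id, product in enumerate(products, start=1):
        action = {"index": {"_index": index, "_id": str(doc_id)}}
        lines.append(json.dumps(action, separators=(",", ":")))
        lines.append(json.dumps(product, separators=(",", ":")))
    # The bulk API requires the final line to end with a newline as well.
    return "\n".join(lines) + "\n"


products = [
    {"title": "Hard-Disk Seagate IronWolf (10TB, 3.5)", "price": 345, "vendor": "Seagate"},
    {"title": "Unicorn BPMN", "price": 42, "vendor": "Flowable"},
]

body = to_bulk_body("products", products)
print(body)
```

Each document contributes two lines: the action line naming the index and ID, followed by the document source.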

Search for Product Title

We search for HDD, which is a synonym for Hard-Disk. Because the analyzer applies the lowercase filter, searching for hdd works just as well as HDD.

GET products/_search
{"query":{"match":{"title":"hdd"}}}

The ranking score of the HDD document outweighs the score of the SSD document.

{
  "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.1324935,
        "_source" : {
          "title" : "Hard-Disk Seagate IronWolf (10TB, 3.5)",
          "price" : 345,
          "vendor" : "Seagate"
        }
      },
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.4387857,
        "_source" : {
          "title" : "Samsung 860 EVO Basic Solid State Drive (1000GB, 2.5)",
          "price" : 145,
          "vendor" : "Samsung"
        }
      }
  ]
}

Search for SSD. SSD is a synonym for Solid State Drive.

GET products/_search
{"query":{"match":{"title":"ssd"}}}

Now we see the opposite effect compared to the previous search: the SSD document ranks first.

{
  "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 3.9248123,
        "_source" : {
          "title" : "Samsung 860 EVO Basic Solid State Drive (1000GB, 2.5)",
          "price" : 145,
          "vendor" : "Samsung"
        }
      },
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5158825,
        "_source" : {
          "title" : "Hard-Disk Seagate IronWolf (10TB, 3.5)",
          "price" : 345,
          "vendor" : "Seagate"
        }
      }
  ]
}

Mission accomplished. We have relevant search results. You could use your own custom application logic to remove documents whose score falls below a threshold, e.g. 1.
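Such a cut-off could be sketched in Python as follows. The function name filter_by_score is hypothetical, and the hits are reduced to the fields needed here. Elasticsearch also accepts a min_score parameter in the search request body, which applies the same idea on the server side. Keep in mind that scores are not normalised across queries, so a fixed threshold is an assumption that needs checking against real traffic.

```python
def filter_by_score(hits, min_score):
    """Keep only hits whose _score reaches the threshold."""
    return [hit for hit in hits if hit["_score"] >= min_score]


# Scores taken from the "ssd" search result above.
hits = [
    {"_id": "2", "_score": 3.9248123},
    {"_id": "1", "_score": 0.5158825},
]

relevant = filter_by_score(hits, min_score=1.0)
print([hit["_id"] for hit in relevant])  # ['2']
```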

As a Search Engineer, it is better to let the customer decide what is relevant and to improve relevance step by step through careful observation with tracking and analytics. It is a process of continuous improvement.

Search for Category

Sometimes it is useful to search for a less specific term. If you're not sure which term will be used, you can use a synonym search. You look for a storage replacement and get the HDD and SSD option.

Is adding storage as a synonym for HDD and SSD a good idea? Look below at the following synonym definition:

harddisk, hard-disk, hdd, storage
solid state disk, solid state drive, storage

In this small data set, the impact is low. The more synonyms you use, the less accurate the search results will be. It will weaken the score of the search results.
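The effect can be illustrated with a small Python sketch. It uses a simplified phrase-level model of the rules, not Lucene's token-based filter, and the function name expansions is made up for this example. Once storage appears in both rules, the HDD and SSD groups share a term, so a query for one group also reaches documents of the other.

```python
def expansions(rules):
    """Map each phrase to the set of phrases it expands to.

    Simplified assumption: a phrase that occurs in several rules
    expands to the union of all of those rules.
    """
    groups = {}
    for rule in rules:
        terms = {t.strip().lower() for t in rule.split(",") if t.strip()}
        for term in terms:
            groups.setdefault(term, set()).update(terms)
    return groups


separate = expansions([
    "harddisk, hard-disk, hdd",
    "solid state disk, solid state drive, ssd",
])
shared = expansions([
    "harddisk, hard-disk, hdd, storage",
    "solid state disk, solid state drive, storage",
])

# Without "storage" the two groups have nothing in common.
print(separate["hdd"] & separate["ssd"])            # set()
# With "storage" both groups overlap on that term.
print(shared["hdd"] & shared["solid state drive"])  # {'storage'}
# "storage" itself now expands to every phrase of both rules.
print(len(shared["storage"]))                       # 6
```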

Instead of using storage as a synonym, use it as a category. Using the category as an additional search criterion is far better. The mapping above already defines category as a keyword field, so just set it on the document.

PUT products/_doc/1
{
  "title": "Hard-Disk Seagate IronWolf (10TB, 3.5)",
  "price": 345,
  "vendor": "Seagate",
  "category": [ "Storage", "NAS" ]
}

Additionally, you can provide the category search in conjunction with the synonym search.
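One way to combine the two is a bool query with the synonym-aware match on title and a term filter on category. The Python sketch below only builds the request body; category_query is a hypothetical helper, and sending the body to Elasticsearch is left to whichever client you use.

```python
import json


def category_query(text, category):
    """Build a search body: full-text match on title, exact filter on category."""
    return {
        "query": {
            "bool": {
                # Scored part: analyzed with the synonym analyzer of the title field.
                "must": [{"match": {"title": text}}],
                # Unscored part: keyword fields match exactly, including case.
                "filter": [{"term": {"category": category}}],
            }
        }
    }


body = category_query("hdd", "Storage")
print(json.dumps(body, indent=2))
```

Because category is a keyword field, the filter value must match the stored value exactly, so "Storage" and "storage" are different categories.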

Other Subtleties

Think of a job portal that uses synonyms for job titles. If you are a recruiter looking for a software developer, you will likely also look for software engineers or even architects.

[Figure: developer synonyms]

The scope is always tricky. Look at that synonym graph:

[Figure: cleaner synonyms]

Would you also look for a janitor or housekeeper if you are searching for a room cleaner? It is highly domain-specific and depends on your customer's point of view.

Summary

Enhancing domain-specific search relevancy with synonyms is a powerful technique. This method can increase the precision of your search system. Still, there are many subtleties that you need to know and experiment with, especially in conjunction with relevance. Business is continuously changing. Your customer-specific searches need to adapt too.

About the author: Vinh Nguyên

Loves to code, hike and mostly drink black coffee. Favors Apache Kafka, Elasticsearch, Java Development and 80's music.
