Similarity Search in Vector Space with Elasticsearch

June 17, 2019

With Elasticsearch, we can determine textual similarity. The current default similarity algorithm is Okapi BM25, but there is also built-in support for classic TF/IDF and a simple boolean similarity where the relevance score is based solely on whether the query terms match.

However, what if we wanted to find similar documents based on something more abstract, like the meaning of a word or the style of writing? This is where Elasticsearch's dense_vector field datatype and script_score queries for vector fields come into play.

Indexing Word Embeddings

Word embeddings are vector representations of words and are often used for natural language processing tasks such as text classification or sentiment analysis. Since similar words tend to appear in similar contexts, word embeddings map words that share a context to vectors with similar values. This way, the semantic meaning of a word is preserved to some extent.

To demonstrate the use of vector fields, we imported the pre-trained GloVe word embeddings into Elasticsearch. The glove.6B.50d.txt file maps each of the 400,000 words of the vocabulary to a 50-dimensional vector. An excerpt is shown below.

public 0.034236 0.50591 -0.19488 -0.26424 -0.269 -0.0024169 -0.42642 -0.29695 0.21507 -0.0053071 -0.6861 -0.2125 0.24388 -0.45197 0.072675 -0.12681 -0.36037 0.12668 0.38054 -0.43214 1.1571 0.51524 -0.50795 -0.18806 -0.16628 -2.035 -0.023095 -0.043807 -0.33862 0.22944 3.4413 0.58809 0.15753 -1.7452 -0.81105 0.04273 0.19056 -0.28506 0.13358 -0.094805 -0.17632 0.076961 -0.19293 0.71098 -0.19331 0.019016 -1.2177 0.3962 0.52807 0.33352
early 0.35948 -0.16637 -0.30288 -0.55095 -0.49135 0.048866 -1.6003 0.19451 -0.80288 0.157 0.14782 -0.45813 -0.30852 0.03055 0.38079 0.16768 -0.74477 -0.88759 -1.1255 0.28654 0.37413 -0.053585 0.019005 -0.30474 0.30998 -1.3004 -0.56797 -0.50119 0.031763 0.58832 3.692 -0.56015 -0.043986 -0.4513 0.49902 -0.13698 0.033691 0.40458 -0.16825 0.033614 -0.66019 -0.070503 -0.39145 -0.11031 0.27384 0.25301 0.3471 -0.31089 -0.32557 -0.51921

To import these mappings, we created a words index and specified the dense_vector type for the vector field in the index mapping (the mapping and import loop are sketched after the response below). We then iterated over the lines in the GloVe file and bulk-inserted the words and vectors into that index in small batches. Afterwards, the word "early" could be retrieved with a GET request to /words/_doc/early, for example:

{
  "_index": "words",
  "_type": "_doc",
  "_id": "early",
  "_version": 15,
  "_seq_no": 503319,
  "_primary_term": 2,
  "found": true,
  "_source": {
    "word": "early",
    "vector": [0.35948,-0.16637,-0.30288,-0.55095,-0.49135,0.048866,-1.6003,0.19451,-0.80288,0.157,0.14782,-0.45813,-0.30852,0.03055,0.38079,0.16768,-0.74477,-0.88759,-1.1255,0.28654,0.37413,-0.053585,0.019005,-0.30474,0.30998,-1.3004,-0.56797,-0.50119,0.031763,0.58832,3.692,-0.56015,-0.043986,-0.4513,0.49902,-0.13698,0.033691,0.40458,-0.16825,0.033614,-0.66019,-0.070503,-0.39145,-0.11031,0.27384,0.25301,0.3471,-0.31089,-0.32557,-0.51921]
  }
}
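
The index itself was created beforehand with a PUT request to /words. A minimal mapping sketch is shown below; depending on the Elasticsearch version, the dims parameter may be required and must then match the 50-dimensional GloVe vectors, and the keyword type for the word field is our assumption:

{
  "mappings": {
    "properties": {
      "word": { "type": "keyword" },
      "vector": { "type": "dense_vector", "dims": 50 }
    }
  }
}

The import loop can be condensed into a few lines of Kotlin. The following is a simplified sketch using the Java high-level REST client, not the actual project code:

import org.apache.http.HttpHost
import org.elasticsearch.action.bulk.BulkRequest
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.RequestOptions
import org.elasticsearch.client.RestClient
import org.elasticsearch.client.RestHighLevelClient
import java.io.File

fun main() {
    RestHighLevelClient(RestClient.builder(HttpHost("localhost", 9200, "http"))).use { client ->
        File("glove.6B.50d.txt").useLines { lines ->
            // Insert the words in small batches to keep the bulk requests manageable.
            lines.chunked(500).forEach { batch ->
                val bulk = BulkRequest()
                for (line in batch) {
                    val tokens = line.trim().split(" ")
                    val word = tokens.first()
                    val vector = tokens.drop(1).map { it.toFloat() }
                    // Using the word as document id allows GET /words/_doc/<word> lookups.
                    bulk.add(IndexRequest("words").id(word).source(mapOf("word" to word, "vector" to vector)))
                }
                client.bulk(bulk, RequestOptions.DEFAULT)
            }
        }
    }
}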

Cosine Similarity for Scoring

For the GloVe word embeddings, the cosine similarity between two word vectors can reveal the semantic similarity of the corresponding words. Starting from Elasticsearch 7.2, cosine similarity is available as a predefined function which is usable for document scoring.
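
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths, so it always lies between -1 and 1. A plain Kotlin implementation of the measure (our own sketch, independent of Elasticsearch) looks like this:

fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]      // accumulate the dot product
        normA += a[i] * a[i]    // accumulate the squared length of a
        normB += b[i] * b[i]    // accumulate the squared length of b
    }
    return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
}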

To find a word with a similar representation to [0.1, 0.2, -0.3], we can send a POST request to /words/_search, using the predefined cosineSimilarity function with our query vector and the vector value of the stored document as function arguments to calculate the document score. Note that we need to add 1.0 to the result of the function because the cosine similarity ranges from -1 to 1, while Elasticsearch scores must not be negative.

{
  "size": 1,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['vector'])+1.0",
        "params": {
          "queryVector": [0.1, 0.2, -0.3]  
        }
      }
    }
  }
}

As a result, we get the word "rites" with a score of approximately 1.5:

{
    "took": 103,
    "timed_out": false,
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": 1.5047596,
        "hits": [
            {
                "_index": "words",
                "_type": "_doc",
                "_id": "rites",
                "_score": 1.5047596,
                "_source": {
                    "word": "rites",
                    "vector": [0.82594,0.55036,-2.5851,-0.52783,0.96654,0.55221,0.28173,0.15945,0.33305,0.41263,0.29314,0.1444,1.1311,0.0053411,0.35198,0.094642,-0.89222,-0.85773,0.044799,0.59454,0.26779,0.044897,-0.10393,-0.21768,-0.049958,-0.018437,-0.63575,-0.72981,-0.23786,-0.30149,1.2795,0.22047,-0.55406,0.0038758,-0.055598,0.41379,0.85904,-0.62734,-0.17855,1.7538,-0.78475,-0.52078,-0.88765,1.3897,0.58374,0.16477,-0.15503,-0.11889,-0.66121,-1.108]
                }
            }
        ]
    }
}

To make exploring the word embeddings easier, we built a wrapper in Kotlin using Spring Shell. It accepts a word as input, retrieves the corresponding vector from the index, and then performs a script_score query to display the top results (a condensed sketch of these two requests follows the sample output):

shell:>similar --to "cat"
{"word":"dog","score":1.9218005}
{"word":"rabbit","score":1.8487821}
{"word":"monkey","score":1.8041081}
{"word":"rat","score":1.7891964}
{"word":"cats","score":1.786527}
{"word":"snake","score":1.779891}
{"word":"dogs","score":1.7795815}
{"word":"pet","score":1.7792249}
{"word":"mouse","score":1.7731668}
{"word":"bite","score":1.77288}
{"word":"shark","score":1.7655175}
{"word":"puppy","score":1.76256}
{"word":"monster","score":1.7619764}
{"word":"spider","score":1.7521701}
{"word":"beast","score":1.7520056}
{"word":"crocodile","score":1.7498653}
{"word":"baby","score":1.7463862}
{"word":"pig","score":1.7445586}
{"word":"frog","score":1.7426511}
{"word":"bug","score":1.7365949}

The project code can be found on github.com/schocco/es-vectorsearch.

Limitations

At the time of writing, the dense_vector datatype is marked as experimental, and stored vectors shouldn't exceed 1024 dimensions (500 dimensions for Elasticsearch versions before 7.2).
Document scoring with cosine similarity is relatively expensive, so it should be combined with filters to limit the number of documents for which scores need to be calculated, as in the sketch below. For larger-scale similarity search over dense vectors, you might want to look into more specialized projects like the FAISS library for "billion-scale similarity search with GPUs".
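
For illustration, a filtered script_score query could look like the following sketch, which (hypothetically assuming the word field is mapped as a keyword) only scores words starting with "c":

{
  "size": 1,
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "prefix": { "word": "c" }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['vector'])+1.0",
        "params": {
          "queryVector": [0.1, 0.2, -0.3]
        }
      }
    }
  }
}

Because the inner query runs in filter context, the cosine similarity is only computed for the documents that pass the filter.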

Searching by Abstract Properties

We used word embeddings to demonstrate similarity in vector space with Elasticsearch, but the same concept should apply to other domains. We could map paintings to vector representations that encode their style and then search for paintings drawn in a similar style. If we can map an abstract property like taste or style to vector space, then we could also search for new recipes, clothing, or furniture that shares this abstract property with a given query item.
Thanks to recent advances in machine learning, creating domain-specific embeddings is supported by many easy-to-use open-source libraries. The new script_score query support for vector fields in Elasticsearch is another step towards easily implementable domain-specific similarity search for everyone.

About the author: Rocco Schulz

Rocco loves clean air in the mountains and clean code in projects. He is interested in a multitude of sports activities, machine learning and various software engineering topics.
