Autocompletion for Public Transportation

June 8, 2019

Elastic Stack

In our previous articles, we introduced techniques for creating suggestions for typeahead searches. In this article, we walk through an example of how to apply them to public transportation. Public transportation in Switzerland is extraordinary: it is widespread, and its service and overall reliability are excellent. With it, you can comfortably reach nearly any destination in the country.

As you may know, the SBB Mobile App is one of the most popular smartphone apps in Switzerland. When you plan a trip, you have to enter the destination you wish to travel to. For each input, all public transport stops have to be checked for a match. Below is an example for the central train station of Morat (French) or Murten (German).

SBB Mobile App

Disclaimer: We have no affiliation with the SBB or any other public transportation organisation. This example only shows the capability of Elasticsearch to enhance the search experience in the public transportation sector. All respective applications and their functionality are the responsibility of their owners.

Stats for Nerds

Time spent writing: 1 hour 23 minutes
Estimated reading time: 10 minutes
The most played song during writing: Long Train Runnin' by the Doobie Brothers
Photo by Samuel Zeller on Unsplash

The Data

The Federal Office of Transport is the supervisory authority responsible for public transport in Switzerland (railways, cableways, ships, trams and buses). All public stops are available on opendata.swiss. It is excellent demo data. A big thank you to the e-government that makes this data available.

We don't describe how we transformed and ingested the data into Elasticsearch, since that is beyond the scope of this article. Instead, I give you more valuable information: how to examine the data for later analysis. Overall, it took only 15 to 30 minutes with several methods and tools. I can only relate to the following tweet:

30 Minutes

The link above provides the download for all public transportation stops. The archive contains another archive named Lieferung_HST_20181209_1_CSV.zip, which holds the information we processed, in CSV (comma-separated values) format.

Keep in mind that we are dealing with Swiss data, so the field names are in German. Switzerland has four official languages, of which German, French and Italian are the most common in government. Romansh is the fourth official language, but it is barely used or spoken outside Grisons (Graubünden). I translate the fields of interest for you.

Suggestions Approach

To create excellent suggestions, we provide a recipe or template that you might adopt or use in other projects. Use it at your own risk and responsibility. My motivation is to demonstrate that sometimes it takes a combination of solutions to create appropriate completion proposals.

We use all previously presented methods for a full monty example.

Index Mapping

Step 1: Analyze the data for suitable types.

{  
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 10.6507225,
    "hits" : [
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040616",
        "_score" : 10.6507225,
        "_source" : {
          "x_Koord_Nord" : "200621",
          "@timestamp" : "2019-06-03T19:54:38.230Z",
          "host" : "omega",
          "Nummer" : "8590028",
          "Name" : "Bern, Bierhübeli",
          "TUNummer" : "306",
          "Stand" : "20181209",
          "Verkehrsmittel" : "Bus",
          "Betriebspunkttyp" : "Haltestelle",
          "Hoehe" : "555",
          "@version" : "1",
          "xtf_id" : "ch14uvag00040616",
          "Abkuerzung" : "",
          "rUebergeordneteHaltestelle" : "",
          "DatenherrAbkuerzung" : "",
          "TUAbkuerzung" : "SVB",
          "EndeGueltigkeit" : "",
          "y_Koord_Ost" : "599934",
          "BearbeitungsDatum" : "20181122",
          "GdeNummer" : "351",
          "GdeName" : "Bern",
          "BeginnGueltigkeit" : "20071126"
        }
      }
    ]
  }
}

After a quick scan, we can group the fields by data type:

| Field               | Type            | Description                 |
| ------------------- | --------------- | --------------------------- |
| Nummer              | keyword         | Stop number                 |
| Name                | text, keyword   | Use Edge NGram Tokenizer    |
| Betriebspunkttyp    | keyword         | Type of stop                |
| Verkehrsmittel      | keyword         | Type of public transport    |
| =================== | =============== | =========================== |
| TUNummer            | integer         | Transportation company id   |
| TUAbkuerzung        | keyword         | Transportation company      |
| =================== | =============== | =========================== |
| Stand               | date            | Date format yyyyMMdd        |
| BearbeitungsDatum   | date            | Modification date           |
| BeginnGueltigkeit   | date            | Valid from                  |
| EndeGueltigkeit     | date            | Valid to                    |
| =================== | =============== | =========================== |
| GdeNummer           | integer         | Community number            |
| GdeName             | keyword         | Community name              |
| =================== | =============== | =========================== |
| x_Koord_Nord        | integer         | Geo-data: x on custom map   |
| y_Koord_Ost         | integer         | Geo-data: y on custom map   |
| Hoehe               | integer         | Altitude                    |
| =================== | =============== | =========================== |
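To make the table a bit more tangible, here is a minimal Python sketch that converts a few raw string values from the sample document into the types listed above. The helper name `coerce` and the decision to turn empty strings into `None` are assumptions made for this illustration; they are not the ingest pipeline used for this article. I only treat the coordinate and altitude fields as integers here.

```python
from datetime import date, datetime

# Field groups taken from the table above.
DATE_FIELDS = {"Stand", "BearbeitungsDatum", "BeginnGueltigkeit", "EndeGueltigkeit"}
INT_FIELDS = {"x_Koord_Nord", "y_Koord_Ost", "Hoehe"}


def coerce(field, raw):
    """Convert a raw string into the type suggested by the table.

    Assumption: empty strings become None instead of raising an error.
    """
    if raw == "":
        return None
    if field in DATE_FIELDS:
        # The source uses the compact yyyyMMdd format, e.g. 20181209.
        return datetime.strptime(raw, "%Y%m%d").date()
    if field in INT_FIELDS:
        return int(raw)
    # Everything else stays a string (keyword or text).
    return raw


# A subset of the sample document shown earlier.
sample = {
    "Name": "Bern, Bierhübeli",
    "Stand": "20181209",
    "EndeGueltigkeit": "",
    "Hoehe": "555",
    "x_Koord_Nord": "200621",
}
typed = {field: coerce(field, raw) for field, raw in sample.items()}
print(typed)
```

A check like this is a cheap way to spot values that would not fit the intended type before you settle on a mapping.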

The respective template:

PUT _template/haltestellen
{
  "index_patterns": [
    "haltestellen"
  ],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase",
          "filter": [
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "@version": {
        "type": "keyword"
      },
      "Abkuerzung": {
        "type": "keyword",
        "ignore_above": 256
      },
      "BearbeitungsDatum": {
        "type": "date",
        "format": "yyyyMMdd"
      },
      "BeginnGueltigkeit": {
        "type": "date",
        "format": "yyyyMMdd"
      },
      "Betriebspunkttyp": {
        "type": "keyword"
      },
      "DatenherrAbkuerzung": {
        "type": "keyword",
        "ignore_above": 256
      },
      "EndeGueltigkeit": {
        "type": "date",
        "format": "yyyyMMdd"
      },
      "GdeName": {
        "type": "keyword",
        "ignore_above": 256
      },
      "GdeNummer": {
        "type": "keyword",
        "ignore_above": 256
      },
      "Hoehe": {
        "type": "integer"
      },
      "Name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "name_suggest": {
        "type": "completion"
      },
      "Nummer": {
        "type": "integer"
      },
      "Stand": {
        "type": "date",
        "format": "yyyyMMdd"
      },
      "TUAbkuerzung": {
        "type": "keyword",
        "ignore_above": 256
      },
      "TUNummer": {
        "type": "keyword",
        "ignore_above": 256
      },
      "Verkehrsmittel": {
        "type": "keyword",
        "ignore_above": 256
      },
      "host": {
        "type": "keyword",
        "ignore_above": 256
      },
      "rUebergeordneteHaltestelle": {
        "type": "keyword",
        "ignore_above": 256
      },
      "x_Koord_Nord": {
        "type": "integer"
      },
      "xtf_id": {
        "type": "keyword",
        "ignore_above": 256
      },
      "y_Koord_Ost": {
        "type": "integer"
      }
    }
  }
}

The important parts:

The Name field contains analyzed text and a keyword sub-field.
We have defined custom analyzers for indexing and searching, used in the Name field.
We have a name_suggest completion field for the completion suggester.
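To get a feeling for what the two custom analyzers produce, the following Python sketch emulates them outside of Elasticsearch. It is only an approximation under stated assumptions: `ascii_fold` strips diacritics via Unicode decomposition, which covers ü and è but is not a full reimplementation of the asciifolding filter, and all function names are made up for this illustration.

```python
import unicodedata


def ascii_fold(text):
    """Rough stand-in for the asciifolding filter: strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_tokens(text):
    """Split on every non-letter character, like token_chars: ["letter"]."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def index_terms(text, min_gram=2, max_gram=10):
    """Emulate the autocomplete analyzer: edge n-grams, lowercase, folding."""
    terms = []
    for token in letter_tokens(text):
        # Emit every prefix between min_gram and max_gram characters.
        for size in range(min_gram, min(len(token), max_gram) + 1):
            terms.append(ascii_fold(token[:size].lower()))
    return terms


def search_terms(text):
    """Emulate autocomplete_search: lowercase tokenizer plus folding."""
    return [ascii_fold(token.lower()) for token in letter_tokens(text)]


print(index_terms("Bern, Bierhübeli"))
print(search_terms("Bierhüb"))
```

The index side stores every prefix of each word, while the search side keeps the typed word as a single term. A document matches when the search term equals one of the stored prefixes, which is why the search analyzer must not produce n-grams itself.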

Prefix Query

We can use the Prefix Query. We search with the prefix Mur. The must_not part contains a term query that excludes stops with an empty Verkehrsmittel value, so that we only get stops that serve public transportation and not maintenance. The query runs against Name.keyword, that is, the text that was not analyzed.

GET haltestellen/_search
{
  "size": 3,
  "_source": [
    "Name",
    "Verkehrsmittel"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "prefix": {
            "Name.keyword": {
              "value": "Mur"
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "Verkehrsmittel": ""
          }
        }
      ]
    }
  }
}

We get these results.

{  
  "hits" : {
    "total" : {
      "value" : 76,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00054054",
        "_score" : 1.0,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "Muri AG, Bachmatten-Schulhaus"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00054053",
        "_score" : 1.0,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "Muri AG, im Roos"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00054072",
        "_score" : 1.0,
        "_source" : {
          "Verkehrsmittel" : "Zug",
          "Name" : "Murkart"
        }
      }
    ]
  }
}
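Because the prefix query runs against the keyword sub-field, it compares the raw, unanalyzed value, and is therefore case-sensitive by default. The following Python sketch imitates that behaviour with a plain string comparison. The function `prefix_filter` is a hypothetical helper for illustration, and the names are taken from the responses shown in this article.

```python
# Stop names taken from the responses shown in this article.
names = [
    "Muri AG, Bachmatten-Schulhaus",
    "Muri AG, im Roos",
    "Murkart",
    "La Neuveville",
]


def prefix_filter(values, prefix):
    """Keep values that start with the exact, case-sensitive prefix."""
    return [value for value in values if value.startswith(prefix)]


print(prefix_filter(names, "Mur"))    # the three stops starting with Mur
print(prefix_filter(names, "mur"))    # empty: the keyword keeps its capital M
print(prefix_filter(names, "Neuve"))  # empty: Neuve is not at the start
```

The last two calls show the limits of this approach: lowercase input and words in the middle of a name are not found. That is the gap the full-text search closes.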

Full Text Search

Now we search for a word that is not at the beginning of the name. We could also use the match_phrase query, but since we want to make use of the Edge NGram Tokenizer, we run a match query on the Name field. We search for neuve because we want to visit La Neuveville, a beautiful town famous for its wine.

GET haltestellen/_search
{
  "size": 3,
  "_source": [
    "Name",
    "Verkehrsmittel"
  ],
  "query": {
    "match": {
      "Name": {
        "query": "neuve",
        "operator": "and",
        "fuzziness": 2,
        "max_expansions": 10
      }
    }
  }
}

This gives us:

{
  "hits" : {
    "total" : {
      "value" : 78,
      "relation" : "eq"
    },
    "max_score" : 18.46783,
    "hits" : [
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00050476",
        "_score" : 18.46783,
        "_source" : {
          "Verkehrsmittel" : "Zug",
          "Name" : "La Neuveville"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00050481",
        "_score" : 16.262747,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "La Neuveville, La Main"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00050479",
        "_score" : 16.262747,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "La Neuveville, poste"
        }
      }
    ]
  }
}

Completion Suggester

We add suggestions to two documents (stops) to increase their score.

For the stop Bern, Bierhübeli, we update the existing document and add the suggestion.

POST haltestellen/_update/ch14uvag00040616
{
  "doc": {
    "name_suggest": [
      {
        "input": ["bier","bierhub","bierhubeli"],
        "weight": 20
      }
    ]
  }
}

For the city Biel (German)/Bienne (French) we add these suggestions.

POST haltestellen/_update/ch14uvag00041008
{
  "doc": {
    "name_suggest": [
      {
        "input": ["biel","bienne"],
        "weight": 10
      }
    ]
  }
}

Now we can combine the full-text search with the completion suggestions. We search for the term bier.

POST haltestellen/_search
{
  "_source": [
    "Name",
    "Verkehrsmittel",
    "name_suggest"
  ],
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Name": {
              "query": "bier",
              "fuzziness": 2,
              "max_expansions": 5
            }
          }
        }
      ]
    }
  },
  "suggest": {
    "haltestellen-vorschlag": {
      "prefix": "bier",
      "completion": {
        "field": "name_suggest",
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  }
}

The fuzziness level lets Elasticsearch guess that bier might be a typo for biel. Both suggestions appear in the suggest part of the response.

{
  "hits" : {
    "total" : {
      "value" : 43,
      "relation" : "eq"
    },
    "max_score" : 9.669209,
    "hits" : [
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040926",
        "_score" : 9.669209,
        "_source" : {
          "Verkehrsmittel" : "Zug",
          "Name" : "Bière"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040929",
        "_score" : 8.5746,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "Bière, gare"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040933",
        "_score" : 7.70262,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "Bière, Praz-Béné"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00068565",
        "_score" : 7.450079,
        "_source" : {
          "Verkehrsmittel" : "",
          "Name" : "Bière-Jonction"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040928",
        "_score" : 7.450079,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "Bière, casernes"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040927",
        "_score" : 7.450079,
        "_source" : {
          "Verkehrsmittel" : "",
          "Name" : "Bière-Casernes"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040932",
        "_score" : 7.213572,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "Bière, La Tuilerie"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040931",
        "_score" : 7.213572,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "Bière, La Filature"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00040616",
        "_score" : 7.213572,
        "_source" : {
          "name_suggest" : [
            {
              "input" : [
                "bier",
                "bierhub",
                "bierhubeli"
              ],
              "weight" : 20
            }
          ],
          "Verkehrsmittel" : "Bus",
          "Name" : "Bern, Bierhübeli"
        }
      },
      {
        "_index" : "haltestellen",
        "_type" : "_doc",
        "_id" : "ch14uvag00059019",
        "_score" : 6.782917,
        "_source" : {
          "Verkehrsmittel" : "Bus",
          "Name" : "Schwyz, Bierkeller"
        }
      }
    ]
  },
  "suggest" : {
    "haltestellen-vorschlag" : [
      {
        "text" : "bier",
        "offset" : 0,
        "length" : 4,
        "options" : [
          {
            "text" : "bier",
            "_index" : "haltestellen",
            "_type" : "_doc",
            "_id" : "ch14uvag00040616",
            "_score" : 40.0,
            "_source" : {
              "name_suggest" : [
                {
                  "input" : [
                    "bier",
                    "bierhub",
                    "bierhubeli"
                  ],
                  "weight" : 20
                }
              ],
              "Verkehrsmittel" : "Bus",
              "Name" : "Bern, Bierhübeli"
            }
          },
          {
            "text" : "biel",
            "_index" : "haltestellen",
            "_type" : "_doc",
            "_id" : "ch14uvag00041008",
            "_score" : 20.0,
            "_source" : {
              "name_suggest" : [
                {
                  "input" : [
                    "biel",
                    "bienne"
                  ],
                  "weight" : 10
                }
              ],
              "Verkehrsmittel" : "Bus / Zug",
              "Name" : "Biel/Bienne"
            }
          }
        ]
      }
    ]
  }
}
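Why does biel show up for the input bier? Fuzziness is expressed as an edit distance. The Python sketch below computes the plain Levenshtein distance between two strings. It is a simplification for illustration only: Elasticsearch can additionally count a transposition of two adjacent characters as a single edit, and the completion suggester matches prefixes rather than whole words.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("bier", "bier"))  # 0: exact match
print(levenshtein("bier", "biel"))  # 1: one substitution, within fuzziness 2
```

With a fuzziness of 2, a single substituted character is well within the allowed distance, so the suggestion for Biel/Bienne is returned next to the exact match.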

Summary

We hope you enjoyed this tour of the auto-completion methods. Autocompletion is no trivial task and needs constant tuning. We have only scratched the surface, but this full monty example should have given you insight into the concepts behind the autocomplete functionality of Elasticsearch.