你们好 - Elasticsearch and the Chinese language

December 19, 2019

Today we are looking into Elasticsearch's support for the Chinese language. Chinese is spoken by the ethnic Chinese majority and by many minority ethnic groups in China. About 1.2 billion people (around 16% of the world's population) speak some form of Chinese as their first language. We are an international company, and having customers in Singapore and Hong Kong makes this topic especially interesting for us. Chinese consists of many dialects and, for the most part, two written forms. In the first section, I will clarify which region uses which dialect and written form. After that, we will look at what is supported by Elasticsearch, or rather by Apache Lucene.

This article was written with the assistance of 翁秋君 (Qiujun Weng) as a native-speaker language expert. She was so kind as to provide the examples and to interpret the analysis results.

Chinese: Traditional and Simplified

The Chinese language has two official writing systems: traditional and simplified Chinese.

Simplified Chinese characters are one of two standard sets of Chinese characters of the contemporary Chinese written language. They are based mostly on popular cursive (caoshu) forms embodying graphic or phonetic simplifications of the "traditional" forms that were used in printed text for over a thousand years. The government of the People's Republic of China has promoted them for use in printing in an attempt to increase literacy.

The table below shows which dialect is spoken and which written variant is used in each region.

| Region         | Spoken    | Written     |
| -------------- | --------- | ----------- |
| Mainland China | Mandarin  | Simplified  |
| Singapore      | Mandarin  | Simplified  |
| Hong Kong      | Cantonese | Traditional |
| Macau          | Cantonese | Traditional |
| Taiwan         | Mandarin  | Traditional |

Chinese is one of many languages spoken in Singapore. For Singapore, we received a special customer requirement: although simplified characters are currently used in official documents, the government does not ban the use of traditional characters. Elasticsearch must therefore support both Simplified and Traditional Chinese.

Language Analysis Support of Apache Lucene

For Simplified Chinese, Apache Lucene provides support for Chinese sentence and word segmentation with the HMMChineseTokenizer and the SmartChineseAnalyzer.

Elasticsearch integrates Lucene's Smart Chinese analysis module via the Smart Chinese Analysis plugin.

This analyzer supports Simplified Chinese text and mixed Chinese-English text. Traditional Chinese is not supported.

Simplified Chinese Analysis Example

For this demonstration, I use the most recent Elastic Stack version, 7.4.1. To demonstrate the capability, we need to install the Smart Chinese Analysis plugin, which is not bundled with the default distribution.

In your Elasticsearch installation, you can install it with the following command:

bin/elasticsearch-plugin install analysis-smartcn

A newly installed plugin only becomes usable after the respective node has been restarted.

Analysis of Simplified Chinese text

Once the plugin is installed, you can use it as an analyzer. We analyze the sentence "The computer is new." written in Simplified Chinese.

# The computer is new.
GET _analyze
{
  "text": ["电脑是新的"],
  "analyzer": "smartcn"
}

The tokenization is correct.

{
  "tokens" : [
    {
      "token" : "电脑", // computer
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "是", // is
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "新", // new
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 3
    }
  ]
}

Analysis of Traditional Chinese text

If we analyze the same sentence written in Traditional Chinese with the analyzer for Simplified Chinese,

GET _analyze
{
  "text": ["電腦是新的"],
  "analyzer": "smartcn"
}

We get a different result:

{
  "tokens" : [
    {
      "token" : "電", // computer
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "腦", // computer
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "是", // is
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "新", // new
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 4
    }
  ]
}

Computer written in Traditional Chinese is 電腦. The tokenization is wrong: the single word 電腦 was split into two single-character tokens.

Traditional Chinese Analysis Example

Since Simplified Chinese is predominant and used in official documents in Singapore, I would advise simply staying with Simplified Chinese. However, Traditional Chinese is also allowed. So how could we support Traditional Chinese?

Instead of analyzing Traditional Chinese directly, we can simply convert Traditional into Simplified Chinese. Text analysis for Simplified Chinese works, and we have a decent official Apache Lucene/Elasticsearch analysis plugin for it.
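The core idea can be sketched in a few lines of Python. Note that this is only an illustrative sketch with a tiny hand-picked character mapping covering the examples in this article; real converters such as STConvert or OpenCC ship complete mapping tables, because Traditional-to-Simplified conversion is not always a simple one-to-one character substitution.

```python
# Illustrative sketch: Traditional -> Simplified conversion as a character mapping.
# The mapping below covers only the characters used in this article's examples.
T2S = {
    "電": "电",  # electric (part of 電腦, "computer")
    "腦": "脑",  # brain (part of 電腦, "computer")
    "飯": "饭",  # meal (part of 晚飯, "dinner")
}

def t2s(text: str) -> str:
    """Replace each known Traditional character with its Simplified form."""
    return "".join(T2S.get(ch, ch) for ch in text)

print(t2s("電腦是新的"))           # -> 电脑是新的 ("The computer is new.")
print(t2s("今天 的 晚飯 很 美味"))  # -> 今天 的 晚饭 很 美味
```

Characters without a Traditional/Simplified difference (such as 是 or 的) pass through unchanged, which is exactly what a character-level converter does.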

For the conversion, we can use the STConvert Analysis plugin for Elasticsearch. STConvert is an analysis plugin that converts Chinese characters between Traditional and Simplified. We use the direction Traditional to Simplified (t2s). To use the converter, we need to install it as a plugin. The following example installs the plugin into Elasticsearch.

# download release for elasticsearch
wget https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download/v7.4.1/elasticsearch-analysis-stconvert-7.4.1.zip


# install plugin
bin/elasticsearch-plugin install file:///$(pwd)/elasticsearch-analysis-stconvert-7.4.1.zip


-> Downloading file:////home/john.legend/demos/elasticsearch-7.4.1/elasticsearch-analysis-stconvert-7.4.1.zip
[=================================================] 100%   
-> Installed analysis-stconvert

Restart your Elasticsearch cluster (node) to use the converter.

Create a Test Index

We create a test index that defines a custom analyzer, tokenizer, token filter, and character filter, all named tsconvert. Each converts Traditional into Simplified Chinese (convert_type: t2s).

PUT /stconvert/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tsconvert": {
          "tokenizer": "tsconvert"
        }
      },
      "tokenizer": {
        "tsconvert": {
          "type": "stconvert",
          "delimiter": "#",
          "keep_both": false,
          "convert_type": "t2s"
        }
      },
      "filter": {
        "tsconvert": {
          "type": "stconvert",
          "delimiter": "#",
          "keep_both": false,
          "convert_type": "t2s"
        }
      },
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      }
    }
  }
}

Traditional to Simplified

We convert the sentence.

GET stconvert/_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["tsconvert"],
  "text" : "電腦是新的"
}

and we get the Simplified text version:

{
  "tokens" : [
    {
      "token" : "电脑是新的",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}

We test the converter with another sentence: "Today's dinner is very delicious." The spaces between the words are intentional, so the reader can compare the two variants.

Traditional: 今天 的 晚飯 很 美味

Simplified : 今天 的 晚饭 很 美味

We analyze the text.

GET stconvert/_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["tsconvert"],
  "text" : "今天 的 晚饭 很 美味"
}

We get the Simplified Chinese text.

{
  "tokens" : [
    {
      "token" : "今天 的 晚饭 很 美味",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    }
  ]
}

Summary

We have learned that the Chinese language has two major writing systems: Traditional and Simplified Chinese. Elasticsearch supports Simplified Chinese; supporting Traditional Chinese directly is difficult. The best solution, in my opinion, is to convert Traditional into Simplified Chinese: customers can index Traditional Chinese, Elasticsearch will store Simplified Chinese, and everybody searches against the Simplified Chinese version. Problem solved.
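As a sketch of how the two pieces could be combined (assuming both the analysis-smartcn and analysis-stconvert plugins are installed), a custom analyzer can run the stconvert character filter before the smartcn_tokenizer, so Traditional input is converted to Simplified before word segmentation. The index name articles and the analyzer name chinese_t2s are made up for this example.

```
PUT /articles
{
  "settings": {
    "analysis": {
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "analyzer": {
        "chinese_t2s": {
          "type": "custom",
          "char_filter": ["tsconvert"],
          "tokenizer": "smartcn_tokenizer"
        }
      }
    }
  }
}
```

With such an analyzer, analyzing 電腦是新的 should yield the same tokens as the Simplified Chinese example above.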

What do you think? Are there better solutions? Tell us in a comment. We and our customers are curious.

About the author: Vinh Nguyên

Loves to code, hike and mostly drink black coffee. Favors Apache Kafka, Elasticsearch, Java Development and 80's music.
