Tokenization

This page describes how to add tokenization to tables. Tokenization is necessary to create the tokens that are used in the search index.

Overview

Tokenization is the process of transforming values into tokens. The method you use to tokenize a document determines the types and efficiency of the searches that users can perform on it.

Spanner provides tokenizers for natural language text, substrings, verbatim text, numbers, and booleans. The database schema uses the tokenizer that matches the type of search needed for the column. Tokenizers have the following characteristics:

  • Each tokenizer is a SQL function that takes an input, such as a string or a number, along with named arguments for additional options.
  • The tokenizer outputs a TOKENLIST.

For example, the text string "The quick brown fox jumps over the lazy dog" is tokenized into [the,quick,brown,fox,jumps,over,the,lazy,dog]. The HTML string "The <b>apple</b> is <i>red</i>" is tokenized into [the,apple,is,red].

Tokens have the following characteristics:

  • Tokens are stored in columns that use the TOKENLIST data type.
  • Each token is stored as a sequence of bytes, with an optional set of associated attributes. For example, in full-text applications, a token is typically a single word from a textual document.
  • When tokenizing HTML values, Spanner generates attributes that indicate the prominence of a token within the document. Spanner uses these attributes for scoring to boost more prominent terms (such as terms that appear in a heading).

Tokenizers

Spanner supports the following tokenizer functions:

  • Full-text tokenizer (TOKENIZE_FULLTEXT) produces whole-word tokens for natural language queries.

    Example

    Both of the following functions

    TOKENIZE_FULLTEXT("Yellow apple")
    TOKENIZE_FULLTEXT("Yellow &lt;b>apple&lt;/b>", content_type=>"text/html")
    

    produce the same tokens: [yellow,apple].

  • Substring tokenizer (TOKENIZE_SUBSTRING) generates tokens for each n-gram of each word. It's used to find substrings of words in a text.

    Example

    TOKENIZE_SUBSTRING("hello world", ngram_size_min=>4, ngram_size_max=>6)
    

    Produces the following tokens: [ello,hell,hello,orld,worl,world].

  • N-gram tokenizer (TOKENIZE_NGRAMS) generates n-grams from an input (without splitting it into separate words). It is used to accelerate regular expression predicates.

    Example

    The following function:

    TOKENIZE_NGRAMS("Big Time", ngram_size_min=>4, ngram_size_max=>4)
    

    Produces the following tokens: ["Big ","ig T","g Ti"," Tim", "Time"].

  • Exact match tokenizers (TOKEN and TOKENIZE_BOOL) are used to look up rows containing a certain value in one of their columns. For example, an application that indexes a products catalog might want to search products of a particular brand and color.

    Examples

    The following functions:

    TOKEN("hello")
    TOKEN(["hello", "world"])
    

    Produce the following tokens, respectively: [hello] and [hello,world].

    The following function:

    TOKENIZE_BOOL(true)
    

    Produces the following token: [y].

  • Number tokenizers (TOKENIZE_NUMBER) are used to generate a set of tokens that accelerate numeric comparison searches. For equality conditions, the token is the number itself. For range conditions (like rating >= 3.5), the set of tokens is more elaborate.

    Examples

    The following functions:

    TOKENIZE_NUMBER(42, comparison_type=>"equality")
    TOKENIZE_NUMBER(42, comparison_type=>"all", granularity=>10, min=>1, max=>100)
    

    Produce the following tokens, respectively: "==42" and "==42", "[1,75]","[36, 45]","[36,55]","[36, 75]".

Tokenization functions are usually used in a generated column expression. These columns are defined as HIDDEN so that they aren't included in SELECT * query results.

The following example uses a full-text tokenizer and a numeric tokenizer to create a database that stores music album names and ratings. The DDL statement does two things:

  1. Defines the data columns AlbumTitle and Rating.
  2. Defines AlbumTitle_Tokens and Rating_Tokens. These TOKENLIST columns tokenize the values in the data columns so that Spanner can index them.

    CREATE TABLE Albums (
      AlbumId STRING(MAX) NOT NULL,
      AlbumTitle STRING(MAX),
      Rating FLOAT64,
      AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN,
      Rating_Tokens TOKENLIST AS (TOKENIZE_NUMBER(Rating)) HIDDEN
    ) PRIMARY KEY(AlbumId);
    

Whenever the base values are modified, AlbumTitle_Tokens and Rating_Tokens are automatically updated.
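
For example, a minimal DML sketch against the Albums table above (the row values are illustrative); the TOKENLIST columns are never written directly:

-- The generated TOKENLIST columns aren't listed in DML statements.
INSERT INTO Albums (AlbumId, AlbumTitle, Rating)
VALUES ('a1', 'The Quick Brown Fox', 4.5);

-- Changing AlbumTitle causes AlbumTitle_Tokens to be recomputed automatically.
UPDATE Albums
SET AlbumTitle = 'The Lazy Dog'
WHERE AlbumId = 'a1';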

Tokenize plain text or HTML content

Text tokenization supports plain text and HTML content types. Use the Spanner TOKENIZE_FULLTEXT function to create tokens. Then use the CREATE SEARCH INDEX DDL statement to generate the search index.

For example, the following CREATE TABLE DDL statement uses the TOKENIZE_FULLTEXT function to create tokens from AlbumTitle in the Albums table. The CREATE SEARCH INDEX DDL statement then creates a search index on the new AlbumTitle_Tokens column.

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  AlbumTitle STRING(MAX),
  AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(AlbumTitle_Tokens);
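
A query can then use the SEARCH function over the indexed TOKENLIST column. The following sketch is illustrative (the search string is arbitrary); the FORCE_INDEX hint directs the query to the search index:

-- Returns albums whose title contains both terms (matching is case insensitive).
SELECT AlbumId
FROM Albums@{FORCE_INDEX=AlbumsIndex}
WHERE SEARCH(AlbumTitle_Tokens, 'happy holidays');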

The tokenization process uses the following rules:

  • Tokenization doesn't include stemming or correction of misspelled words. For example, in a sentence like "A cat was looking at a group of cats", the token "cat" is indexed separately from the token "cats". Unlike search engines that normalize tokens during writes, Spanner provides an option to expand the search query to include different forms of words (see the sketch after this list). For more information, see Enhanced query mode.
  • Stopwords (like "a") are included in the search index.
  • Full-text search is always case insensitive. The tokenization process converts all tokens to lowercase.
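
As a sketch of that query expansion, assuming the enhance_query argument of the SEARCH function (described under Enhanced query mode):

-- With enhanced query mode, a query for "cat" can also match rows that
-- only contain "cats". The enhance_query argument is assumed here; see
-- Enhanced query mode for the exact behavior.
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, 'cat', enhance_query=>true);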

The tokenization process tracks the positions for each token in the original text. These positions are later used to match phrases. The positions are stored in the search index alongside docids.
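
For example, a phrase-query sketch (assuming the quoted-phrase syntax of the search query string; the phrase itself is illustrative):

-- Token positions let the index check that "lazy" immediately follows
-- "the" in the original text, not just that both tokens are present.
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, '"the lazy"');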

Google continues to improve tokenization algorithms. In some cases, this might lead to a string being tokenized differently in the future than it is today. We expect such cases to be extremely rare. One example would be an improvement in Chinese, Japanese, and Korean (CJK) language segmentation.

The content_type argument specifies whether the content format uses plain text or HTML. Use the following settings to set the content_type:

  • For text tokenization, set the content_type argument to "text/plain". This is the default setting.
  • For HTML tokenization, set the content_type argument to "text/html". Without this argument, HTML tags are treated as punctuation. In HTML mode, Spanner uses heuristics to infer how prominent the text is on the page, for example, whether the text appears in a heading or how large its font is. The supported attributes for HTML include small, medium, large, title, and link. Like position, the attribute is stored alongside the token in the search index. Tokenization doesn't create tokens for any HTML tags.

Token attributes don't impact matching or the results of the SEARCH or SEARCH_SUBSTRING function. They're only used for ranking.

The following example shows how to tokenize text:

CREATE TABLE T (
  ...
  Text STRING(MAX),
  Html STRING(MAX),
  Text_Tokens TOKENLIST
    AS (TOKENIZE_FULLTEXT(Text, content_type=>"text/plain")) HIDDEN,
  Html_Tokens TOKENLIST
    AS (TOKENIZE_FULLTEXT(Html, content_type=>"text/html")) HIDDEN
) PRIMARY KEY(...);

Language detection refinement with the language_tag argument

By default, tokenization detects the input language automatically. When the input language is known, you can use the language_tag argument to refine this behavior:

AlbumTitle_Tokens TOKENLIST
  AS (TOKENIZE_FULLTEXT(AlbumTitle, language_tag=>"en-us")) HIDDEN

Most applications leave the language_tag argument unspecified and instead rely on automatic language detection. Segmentation for Asian languages like Chinese, Korean, and Japanese doesn't require setting the tokenization language.

The following examples show cases where the language_tag affects tokenization:

  • TOKENIZE_FULLTEXT("A tout pourquoi il y a un parce que") produces [a, tout, pourquoi, il, ya, un, parce, que].
  • TOKENIZE_FULLTEXT("A tout pourquoi il y a un parce que", language_tag=>"fr") produces [a, tout, pourquoi, il, y, a, un, parce, que].
  • TOKENIZE_FULLTEXT("旅 行") produces two tokens: [旅, 行].
  • TOKENIZE_FULLTEXT("旅 行", language_tag=>"zh") produces one token: [旅行].

What's next