
Search Developer's Guide — Chapter 29

Custom Tokenization

Use custom tokenizer overrides on fields to change the classification of characters in content and query text. Character re-classification affects searches because it changes the tokenization of text blocks. The following topics are covered:

  • Introduction to Custom Tokenizer Overrides
  • How Character Classification Affects Tokenization
  • Using xdmp:describe to Explore Tokenization
  • Performance Impact of Using Tokenizer Overrides
  • Defining a Custom Tokenizer Override
  • Examples of Custom Tokenizer Overrides

Introduction to Custom Tokenizer Overrides

A custom tokenizer override enables you to change the tokenizer classification of a character when it occurs within a field. You can use this flexibility to improve search efficiency, enable searches for special character patterns, and normalize data.

You can only define a custom tokenizer override on a field. For details, see Overview of Fields in the Administrator's Guide.

As discussed in Tokenization and Stemming, tokenization breaks text content and query text into word, punctuation, and whitespace tokens. Built-in language specific rules define how to break text into tokens. During tokenization, each character is classified as a word, symbol, punctuation, or space character. Each symbol, punctuation, or space character is one token. Adjacent word characters are grouped together into a single word token. Word and symbol tokens are indexed; space and punctuation tokens are not.

For example, with default tokenization, a text run of the form '456-1111' breaks down into two word tokens and one punctuation token. You can use a query similar to the following one to examine the breakdown:

xquery version "1.0-ml";
xdmp:describe(cts:tokenize("456-1111"));
==> ( cts:word("456"), cts:punctuation("-"), cts:word("1111") )

If you define a custom tokenizer override that classifies hyphen as a character to remove, the tokenizer produces the single word token "4561111". In combination with other database configuration changes, this can enable more accurate wildcard searches or allow searches to match against variable input such as 456-1111 and 4561111. For a full example, see Example: Improving Accuracy of Wildcard-Enabled Searches.

Tokenization rules apply to both content and query text. Since tokenizer overrides can only be applied to fields, you must use field queries such as cts:field-word-query or cts:field-value-query to take advantage of your overrides.

You cannot override a character to its default class. For example, the space character (' ') has class space by default, so you cannot define an override that classifies it as space.

You cannot override a composite character that decomposes into multiple characters when NFD Unicode normalization is applied.
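You can check a character's default classification before defining an override by tokenizing it directly. For example, the following query confirms that the space character already has class space, so an override to space would be rejected:

xquery version "1.0-ml";
xdmp:describe(cts:tokenize(" "))
==> cts:space(" ")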

How Character Classification Affects Tokenization

You can define a custom tokenizer override to classify a character as one of the following categories:

  • space: Treat as whitespace. Whitespace is not indexed.
  • word: Adjacent word characters are grouped into a single token that is indexed.
  • symbol: Each symbol character forms a single token that is indexed.
  • remove: Eliminate the character from consideration when creating tokens.

Note that re-classifying a character such that it is treated as punctuation in query text can cause field word and field value queries to behave as punctuation sensitive. For details, see Wildcarding and Punctuation Sensitivity.

The table below illustrates how each kind of override impacts tokenization:

Char class: space
  Example input: 10x40
  Default tokenization: 10x40 (word)
  Override: admin:database-tokenizer-override("x", "space")
  Result: 10 (word), x (space), 40 (word)

Char class: word
  Example input: 456-1111
  Default tokenization: 456 (word), - (punctuation), 1111 (word)
  Override: admin:database-tokenizer-override("-", "word")
  Result: 456-1111 (word)

Char class: symbol
  Example input: @1.2
  Default tokenization: @ (punctuation), 1.2 (word)
  Override: admin:database-tokenizer-override("@", "symbol")
  Result: @ (symbol), 1.2 (word)

Char class: remove
  Example input: 456-1111
  Default tokenization: 456 (word), - (punctuation), 1111 (word)
  Override: admin:database-tokenizer-override("-", "remove")
  Result: 4561111 (word)

Using xdmp:describe to Explore Tokenization

You can use xdmp:describe with cts:tokenize to explore how the use of fields and tokenizer overrides affects tokenization. Passing the name of a field to cts:tokenize applies the overrides defined on the field to tokenization.

The following example shows the difference between the default tokenization rules (no field), and tokenization using a field named 'dim' that defines tokenizer overrides. For configuration details on the field used in this example, see Example: Searching Within a Word.

xquery version "1.0-ml";
xdmp:describe(cts:tokenize("20x40"));
xdmp:describe(cts:tokenize("20x40",(),"dim"))
==>
cts:word("20x40")
(cts:word("20"), cts:space("x"), cts:word("40"))

You can use this method to test the effect of new overrides.

You can also include a language as the second argument to cts:tokenize, to explore language specific effects on tokenization. For example:

xdmp:describe(cts:tokenize("20x40","en","dim"))

For more details on the interaction between language and tokenization, see Language Support in MarkLogic Server.

Performance Impact of Using Tokenizer Overrides

Using tokenizer overrides can make indexing and searching take longer, so you should only use overrides when truly necessary. For example, if a custom override is in scope for a search, then filtered field queries require retokenization of every text node checked during filtering.

If you have a small number of tokenizer overrides, the impact should be modest.

If you define a custom tokenizer on a field with a very broad scope, expect a larger performance hit. Choosing to re-classify commonly occurring characters like ' ' (space) as a symbol or word character can cause index space requirements to greatly increase.

Defining a Custom Tokenizer Override

You can only define a custom tokenizer override on a field. You can configure a custom tokenizer override in the following ways:

  • Programmatically, using the function admin:database-add-field-tokenizer-override in the Admin API.
  • Interactively, using the Admin Interface. For details, see Configuring Fields in the Administrator's Guide.

Tokenizer overrides added to a field take effect immediately, even if reindexing is disabled. All new queries against the field use the new tokenization, even though existing content remains indexed with the previous tokenization.

For example, assuming the database configuration already includes a field named phone, the following XQuery adds a custom tokenizer override to the field that classifies "-" as a remove character and "@" as a symbol:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" 
  at "/MarkLogic/admin.xqy";

let $dbid := xdmp:database("myDatabase")
let $config :=
  admin:database-add-field-tokenizer-override(
    admin:get-configuration(), $dbid, "phone",
    (admin:database-tokenizer-override("-","remove"),
     admin:database-tokenizer-override("@", "symbol"))
  )
return admin:save-configuration($config)
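After saving the configuration, you can verify the new overrides by passing the field name to cts:tokenize, as described in Using xdmp:describe to Explore Tokenization:

xquery version "1.0-ml";
xdmp:describe(cts:tokenize("456-1111", (), "phone"))
==> cts:word("4561111")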

Examples of Custom Tokenizer Overrides

This section contains the following examples related to using custom tokenizer overrides:

Example: Configuring a Field with Tokenizer Overrides

The following query demonstrates creating a field, field range index, and custom tokenizer overrides using the Admin API. You can also perform these operations using the Admin Interface.

Use this example as a template if you prefer to use XQuery to configure the fields used in the remaining examples in this section. Replace the database name, field name, included element name, and tokenizer overrides with settings appropriate for your use case.

(: Create the field :)
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" 
    at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $dbid := xdmp:database("YourDatabase")
return admin:save-configuration(
  admin:database-add-field(
    $config, $dbid,
    admin:database-field("example", fn:false())
  )
);

(: Configure the included elements and field range index :)
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" 
    at "/MarkLogic/admin.xqy";

let $dbid := xdmp:database("YourDatabase")
let $config := admin:get-configuration()
let $config :=
  admin:database-add-field-included-element(
    $config, $dbid, "example",
    admin:database-included-element(
      "", "your-element", 1.0, "", "", "")
  )
let $config :=
  admin:database-add-range-field-index(
    $config, $dbid,
    admin:database-range-field-index(
      "string", "example", 
      "http://marklogic.com/collation/", fn:false())
  )
return admin:save-configuration($config);

(: Define custom tokenizer overrides :)
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $dbid := xdmp:database("YourDatabase")
return admin:save-configuration(
  admin:database-add-field-tokenizer-override(
    $config, $dbid, "example",
    (admin:database-tokenizer-override("(", "remove"),
     admin:database-tokenizer-override(")", "remove"),
     admin:database-tokenizer-override("-", "remove"))
  )
);
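Once the field and its overrides are saved, you can confirm the effect by tokenizing sample text against the field. With '(', ')', and '-' all classified as remove characters, a phone number collapses to a single word token:

xquery version "1.0-ml";
xdmp:describe(cts:tokenize("(202)456-1111", (), "example"))
==> cts:word("2024561111")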

Example: Improving Accuracy of Wildcard-Enabled Searches

This example demonstrates using custom tokenizers to improve the accuracy and efficiency of unfiltered search on phone numbers when three character searches are enabled on the database to support wildcard searches.

Run the following query in Query Console to load the sample data into the database.

xdmp:document-insert("/contacts/Abe.xml",
  <person>
    <phone>(202)456-1111</phone>
  </person>);
xdmp:document-insert("/contacts/Mary.xml",
  <person>
    <phone>(202)456-1112</phone>
  </person>);
xdmp:document-insert("/contacts/Bob.xml",
  <person>
    <phone>(202)111-4560</phone>
  </person>);
xdmp:document-insert("/contacts/Bubba.xml",
  <person>
    <phone>(456)202-1111</phone>
  </person>)

Use the Admin Interface or a query similar to the following to enable three character searches on the database to support wildcard searches.

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" 
    at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
return admin:save-configuration(
  admin:database-set-three-character-searches($config, 
    xdmp:database("YourDatabase"), fn:true())
)

With three character searches enabled, the following unfiltered search returns false positives because the query text tokenizes into two word tokens, '202' and '456', with the wildcard applying only to the '456' token.

xquery version "1.0-ml";
cts:search(
  fn:doc(),
  cts:word-query("(202)456*"),
  "unfiltered")//phone/text()
==> 
(456)202-1111
(202)456-1111
(202)111-4560
(202)456-1112
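You can see why the match is so broad by tokenizing the query text (without the wildcard) under the default rules; the parentheses and the word boundary they create mean the wildcard applies only to the final word token:

xquery version "1.0-ml";
xdmp:describe(cts:tokenize("(202)456"))
==> ( cts:punctuation("("), cts:word("202"), cts:punctuation(")"), cts:word("456") )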

To improve the accuracy of the search, define a field on the phone element and define tokenizer overrides to eliminate all the punctuation characters from the field values. Use the Admin Interface to define a field with the following characteristics, or modify the query in Example: Configuring a Field with Tokenizer Overrides and run it in Query Console.

Name: phone
Field type: root
Include root: false
Included elements: phone (no namespace)
Range field index:
  scalar type: string
  field name: phone
  collation: http://marklogic.com/collation/ (default)
  range value positions: false (default)
  invalid values: reject (default)
Tokenizer overrides:
  remove ( (left parenthesis)
  remove ) (right parenthesis)
  remove - (hyphen)

This field definition causes the phone numbers to be indexed as single word tokens such as '4562021111' and '2021114560'. If you perform the following search, the false positives are eliminated:

xquery version "1.0-ml";
cts:search(fn:doc(), 
  cts:field-word-query("phone", "(202)456*"),
  "unfiltered")//phone/text()
==> 
(202)456-1111
(202)456-1112

If the field definition did not include the tokenizer overrides, the field word query would include the same false positives as the initial word query.

Example: Data Normalization

In Example: Improving Accuracy of Wildcard-Enabled Searches, custom tokenizer overrides are used to remove punctuation from phone numbers of the form (202)456-1111. The overrides provide the additional benefit of normalizing query text because the tokenizer overrides apply to query text as well as content.

If you define '(', ')', '-', and ' ' (space) as remove characters, then a phone number such as (202)456-1111 is indexed as the single word 2024561111, and all the following query text examples will match exactly in an unfiltered search:

  • (202)456-1111
  • 202-456-1111
  • 202 456-1111
  • 2024561111

The same overrides also normalize indexing during ingestion: If the input data contains all of the above patterns in the phone element, the data is normalized to a single word token for indexing in all cases.
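To see the normalization in action, tokenize each input variant against the field. This sketch assumes a field named phone configured as in the wildcard example, extended with an additional remove override for ' ' (space):

xquery version "1.0-ml";
for $input in ("(202)456-1111", "202-456-1111", "202 456-1111", "2024561111")
return xdmp:describe(cts:tokenize($input, (), "phone"))

With those four remove overrides in place, each call returns the same single token, cts:word("2024561111").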

For sample input and details on configuring an applicable field, see Example: Improving Accuracy of Wildcard-Enabled Searches.

Example: Searching Within a Word

This example demonstrates using custom tokenizer overrides to create multiple tokens out of what would otherwise be considered a single word. This makes it possible to search successfully for a portion of the word.

Suppose you have input documents that include a dimensions element of the form MxN, where M and N are the number of feet. For example, '10x4' is the measurement of an area that is 10 feet by 4 feet. You cannot search for 'all documents that include at least one dimension of 10 feet' because 10x4 tokenizes as a single word.

To demonstrate, run the following query in Query Console to load the sample documents:

xquery version "1.0-ml";
xdmp:document-insert("/plots/plot1.xml",
  <plot>
    <dimensions>10x4</dimensions>
  </plot>);
xdmp:document-insert("/plots/plot2.xml",
  <plot>
    <dimensions>25x10</dimensions>
  </plot>);
xdmp:document-insert("/plots/plot3.xml",
  <plot>
     <dimensions>5x4</dimensions>
  </plot>)

Next, run the following word query against the database and observe that there are no matches:

xquery version "1.0-ml";
cts:search(fn:doc(), cts:word-query("10"),"unfiltered")

Use the Admin Interface to define a field with the following characteristics, or modify the query in Example: Configuring a Field with Tokenizer Overrides and run it in Query Console.

Name: dim
Field type: root
Include root: false
Included elements: dimensions (no namespace)
Range field index:
  scalar type: string
  field name: dim
  collation: http://marklogic.com/collation/ (default)
  range value positions: false (default)
  invalid values: reject (default)
Tokenizer overrides:
  space x (classify 'x' as a space character)

The field divides each dimension text node into two tokens, split at 'x'. Therefore, the following field word query now finds matches in the example documents:

xquery version "1.0-ml";
cts:search(fn:doc(), cts:field-word-query("dim", "10"),"unfiltered")

==>
<plot>
  <dimensions>25x10</dimensions>
</plot>
<plot>
  <dimensions>10x4</dimensions>
</plot>

Example: Using the Symbol Classification

This example demonstrates the value of classifying some characters as symbol. Suppose you are working with Twitter data, where the appearance of @word in Tweet text represents a user and #word represents a topic identifier ('hash tag'). For this example, we want the following search semantics to apply:

  • If you search for a naked term, such as NASA, the search should match occurrences of the naked term ('NASA') or topics ('#NASA'), but not users ('@NASA').
  • If you search for a user (@NASA), the search should only match users, not naked terms or topics.
  • If you search for a topic (#NASA), the search should only match topics, not naked terms or users.

The following table summarizes the desired search results:

Query text: NASA
  Should match: NASA, #NASA
  Should not match: @NASA

Query text: @NASA
  Should match: @NASA
  Should not match: NASA, #NASA

Query text: #NASA
  Should match: #NASA
  Should not match: NASA, @NASA

If you do not define any token overrides, then the terms NASA, @NASA, and #NASA tokenize as follows:

  • NASA: cts:word("NASA")
  • @NASA: cts:punctuation("@"), cts:word("NASA")
  • #NASA: cts:punctuation("#"), cts:word("NASA")

Assuming a punctuation-insensitive search, this means all three query strings devolve to searching for just NASA.

If you define a tokenizer override for @ that classifies it as a word character, then @NASA tokenizes as a single word and will not match naked terms or topics. That is, @NASA tokenizes as:

cts:word("@NASA")

However, classifying # as a word character does not have the desired effect. It causes the query text #NASA to match topics, as intended, but it also excludes matches for naked terms. The solution is to classify # as a symbol. Doing so causes the following tokenization to occur:

cts:word("#"),cts:word("NASA")

Now, searching for #NASA matches adjacent occurrences of # and NASA, as in a topic, and searching for just NASA matches both topics and naked terms. Users (@NASA) continue to be excluded because of the tokenizer override for @.
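You can confirm this tokenization with cts:tokenize. The field name tweet below is hypothetical, standing in for a field configured with the two overrides just described ('@' as word, '#' as symbol):

xquery version "1.0-ml";
xdmp:describe(cts:tokenize("@NASA", (), "tweet")),
xdmp:describe(cts:tokenize("#NASA", (), "tweet"))
==>
cts:word("@NASA")
(cts:word("#"), cts:word("NASA"))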
