Use custom tokenizer overrides on fields to change the classification of characters in content and query text. Character re-classification affects searches because it changes how text is tokenized.
A custom tokenizer override enables you to change the tokenizer classification of a character when it occurs within a field. You can use this flexibility to improve search efficiency, enable searches for special character patterns, and normalize data.
You can only define a custom tokenizer override on a field. For details, see Overview of Fields in the Administrator's Guide.
As discussed in Tokenization and Stemming, tokenization breaks text content and query text into word, punctuation, and whitespace tokens. Built-in language-specific rules define how to break text into tokens. During tokenization, each character is classified as a word, symbol, punctuation, or space character. Each symbol, punctuation, or space character is one token. Adjacent word characters are grouped together into a single word token. Word and symbol tokens are indexed; space and punctuation tokens are not.
For example, with default tokenization, a text run of the form '456-1111' breaks down into two word tokens and one punctuation token. You can use a query similar to the following one to examine the breakdown:
```
xquery version "1.0-ml";
xdmp:describe(cts:tokenize("456-1111"))
==> (cts:word("456"), cts:punctuation("-"), cts:word("1111"))
```
If you define a custom tokenizer override that classifies hyphen as a character to remove, the tokenizer produces the single word token "4561111". In combination with other database configuration changes, this can enable more accurate wildcard searches or allow searches to match against variable input such as 456-1111 and 4561111. For a full example, see Example: Improving Accuracy of Wildcard-Enabled Searches.
Tokenization rules apply to both content and query text. Since tokenizer overrides can only be applied to fields, you must use field queries such as cts:field-word-query or cts:field-value-query to take advantage of your overrides.
You cannot override a character to its default class. For example, the space character (' ') has class space by default, so you cannot define an override that classifies it as space.
You cannot override a composite character that decomposes into multiple characters when NFD Unicode normalization is applied.
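Before defining an override, you can confirm a character's default classification by tokenizing a small sample with cts:tokenize and inspecting the token constructors in the output. The following sketch shows that '-' is classified as punctuation by default, so overriding it to another class is permitted:

```
xquery version "1.0-ml";
(: The token constructor in the output reveals the default class. :)
xdmp:describe(cts:tokenize("a-b"))
(: ==> (cts:word("a"), cts:punctuation("-"), cts:word("b")) :)
```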
You can define a custom tokenizer override to classify a character as one of the following categories:

- space: Treat as whitespace. Whitespace is not indexed.
- word: Adjacent word characters are grouped into a single token that is indexed.
- symbol: Each symbol character forms a single token that is indexed.
- remove: Eliminate the character from consideration when creating tokens.

Note that re-classifying a character such that it is treated as punctuation in query text can trigger punctuation-sensitive field word and field value queries. For details, see Wildcarding and Punctuation Sensitivity.
The table below illustrates how each kind of override impacts tokenization:
| Char Class | Example Input | Default Tokenization | Override | Result |
|---|---|---|---|---|
| space | 10x40 | 10x40 (word) | admin:database-tokenizer-override("x","space") | 10 (word), x (space), 40 (word) |
| word | 456-1111 | 456 (word), - (punc.), 1111 (word) | admin:database-tokenizer-override("-","word") | 456-1111 (word) |
| symbol | @1.2 | @ (punc.), 1.2 (word) | admin:database-tokenizer-override("@","symbol") | @ (symbol), 1.2 (word) |
| remove | 456-1111 | 456 (word), - (punc.), 1111 (word) | admin:database-tokenizer-override("-","remove") | 4561111 (word) |
You can use xdmp:describe with cts:tokenize to explore how the use of fields and tokenizer overrides affects tokenization. Passing the name of a field to cts:tokenize applies the overrides defined on the field to tokenization.
The following example shows the difference between the default tokenization rules (no field), and tokenization using a field named 'dim' that defines tokenizer overrides. For configuration details on the field used in this example, see Example: Searching Within a Word.
```
xquery version "1.0-ml";
xdmp:describe(cts:tokenize("20x40"));
xdmp:describe(cts:tokenize("20x40", (), "dim"))
==>
cts:word("20x40")
(cts:word("20"), cts:space("x"), cts:word("40"))
```
You can use this method to test the effect of new overrides.
You can also pass a language as the second argument to cts:tokenize to explore language-specific effects on tokenization. For example:

```
xdmp:describe(cts:tokenize("20x40", "en", "dim"))
```
For more details on the interaction between language and tokenization, see Language Support in MarkLogic Server.
Using tokenizer overrides can make indexing and searching take longer, so you should only use overrides when truly necessary. For example, if a custom override is in scope for a search, then filtered field queries require retokenization of every text node checked during filtering.
If you have a small number of tokenizer overrides, the impact should be modest.
If you define custom tokenizer overrides on a field with a very broad scope, expect a larger performance impact. Re-classifying commonly occurring characters such as ' ' (space) as symbol or word characters can greatly increase index space requirements.
You can only define a custom tokenizer override on a field. You can configure a custom tokenizer override using either the Admin Interface or the Admin API.
Even if reindexing is disabled, tokenizer overrides added to a field take effect immediately, so all new queries against the field use the new tokenization, even if existing content was indexed with the previous tokenization.
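If you want existing content to be reindexed with the new tokenization, make sure the reindexer is enabled on the database. A minimal sketch, assuming a database named 'myDatabase':

```
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Enable the reindexer so existing content picks up the new tokenization :)
let $config := admin:get-configuration()
return admin:save-configuration(
  admin:database-set-reindexer-enable(
    $config, xdmp:database("myDatabase"), fn:true()))
```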
For example, assuming the database configuration already includes a field named phone, the following XQuery adds a custom tokenizer override to the field that classifies "-" as a remove character and "@" as a symbol:
```
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $dbid := xdmp:database("myDatabase")
let $config := admin:database-add-field-tokenizer-override(
  admin:get-configuration(), $dbid, "phone",
  (admin:database-tokenizer-override("-", "remove"),
   admin:database-tokenizer-override("@", "symbol")))
return admin:save-configuration($config)
```
The following examples demonstrate uses of custom tokenizer overrides.
The following query demonstrates creating a field, field range index, and custom tokenizer overrides using the Admin API. You can also perform these operations using the Admin Interface.
Use this example as a template if you prefer to use XQuery to configure the fields used in the remaining examples in this section. Replace the database name, field name, included element name, and tokenizer overrides with settings appropriate for your use case.
```
(: Create the field :)
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $dbid := xdmp:database("YourDatabase")
return admin:save-configuration(
  admin:database-add-field(
    $config, $dbid, admin:database-field("example", fn:false())));

(: Configure the included elements and field range index :)
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $dbid := xdmp:database("YourDatabase")
let $config := admin:get-configuration()
let $config := admin:database-add-field-included-element(
  $config, $dbid, "example",
  admin:database-included-element(
    "", "your-element", 1.0, "", "", ""))
let $config := admin:database-add-range-field-index(
  $config, $dbid,
  admin:database-range-field-index(
    "string", "example", "http://marklogic.com/collation/", fn:false()))
return admin:save-configuration($config);

(: Define custom tokenizer overrides :)
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $dbid := xdmp:database("YourDatabase")
return admin:save-configuration(
  admin:database-add-field-tokenizer-override(
    $config, $dbid, "example",
    (admin:database-tokenizer-override("(", "remove"),
     admin:database-tokenizer-override(")", "remove"),
     admin:database-tokenizer-override("-", "remove"))));
```
This example demonstrates using custom tokenizers to improve the accuracy and efficiency of unfiltered search on phone numbers when three character searches are enabled on the database to support wildcard searches.
Run the following query in Query Console to load the sample data into the database.
```
xdmp:document-insert("/contacts/Abe.xml",
  <person>
    <phone>(202)456-1111</phone>
  </person>);
xdmp:document-insert("/contacts/Mary.xml",
  <person>
    <phone>(202)456-1112</phone>
  </person>);
xdmp:document-insert("/contacts/Bob.xml",
  <person>
    <phone>(202)111-4560</phone>
  </person>);
xdmp:document-insert("/contacts/Bubba.xml",
  <person>
    <phone>(456)202-1111</phone>
  </person>)
```
Use the Admin Interface or a query similar to the following to enable three character searches on the database to support wildcard searches.
```
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
return admin:save-configuration(
  admin:database-set-three-character-searches(
    $config, xdmp:database("YourDatabase"), fn:true()))
```
With three character searches enabled, the following unfiltered search returns false positives because the query text tokenizes into two word tokens, '202' and '456', with the wildcard applying only to the '456' token.
```
xquery version "1.0-ml";
cts:search(
  fn:doc(),
  cts:word-query("(202)456*"),
  "unfiltered")//phone/text()
==>
(456)202-1111
(202)456-1111
(202)111-4560
(202)456-1112
```
To improve the accuracy of the search, define a field on the phone element and define tokenizer overrides to eliminate all the punctuation characters from the field values. Use the Admin Interface to define a field with the following characteristics, or modify the query in Example: Configuring a Field with Tokenizer Overrides and run it in Query Console.
| Field Property | Setting |
|---|---|
| Name | phone |
| Field type | root |
| Include root | false |
| Included elements | phone (no namespace) |
| Range field index | string; collation: http://marklogic.com/collation/ |
| Tokenizer overrides | "(" as remove, ")" as remove, "-" as remove |
This field definition causes the phone numbers to be indexed as single word tokens such as '4562021111' and '2021114560'. If you perform the following search, the false positives are eliminated:
```
xquery version "1.0-ml";
cts:search(
  fn:doc(),
  cts:field-word-query("phone", "(202)456*"),
  "unfiltered")//phone/text()
==>
(202)456-1111
(202)456-1112
```
If the field definition did not include the tokenizer overrides, the field word query would include the same false positives as the initial word query.
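You can verify how the field overrides change the tokenization of the query text itself by tokenizing it with and without the field, using the cts:tokenize technique shown earlier. This sketch assumes the phone field configured above:

```
xquery version "1.0-ml";
(: Default tokenization vs. tokenization with the "phone" field's overrides.
   With the overrides, the punctuation is removed before tokens are formed. :)
xdmp:describe(cts:tokenize("(202)456*"));
xdmp:describe(cts:tokenize("(202)456*", (), "phone"))
```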
In Example: Improving Accuracy of Wildcard-Enabled Searches, custom tokenizer overrides are used to remove punctuation from phone numbers of the form (202)456-1111. The overrides provide the additional benefit of normalizing query text because the tokenizer overrides apply to query text as well as content.
If you define '(', ')', '-', and ' ' (space) as remove characters, then a phone number such as (202)456-1111 is indexed as the single word 2024561111, and query text in any of the following forms will match exactly in an unfiltered search:

- (202)456-1111
- 202-456-1111
- 202 456 1111
- 2024561111

The same overrides also normalize indexing during ingestion: if the input data contains all of the above patterns in the phone element, the data is normalized to a single word token for indexing in all cases.
For sample input and details on configuring an applicable field, see Example: Improving Accuracy of Wildcard-Enabled Searches.
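Assuming a phone field configured with all four remove characters described above (the base example removes only '(', ')', and '-'; adding ' ' as a remove character is an extra step), you can confirm the normalization by tokenizing each variant against the field; each should produce the same single word token:

```
xquery version "1.0-ml";
(: Each variant should tokenize to the single word "2024561111"
   when '(' ')' '-' and ' ' are all remove characters on the field. :)
for $variant in ("(202)456-1111", "202-456-1111", "202 456 1111", "2024561111")
return xdmp:describe(cts:tokenize($variant, (), "phone"))
```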
This example demonstrates using custom tokenizer overrides to create multiple tokens out of what would otherwise be considered a single word. This makes it possible to search successfully for a portion of the word.
Suppose you have input documents that include a dimensions element of the form MxN, where M and N are numbers of feet. For example, '10x4' is the measurement of an area that is 10 feet by 4 feet. You cannot search for 'all documents that include at least one dimension of 10 feet' because 10x4 tokenizes as a single word.
To demonstrate, run the following query in Query Console to load the sample documents:
```
xquery version "1.0-ml";
xdmp:document-insert("/plots/plot1.xml",
  <plot>
    <dimensions>10x4</dimensions>
  </plot>);
xdmp:document-insert("/plots/plot2.xml",
  <plot>
    <dimensions>25x10</dimensions>
  </plot>);
xdmp:document-insert("/plots/plot3.xml",
  <plot>
    <dimensions>5x4</dimensions>
  </plot>)
```
Next, run the following word query against the database and observe that there are no matches:
```
xquery version "1.0-ml";
cts:search(fn:doc(), cts:word-query("10"), "unfiltered")
```
Use the Admin Interface to define a field with the following characteristics, or modify the query in Example: Configuring a Field with Tokenizer Overrides and run it in Query Console.

| Field Property | Setting |
|---|---|
| Name | dim |
| Field type | root |
| Include root | false |
| Included elements | dimensions (no namespace) |
| Tokenizer overrides | "x" as space |
The field divides each dimensions text node into two tokens, split at 'x'. Therefore, the following field word query now finds matches in the example documents:
```
xquery version "1.0-ml";
cts:search(fn:doc(), cts:field-word-query("dim", "10"), "unfiltered")
==>
<plot>
  <dimensions>25x10</dimensions>
</plot>
<plot>
  <dimensions>10x4</dimensions>
</plot>
```
This example demonstrates the value of classifying some characters as symbol. Suppose you are working with Twitter data, where the appearance of @word in Tweet text represents a user and #word represents a topic identifier ('hash tag'). For this example, we want the following search semantics to apply:

- If the query text is a naked term (NASA), the search should match occurrences of the naked term ('NASA') or topics ('#NASA'), but not users ('@NASA').
- If the query text is a user (@NASA), the search should only match users, not naked terms or topics.
- If the query text is a topic (#NASA), the search should only match topics, not naked terms or users.

The following table summarizes the desired search results:
| Query Text | Should Match | Should Not Match |
|---|---|---|
| NASA | NASA, #NASA | @NASA |
| @NASA | @NASA | NASA, #NASA |
| #NASA | #NASA | NASA, @NASA |
If you do not define any tokenizer overrides, then the terms NASA, @NASA, and #NASA tokenize as follows:

- NASA: cts:word("NASA")
- @NASA: cts:punctuation("@"), cts:word("NASA")
- #NASA: cts:punctuation("#"), cts:word("NASA")

Assuming a punctuation-insensitive search, this means all three query strings devolve to searching for just NASA.
If you define a tokenizer override for @ that classifies it as a word character, then @NASA tokenizes as a single word and will not match naked terms or topics. That is, @NASA tokenizes as: cts:word("@NASA")
However, classifying # as a word character does not have the desired effect. It causes the query text #NASA to match topics, as intended, but it also excludes matches for naked terms. The solution is to classify # as a symbol. Doing so causes the following tokenization to occur: cts:word("#"), cts:word("NASA")
Now, searching for #NASA matches adjacent occurrences of # and NASA, as in a topic, and searching for just NASA matches both topics and naked terms. Users (@NASA) continue to be excluded because of the tokenizer override for @.
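Putting this together, a minimal sketch of the corresponding override configuration follows the same pattern as the earlier phone field example. The database name 'myDatabase' and field name 'tweet' are assumptions; substitute names from your own configuration:

```
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumes a field named "tweet" already exists on the database. :)
let $dbid := xdmp:database("myDatabase")
let $config := admin:database-add-field-tokenizer-override(
  admin:get-configuration(), $dbid, "tweet",
  (admin:database-tokenizer-override("@", "word"),
   admin:database-tokenizer-override("#", "symbol")))
return admin:save-configuration($config)
```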