MarkLogic 10 Product Documentation
cts:tokenize

cts:tokenize(
   $text as xs:string,
   [$language as xs:string?],
   [$field as xs:string?]
) as cts:token*

Summary

Tokenizes text into words, punctuation, and spaces. Returns a sequence of cts:token values. The cts:token type has the subtypes cts:word, cts:punctuation, and cts:space, all of which are subtypes of xs:string.

Parameters
text: A word or phrase to tokenize.
language: A language to use for tokenization. If not supplied, the database default language is used.
field: A field to use for tokenization. If the field has custom tokenization rules, they are used. If no field is supplied, or the field has no custom tokenization rules, the default tokenization rules are used.
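
As a rough sketch of how the optional parameters might be passed, the first call below names a language explicitly and the second names a field. The sample strings are illustrative, and the field name "part-number" is hypothetical: it assumes that a field of that name, with custom tokenization rules, has already been configured on the database.

```xquery
xquery version "1.0-ml";

(: Tokenize with an explicit language code instead of the
   database default language. :)
xdmp:describe(cts:tokenize("Le ballon rouge", "fr"), (), ()),

(: Tokenize with the rules of a field. "part-number" is a
   hypothetical field name and is assumed to exist on the database. :)
xdmp:describe(cts:tokenize("AB-1234", "en", "part-number"), (), ())
```

No output is shown because the tokens returned by the second call depend on how the field is configured.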

Usage Notes

When you tokenize a string with cts:tokenize, each word is represented by an instance of cts:word, each punctuation character by an instance of cts:punctuation, and each run of adjacent spaces or adjacent line breaks by an instance of cts:space.

Unlike the standard XQuery function fn:tokenize, cts:tokenize returns words, punctuation, and spaces as different types. You can therefore use a typeswitch to handle each type differently. For example, you can use cts:tokenize to remove all punctuation from a string, or test the type of each token and return a different result for each type, as shown in the first two examples below.
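
Because each token carries its type, a type test can also filter the token sequence directly. The following is a minimal sketch that counts the words in a string by keeping only the cts:word tokens; the variable names are illustrative.

```xquery
xquery version "1.0-ml";

let $string := "The red, blue, green, and orange balloons were launched!"
(: Keep only the tokens whose type is cts:word. :)
let $words := cts:tokenize($string)[. instance of cts:word]
return fn:count($words)
```

Under the default English tokenization rules, the nine words in this string should give a count of 9.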

You can use xdmp:describe to show how a given string will be tokenized. When run on the results of cts:tokenize, the xdmp:describe function returns the types and the values for each token. For a sample of this pattern, see the third example below.

Example

(: Remove all punctuation :)
let $string := "The red, blue, green, and orange balloons were launched!"
let $noPunctuation :=
  for $token in cts:tokenize($string)
  return
    typeswitch ($token)
     case $token as cts:punctuation return ""
     case $token as cts:word return $token
     case $token as cts:space return $token
     default return ()
return string-join($noPunctuation, "")
  
=> The red blue green and orange balloons were launched

Example

(: Insert the string "XX" before and after
   all punctuation tokens :)
let $string := "The red, blue, green, and orange
                 balloons were launched!"
let $tokens := cts:tokenize($string)
return string-join(
  for $x in $tokens
  return
    if ($x instance of cts:punctuation)
    then concat("XX", $x, "XX")
    else $x,
  "")

=> The redXX,XX blueXX,XX greenXX,XX and orange
    balloons were launchedXX!XX

Example

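(: Show the type and value of every token in a string. The two
   empty-sequence arguments lift the default limits that xdmp:describe
   places on sequence length and item length. :)
xdmp:describe(cts:tokenize("blue, green"), (), ())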

=> (cts:word("blue"), cts:punctuation(","),
    cts:space(" "), cts:word("green"))