
cts:tokenize( $text as xs:string, [$language as xs:string?], [$field as xs:string?] ) as cts:token*
  Tokenizes text into words, punctuation, and spaces. Returns a sequence
  of cts:token items. The cts:token type has the subtypes cts:word,
  cts:punctuation, and cts:space, all of which are subtypes of xs:string.
  When you tokenize a string with cts:tokenize, each word is represented
  by an instance of cts:word, each punctuation character by an instance
  of cts:punctuation, each run of adjacent spaces by an instance of
  cts:space, and each run of adjacent line breaks likewise by an instance
  of cts:space.
   Unlike the standard XQuery function fn:tokenize,
   cts:tokenize returns words, punctuation, and spaces
   as different types. You can therefore use a typeswitch to handle each type
   differently. For example, you can use cts:tokenize to remove
   all punctuation from a string, or write logic that tests the type of
   each token and returns a different result for each type, as shown in
   the first two examples below.
   You can use xdmp:describe to show how a given string will be
   tokenized. When run on the results of cts:tokenize, the
   xdmp:describe function returns the types and the values
   for each token. For a sample of this pattern, see the third example below.
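Because the token subtypes are distinct types, a predicate with instance of can also select tokens of a single type without a typeswitch. The following is a minimal sketch of that idea, not part of the original examples; the sample string and the variable names are illustrative only.

```xquery
xquery version "1.0-ml";
(: Sketch: count only the word tokens in a string.
   The predicate keeps tokens whose type is cts:word and
   drops cts:punctuation and cts:space tokens. :)
let $string := "The red, blue, green, and orange balloons were launched!"
let $words := cts:tokenize($string)[. instance of cts:word]
return count($words)

(: => 9 :)
```

The same predicate form works for cts:punctuation or cts:space if you need those tokens instead.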
(: Remove all punctuation :)
let $string := "The red, blue, green, and orange
                balloons were launched!"
let $noPunctuation :=
  for $token in cts:tokenize($string)
  return
    typeswitch ($token)
     case $token as cts:punctuation return ""
     case $token as cts:word return $token
     case $token as cts:space return $token
     default return ()
return string-join($noPunctuation, "")
  
 => The red blue green and orange
    balloons were launched
(: Insert the string "XX" before and after
   all punctuation tokens :)
let $string := "The red, blue, green, and orange
                 balloons were launched!"
let $tokens := cts:tokenize($string)
return string-join(
  for $x in $tokens
  return
    if ($x instance of cts:punctuation)
    then concat("XX", $x, "XX")
    else $x,
  "")
=> The redXX,XX blueXX,XX greenXX,XX and orange
    balloons were launchedXX!XX
(: show the types and tokens for a string :)
xdmp:describe(cts:tokenize("blue, green"))
=> (cts:word("blue"), cts:punctuation(","),
    cts:space(" "), cts:word("green"))
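The signature also accepts an optional $language argument, which none of the examples above use. The following is a hedged sketch of passing a language code explicitly; the French sample sentence and the "fr" code are assumptions chosen for illustration, and the resulting tokens depend on how your MarkLogic installation handles that language, so no output is shown.

```xquery
xquery version "1.0-ml";
(: Sketch: tokenize with an explicit language code as the second
   argument. The sentence and the "fr" code are illustrative
   assumptions, not taken from the examples above. :)
let $tokens := cts:tokenize("Le ballon rouge est grand.", "fr")
(: Passing empty sequences for the two optional length limits asks
   xdmp:describe not to truncate the printed sequence or its items. :)
return xdmp:describe($tokens, (), ())
```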