
cts:tokenize( $text as xs:string, [$language as xs:string?], [$field as xs:string?] ) as cts:token*
Tokenizes text into words, punctuation, and spaces. Returns a sequence of
cts:token items. The cts:token type has the subtypes
cts:word, cts:punctuation, and
cts:space, all of which are subtypes of
xs:string.
When you tokenize a string with cts:tokenize, each word is
represented by an instance of
cts:word and each punctuation character
is represented by an instance of cts:punctuation.
Each set of adjacent spaces is represented by an instance of
cts:space, as is each set of adjacent line breaks.
Unlike the standard XQuery function fn:tokenize,
cts:tokenize returns words, punctuation, and spaces
as different types. You can therefore use a typeswitch to handle each type
differently. For example, you can use cts:tokenize to remove
all punctuation from a string, or write logic that tests the type of each
token and returns a different result for each type, as shown in the first
two examples below.
You can use xdmp:describe to show how a given string will be
tokenized. When run on the results of cts:tokenize, the
xdmp:describe function returns the type and the value
of each token. For a sample of this pattern, see the third example below.
(: Remove all punctuation :)
let $string := "The red, blue, green, and orange
balloons were launched!"
let $noPunctuation :=
  for $token in cts:tokenize($string)
  return
    typeswitch ($token)
      case $token as cts:punctuation return ""
      case $token as cts:word return $token
      case $token as cts:space return $token
      default return ()
return string-join($noPunctuation, "")
=> The red blue green and orange
balloons were launched
(: Insert the string "XX" before and after
   all punctuation tokens :)
let $string := "The red, blue, green, and orange
balloons were launched!"
let $tokens := cts:tokenize($string)
return string-join(
  for $x in $tokens
  return
    if ($x instance of cts:punctuation)
    then concat("XX", $x, "XX")
    else $x,
  "")
=> The redXX,XX blueXX,XX greenXX,XX and orange
balloons were launchedXX!XX
(: Show the types and tokens for a string :)
xdmp:describe(cts:tokenize("blue, green"))
=> (cts:word("blue"), cts:punctuation(","),
    cts:space(" "), cts:word("green"))
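
Because each token carries its own type, a predicate with instance of can select one kind of token without a full typeswitch. The following is a minimal sketch, not part of the original examples, that keeps only the cts:word tokens. The variable names $string and $words are illustrative. Assuming the default tokenization rules, it should return the number of words followed by the words joined with single spaces.

```xquery
xquery version "1.0-ml";

(: Keep only the word tokens, then count and rejoin them.
   $string and $words are illustrative names. :)
let $string := "The red, blue, green, and orange balloons were launched!"
(: The predicate drops cts:punctuation and cts:space tokens :)
let $words := cts:tokenize($string)[. instance of cts:word]
return (
  count($words),
  string-join($words, " ")
)
```

The same predicate pattern works for cts:punctuation or cts:space if you need to isolate those tokens instead.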