cts:tokenize( $text as xs:string, [$language as xs:string?], [$field as xs:string?] ) as cts:token*
Tokenizes text into words, punctuation, and spaces. Returns the tokens as type cts:token, which has subtypes cts:word, cts:punctuation, and cts:space, all of which are subtypes of xs:string.
When you tokenize a string with cts:tokenize, each word is represented by an instance of cts:word, each punctuation character is represented by an instance of cts:punctuation, each set of adjacent spaces is represented by an instance of cts:space, and each set of adjacent line breaks is represented by an instance of cts:space.
Unlike the standard XQuery function fn:tokenize, cts:tokenize returns words, punctuation, and spaces as different types. You can therefore use a typeswitch to handle each type differently. For example, you can use cts:tokenize to remove all punctuation from a string, or create logic to test for the type and return different things for different types, as shown in the first two examples below.
You can use xdmp:describe to show how a given string will be tokenized. When run on the results of cts:tokenize, the xdmp:describe function returns the types and the values for each token. For a sample of this pattern, see the third example below.
(: Remove all punctuation :)
let $string := "The red, blue, green, and orange balloons were launched!"
let $noPunctuation :=
  for $token in cts:tokenize($string)
  return
    typeswitch ($token)
      case $token as cts:punctuation return ""
      case $token as cts:word return $token
      case $token as cts:space return $token
      default return ()
return string-join($noPunctuation, "")

=> The red blue green and orange balloons were launched
(: Insert the string "XX" before and after all punctuation tokens :)
let $string := "The red, blue, green, and orange balloons were launched!"
let $tokens := cts:tokenize($string)
return
  string-join(
    for $x in $tokens
    return
      if ($x instance of cts:punctuation)
      then (concat("XX", $x, "XX"))
      else ($x),
    "")

=> The redXX,XX blueXX,XX greenXX,XX and orange balloons were launchedXX!XX
(: Show the types and tokens for a string :)
xdmp:describe(cts:tokenize("blue, green"))

=> (cts:word("blue"), cts:punctuation(","), cts:space(" "), cts:word("green"))
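
Because each token carries its own type, a predicate can also select a single kind of token without a typeswitch. The following is a minimal sketch, not one of the original examples: it keeps only the cts:word tokens and counts them. The variable names $string and $words are illustrative.

```xquery
(: Count only the word tokens; punctuation and space tokens are filtered out :)
let $string := "The red, blue, green, and orange balloons were launched!"
let $words := cts:tokenize($string)[. instance of cts:word]
return count($words)
```

The same predicate form should work with cts:punctuation or cts:space when a different token type is of interest.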