cts.tokenize( text as String, [language as String?], [field as String?] ) as Sequence
Tokenizes text into words, punctuation, and spaces. Returns output in the type cts:token, which has subtypes cts:word, cts:punctuation, and cts:space, all of which are subtypes of xs:string.
When you tokenize a string with cts.tokenize, each word is represented by an instance of cts:word, each punctuation character is represented by an instance of cts:punctuation, each set of adjacent spaces is represented by an instance of cts:space, and each set of adjacent line breaks is represented by an instance of cts:space.
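The grouping rules above can be approximated outside MarkLogic with a small plain-JavaScript sketch. This is not the server's tokenizer: `sketchTokenize` is a hypothetical helper built on a simple regular expression, it ignores the `language` and `field` arguments, it assumes ASCII-style text, and it treats any run of whitespace as one space token. It only illustrates that adjacent spaces or line breaks collapse into a single token while each punctuation character becomes its own token.

```javascript
// Hypothetical stand-in for illustration only; NOT cts.tokenize.
// Assumption: letters and digits form words, a run of whitespace forms one
// space token, and any other single character is a punctuation token.
function sketchTokenize(text) {
  const tokens = [];
  // Alternatives: run of word characters | run of whitespace | one other character
  const pattern = /([A-Za-z0-9]+)|(\s+)|([^A-Za-z0-9\s])/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    if (match[1] !== undefined) {
      tokens.push({ kind: "word", value: match[1] });
    } else if (match[2] !== undefined) {
      tokens.push({ kind: "space", value: match[2] });
    } else {
      tokens.push({ kind: "punctuation", value: match[3] });
    }
  }
  return tokens;
}

// Two adjacent spaces and two adjacent line breaks each yield one space token.
const kinds = sketchTokenize("blue,  green!\n\nred").map((t) => t.kind);
// kinds: ["word", "punctuation", "space", "word", "punctuation", "space", "word"]
```

On a real server the token boundaries come from cts.tokenize itself, so results for non-ASCII or language-specific text may differ from this sketch.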
Unlike the standard XQuery function fn:tokenize, cts.tokenize returns words, punctuation, and spaces as different types. You can therefore branch on each token's type (the equivalent of an XQuery typeswitch) to handle each type differently. For example, you can use cts.tokenize to remove all punctuation from a string, or create logic to test for the type and return different things for different types, as shown in the first two examples below.
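One way to organize per-type handling is a lookup table with one handler per token kind, sketched here in plain JavaScript so it runs without a server. The `tokens` records, the `handlers` table, and the `rewrite` function are hypothetical names used for illustration. On MarkLogic the kind would instead be derived from each token's type, for example by comparing `sc.name(sc.type(token))` against the cts QNames.

```javascript
// Hypothetical token records standing in for the output of cts.tokenize.
// Assumption: each record carries its kind explicitly; on the server the
// kind would be read from the token's type instead.
const tokens = [
  { kind: "word", value: "blue" },
  { kind: "punctuation", value: "," },
  { kind: "space", value: " " },
  { kind: "word", value: "green" },
];

// One handler per token kind; each returns the text to emit for that token.
const handlers = {
  word: (value) => value.toUpperCase(), // transform words
  punctuation: () => "",                // drop punctuation
  space: () => " ",                     // normalize any space run to one space
};

// Apply the matching handler to every token and join the results.
function rewrite(tokenList) {
  return tokenList.map((t) => handlers[t.kind](t.value)).join("");
}

const rewritten = rewrite(tokens);
// rewritten: "BLUE GREEN"
```

A table keeps each type's behavior in one place, which can be easier to extend than a chain of if/else branches when more than two or three kinds need distinct treatment.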
You can use xdmp.describe to show how a given string will be tokenized. When run on the results of cts.tokenize, the xdmp.describe function returns the types and the values for each token. For a sample of this pattern, see the third example below.
// Remove all punctuation, normalize space
const string = "The red, blue, green, and orange balloons were launched!";
const noPunctuation = [];
for (const token of cts.tokenize(string)) {
  if (fn.deepEqual(sc.name(sc.type(token)),
        fn.QName("http://marklogic.com/cts", "punctuation"))) {
    // skip punctuation tokens
  } else if (fn.deepEqual(sc.name(sc.type(token)),
        fn.QName("http://marklogic.com/cts", "word"))) {
    noPunctuation.push(token);
  } else if (fn.deepEqual(sc.name(sc.type(token)),
        fn.QName("http://marklogic.com/cts", "space"))) {
    // skip space tokens; words are rejoined with single spaces below
  }
}
noPunctuation.join(" ");
=> The red blue green and orange balloons were launched
// Insert the string "XX" before and after
// all punctuation tokens
const str = "The red, blue, green, and orange balloons were launched!";
const tokens = cts.tokenize(str);
const res = [];
for (const x of tokens) {
  if (fn.deepEqual(sc.name(sc.type(x)),
        fn.QName("http://marklogic.com/cts", "punctuation"))) {
    res.push(fn.concat("XX", x, "XX"));
  } else {
    res.push(x);
  }
}
// Join with an empty separator: the space tokens already carry the spacing,
// so joining with " " would put a space between each word and "XX".
fn.normalizeSpace(res.join(""));
=> The redXX,XX blueXX,XX greenXX,XX and orange balloons were launchedXX!XX
// show the types and tokens for a string
xdmp.describe(cts.tokenize("blue, green"), 20)
=> Sequence("blue", ",", " ", "green")

// the same example, iterating over the Sequence results
const res = new Array();
for (const x of cts.tokenize("blue, green")) {
  res.push(sc.name(sc.type(cts.tokenize(x))));
};
res;
=> ["cts:word","cts:punctuation","cts:space","cts:word"]