
cts.tokenize( text as String, [language as String?], [field as String?] ) as Sequence
Tokenizes text into words, punctuation, and spaces. Returns a Sequence
of values of type cts:token, which has the subtypes
cts:word, cts:punctuation, and
cts:space. All of these are subtypes of
xs:string.
When you tokenize a string with cts:tokenize, each word is
represented by an instance of
cts:word, each punctuation character
is represented by an instance of cts:punctuation,
each set of adjacent spaces is represented by an instance of
cts:space, and each set of adjacent line breaks
is represented by an instance of cts:space.
Unlike the standard XQuery function fn:tokenize,
cts:tokenize returns words, punctuation, and spaces
as different types. You can therefore test the type of each token and
handle each type differently (with a typeswitch in XQuery, or by comparing
type names in JavaScript). For example, you can use cts:tokenize to remove
all punctuation from a string, or create logic to test for the type and
return different things for different types, as shown in the first
two examples below.
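The per-type handling described above can also be written as a table of handlers keyed by type name. The following is a minimal sketch in plain JavaScript, not part of the MarkLogic API: mapTokens, typeOf, textOf, and the sampleTokens stand-in data are all hypothetical names introduced for illustration. Inside MarkLogic, you would pass the result of cts.tokenize and derive each token's type name from sc.name(sc.type(token)), as the examples on this page do.

```javascript
// Sketch: apply one handler per token type (plain JavaScript).
// typeOf and textOf are caller-supplied accessors, so the same function
// can run on real tokens in MarkLogic or on stand-in data elsewhere.
function mapTokens(tokens, typeOf, textOf, handlers) {
  const out = [];
  for (const token of tokens) {
    const text = textOf(token);
    const handler = handlers[typeOf(token)];
    // Tokens with no registered handler pass through unchanged.
    out.push(handler ? handler(text) : text);
  }
  return out.join("");
}

// Hypothetical stand-in for the tokens of "blue, green"; these plain
// objects only mimic the word/punctuation/space split and are an
// assumption made so the sketch runs outside MarkLogic.
const sampleTokens = [
  { type: "cts:word", value: "blue" },
  { type: "cts:punctuation", value: "," },
  { type: "cts:space", value: " " },
  { type: "cts:word", value: "green" },
];

const result = mapTokens(
  sampleTokens,
  (t) => t.type,
  (t) => t.value,
  {
    "cts:word": (text) => text.toUpperCase(), // upper-case every word
    "cts:punctuation": () => "",              // drop punctuation
  }
);
// result is "BLUE GREEN"
```

Keeping the handlers in a table means adding behavior for another token type is a one-line change, and any type without a handler is copied through as-is.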
You can use xdmp:describe to show how a given string will be
tokenized. When run on the results of cts:tokenize, the
xdmp:describe function returns the types and the values
for each token. For a sample of this pattern, see the third example below.
// Remove all punctuation, normalize space
const string = "The red, blue, green, and orange \
balloons were launched!";
const noPunctuation = [];
for (const token of cts.tokenize(string)) {
  if (fn.deepEqual(sc.name(sc.type(token)),
      fn.QName("http://marklogic.com/cts", "punctuation"))) {
    // drop punctuation tokens
  } else if (fn.deepEqual(sc.name(sc.type(token)),
      fn.QName("http://marklogic.com/cts", "word"))) {
    // keep word tokens
    noPunctuation.push(token);
  } else if (fn.deepEqual(sc.name(sc.type(token)),
      fn.QName("http://marklogic.com/cts", "space"))) {
    // drop space tokens; join() below puts one space between words
  }
}
noPunctuation.join(" ");
=> The red blue green and orange balloons were launched
// Insert the string "XX" before and after
// all punctuation tokens
const str = "The red, blue, green, and orange \
balloons were launched!";
const tokens = cts.tokenize(str);
const res = [];
for (const x of tokens) {
  if (fn.deepEqual(sc.name(sc.type(x)),
      fn.QName("http://marklogic.com/cts", "punctuation"))) {
    res.push(fn.concat("XX", x, "XX"));
  } else {
    res.push(x);
  }
}
// Join with an empty separator: the space tokens are already in res
fn.normalizeSpace(res.join(""));
=> The redXX,XX blueXX,XX greenXX,XX and orange balloons were launchedXX!XX
// show the types and tokens for a string
xdmp.describe(cts.tokenize("blue, green"), 20)
=> Sequence("blue", ",", " ", "green")
// the same example, iterating over the Sequence results
const res = [];
for (const x of cts.tokenize("blue, green")) {
  // x is already a token, so read its type directly
  res.push(sc.name(sc.type(x)));
}
res;
=> ["cts:word","cts:punctuation","cts:space","cts:word"]