MarkLogic Server 11.0 Product Documentation
cts.tokenize

cts.tokenize(
   text as String,
   [language as String?],
   [field as String?]
) as Sequence

Summary

Tokenizes text into words, punctuation, and spaces. Returns a sequence of cts:token values. The cts:token type has the subtypes cts:word, cts:punctuation, and cts:space, all of which are subtypes of xs:string.

Parameters
text A word or phrase to tokenize.
language A language to use for tokenization. If not supplied, the database default language is used.
field A field to use for tokenization. If the field has custom tokenization rules, they will be used. If no field is supplied or the field has no custom tokenization rules, the default tokenization rules are used.

Usage Notes

When you tokenize a string with cts.tokenize, each word is represented by an instance of cts:word, each punctuation character by an instance of cts:punctuation, and each run of adjacent spaces or of adjacent line breaks by an instance of cts:space.
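The classification rules above can be approximated in plain JavaScript. The sketch below is only an illustration and is not how MarkLogic tokenizes: the function name approximateTokenize and the regular expression are assumptions, it ignores the language and field parameters, and it treats any run of whitespace (spaces and line breaks mixed) as one space token.

```javascript
// Rough stand-in for the rules described above. NOT the MarkLogic tokenizer:
// it knows nothing about languages, fields, or custom tokenization rules.
function approximateTokenize(text) {
  const tokens = [];
  // Alternatives, tried in order at each position:
  //   1. a run of letters or digits        -> word
  //   2. a run of whitespace               -> space
  //   3. any single remaining character    -> punctuation
  const pattern = /([\p{L}\p{N}]+)|(\s+)|([^\s])/gu;
  for (const match of text.matchAll(pattern)) {
    if (match[1] !== undefined) {
      tokens.push({ type: "cts:word", value: match[1] });
    } else if (match[2] !== undefined) {
      tokens.push({ type: "cts:space", value: match[2] });
    } else {
      tokens.push({ type: "cts:punctuation", value: match[3] });
    }
  }
  return tokens;
}

const approxTypes = approximateTokenize("blue, green").map((t) => t.type);
// approxTypes: ["cts:word", "cts:punctuation", "cts:space", "cts:word"]
```

Because every character lands in exactly one token, concatenating the token values reproduces the input string.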

Unlike the standard function fn.tokenize (fn:tokenize in XQuery), cts.tokenize returns words, punctuation, and spaces as different types. You can therefore test the type of each token and handle each type differently (the XQuery equivalent is a typeswitch). For example, you can use cts.tokenize to remove all punctuation from a string, or test each token's type and return a different result for each type, as shown in the first two examples below.
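The idea of handling each token type differently can be sketched as a small dispatch table. This is a plain JavaScript illustration that runs outside MarkLogic: rewriteByType and the { type, value } token shape are hypothetical, not part of the MarkLogic API. Inside MarkLogic, the type of each token would instead come from a test such as sc.name(sc.type(token)).

```javascript
// Hypothetical helper, not a MarkLogic function.
// typedTokens: array of { type, value } pairs (an assumed shape).
// handlers: object mapping a type name to a function that returns the
//   replacement string, or null to drop the token.
function rewriteByType(typedTokens, handlers) {
  const pieces = [];
  for (const { type, value } of typedTokens) {
    const hasHandler = Object.prototype.hasOwnProperty.call(handlers, type);
    // Tokens with no handler pass through unchanged.
    const piece = hasHandler ? handlers[type](value) : value;
    if (piece !== null) pieces.push(piece);
  }
  return pieces.join("");
}

// Hand-written stand-in for the tokens of "blue, green".
const sample = [
  { type: "word", value: "blue" },
  { type: "punctuation", value: "," },
  { type: "space", value: " " },
  { type: "word", value: "green" },
];

const wrapped = rewriteByType(sample, { punctuation: (v) => "XX" + v + "XX" });
// wrapped: "blueXX,XX green"
const stripped = rewriteByType(sample, { punctuation: () => null });
// stripped: "blue green"
```

Keeping the per-type behavior in a table makes it easy to add a handler for another token type without touching the loop.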

You can use xdmp.describe to show how a given string will be tokenized. When run on the results of cts.tokenize, xdmp.describe returns the type and the value of each token. For a sample of this pattern, see the third example below.

Example

// Remove all punctuation, normalize space
const string = "The red, blue, green, and orange \
                balloons were launched!";
const ctsNs = "http://marklogic.com/cts";
const noPunctuation = [];
for (const token of cts.tokenize(string)) {
  const typeName = sc.name(sc.type(token));
  if (fn.deepEqual(typeName, fn.QName(ctsNs, "punctuation"))) {
    // punctuation: drop
  } else if (fn.deepEqual(typeName, fn.QName(ctsNs, "word"))) {
    noPunctuation.push(token);   // word: keep
  } else if (fn.deepEqual(typeName, fn.QName(ctsNs, "space"))) {
    // space: drop; join() below puts a single space between words
  }
}
noPunctuation.join(" ");

=> The red blue green and orange balloons were launched

Example

// Insert the string "XX" before and after
//   all punctuation tokens
const str = "The red, blue, green, and orange \
                 balloons were launched!";
const punctuationType = fn.QName("http://marklogic.com/cts", "punctuation");
const res = [];
for (const x of cts.tokenize(str)) {
  if (fn.deepEqual(sc.name(sc.type(x)), punctuationType)) {
    res.push(fn.concat("XX", x, "XX"));
  } else {
    res.push(x);
  }
}
// Join with no separator: the cts:space tokens already carry the spacing.
fn.normalizeSpace(res.join(""));

=> The redXX,XX blueXX,XX greenXX,XX and orange balloons were launchedXX!XX

Example

// show the types and tokens for a string
xdmp.describe(cts.tokenize("blue, green"), 20)

=> Sequence("blue", ",", " ", "green")

// the same example, iterating over the Sequence results
const res = [];
for (const x of cts.tokenize("blue, green")) {
  // Each x is already a token, so its type can be read directly.
  res.push(sc.name(sc.type(x)));
}
res;

=> ["cts:word","cts:punctuation","cts:space","cts:word"]