

cts.tokenize(
   $text as String,
   [$language as String?],
   [$field as String?]
) as Sequence


Tokenizes text into words, punctuation, and spaces. Returns output in the type cts:token, which has subtypes cts:word, cts:punctuation, and cts:space, all of which are subtypes of xs:string.

Parameters

$text
A word or phrase to tokenize.

$language
A language to use for tokenization. If not supplied, the database default language is used.

$field
A field to use for tokenization. If the field has custom tokenization rules, those rules are used. If no field is supplied, or the field has no custom tokenization rules, the default tokenization rules are used.

Usage Notes

When you tokenize a string with cts:tokenize, each word is represented by an instance of cts:word, each punctuation character by an instance of cts:punctuation, and each run of adjacent spaces or line breaks by a single instance of cts:space.

Unlike the standard XQuery function fn:tokenize, cts:tokenize returns words, punctuation, and spaces as different types. You can therefore use a typeswitch to handle each type differently. For example, you can use cts:tokenize to remove all punctuation from a string, or create logic to test for the type and return different things for different types, as shown in the first two examples below.

You can use xdmp:describe to show how a given string will be tokenized. When run on the results of cts:tokenize, the xdmp:describe function returns the types and the values for each token. For a sample of this pattern, see the third example below.
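For readers without a MarkLogic instance at hand, the classify-by-token-type idea can be sketched in plain JavaScript. This regex-based tokenizer is an illustration only; it does not follow MarkLogic's actual tokenization rules (which are language-aware), but it mirrors the three cts:token subtypes described above:

```javascript
// Standalone sketch (plain JavaScript, not MarkLogic's tokenizer):
// split a string into word, space, and punctuation tokens.
function simpleTokenize(text) {
  // Runs of word characters, runs of whitespace, or single punctuation marks.
  var pattern = /[A-Za-z0-9]+|\s+|[^\sA-Za-z0-9]/g;
  return (text.match(pattern) || []).map(function (value) {
    var type = /^[A-Za-z0-9]/.test(value) ? "word"
             : /^\s/.test(value)          ? "space"
             : "punctuation";
    return { type: type, value: value };
  });
}

simpleTokenize("blue, green").map(function (t) { return t.type; });
// => ["word", "punctuation", "space", "word"]
```

The token types line up with what xdmp.describe reports for the same input in the examples below.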


// Remove all punctuation, normalize space
var string = "The red, blue, green, and orange \
              balloons were launched!";
var noPunctuation = new Array();
for (var token of cts.tokenize(string)) {
  if (fn.deepEqual(sc.name(sc.type(token)),
        fn.QName("http://marklogic.com/cts", "punctuation"))) {
    // drop punctuation tokens
  } else if (fn.deepEqual(sc.name(sc.type(token)),
        fn.QName("http://marklogic.com/cts", "word"))) {
    noPunctuation.push(token);
  } else if (fn.deepEqual(sc.name(sc.type(token)),
        fn.QName("http://marklogic.com/cts", "space"))) {
    // drop space tokens; single spaces are re-added by join below
  }
}
noPunctuation.join(" ");

=> The red blue green and orange balloons were launched


// Insert the string "XX" before and after
//   all punctuation tokens
var str = "The red, blue, green, and orange \
           balloons were launched!";
var tokens = cts.tokenize(str);
var res = new Array();
for (var x of tokens) {
  if (fn.deepEqual(sc.name(sc.type(x)),
        fn.QName("http://marklogic.com/cts", "punctuation"))) {
    res.push(fn.concat("XX", x, "XX"));
  } else {
    res.push(x);
  }
}
fn.normalizeSpace(res.join(""));

=> The redXX,XX blueXX,XX greenXX,XX and orange balloons were launchedXX!XX


// Show the types and tokens for a string
xdmp.describe(cts.tokenize("blue, green"), 20)

=> Sequence("blue", ",", " ", "green")

// The same example, iterating over the Sequence results.
// Each item is already a token, so pass it directly to sc.type.
var res = new Array();
for (var x of cts.tokenize("blue, green")) {
  res.push(sc.name(sc.type(x)));
}
res;

=> ["cts:word","cts:punctuation","cts:space","cts:word"]


  • Hi, can we use another tokenization method? And can we define the analysis (perhaps only the tokenization) used when indexing and searching documents?
    • Are you looking for custom tokenization? If so, the Search Developer's Guide has a chapter on custom tokenization (http://docs.marklogic.com/guide/search-dev/custom-tokenization).
      • Maybe not. I want to find an appropriate way to tokenize sentences, and that chapter probably cannot help. I want to recognize the entities in a query sentence, so I want to extract entities during tokenization. My current solution is to build a trie and use maximum forward matching to query the tree via JavaScript in the MarkLogic app server. But you told me that when I store a JavaScript object with xdmp.setServerField (https://docs.marklogic.com/xdmp.setServerField), it gets converted. So I am looking for a reliable way to handle this requirement.
        • Most of our customers use an external service to recognize entities. Once the entities are found, you should materialize them in the document, stored as metadata using the Envelope Pattern (https://mlu.marklogic.com/ondemand/3bc5ff74). If you'd like more detail than that, I suggest asking on Stack Overflow (http://stackoverflow.com/questions/ask?tags=marklogic), where we'd have better formatting and more people will see your question.
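The trie-plus-maximum-forward-matching approach mentioned in the comment above can be sketched in standalone JavaScript. This is a hypothetical illustration (the dictionary entries are made up), not MarkLogic-specific code:

```javascript
// Build a trie from a list of dictionary entries.
function buildTrie(words) {
  var root = {};
  for (var w of words) {
    var node = root;
    for (var ch of w) {
      node = node[ch] || (node[ch] = {});
    }
    node.end = true; // marks a complete dictionary entry
  }
  return root;
}

// Maximum forward matching: at each position, take the longest
// dictionary entry that starts there; otherwise advance one character.
function findEntities(trie, text) {
  var found = [];
  var i = 0;
  while (i < text.length) {
    var node = trie, longest = 0;
    for (var j = i; j < text.length && node[text[j]]; j++) {
      node = node[text[j]];
      if (node.end) longest = j - i + 1;
    }
    if (longest > 0) {
      found.push(text.slice(i, i + longest));
      i += longest;
    } else {
      i += 1;
    }
  }
  return found;
}

var trie = buildTrie(["red balloon", "red", "orange"]);
findEntities(trie, "the red balloon and the orange one");
// => ["red balloon", "orange"]
```

Note that "red balloon" wins over "red" because the match is extended as far as the trie allows before committing, which is the "maximum forward" part of the technique.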