
cts.tokenize

cts.tokenize(
   $text as String,
   [$language as String?],
   [$field as String?]
) as Sequence

Summary

Tokenizes text into words, punctuation, and spaces. Returns output in the type cts:token, which has subtypes cts:word, cts:punctuation, and cts:space, all of which are subtypes of xs:string.

Parameters
$text A word or phrase to tokenize.
$language A language to use for tokenization. If not supplied, the database default language is used.
$field A field to use for tokenization. If the field has custom tokenization rules, they will be used. If no field is supplied or the field has no custom tokenization rules, the default tokenization rules are used.

Usage Notes

When you tokenize a string with cts:tokenize, each word is represented by an instance of cts:word, each punctuation character is represented by an instance of cts:punctuation, each set of adjacent spaces is represented by an instance of cts:space, and each set of adjacent line breaks is represented by an instance of cts:space.
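
To picture the three token classes described above, here is a minimal plain-JavaScript sketch that runs outside MarkLogic. It is only an approximation built on a regular expression: `sketchTokenize` and its `{ kind, text }` objects are hypothetical names invented for this illustration, and the real cts.tokenize is language-aware and may split text differently. The sketch also assumes that any run of whitespace (spaces or line breaks) forms a single space token.

```javascript
// Illustrative only: a regex approximation of the three token classes.
// This is NOT MarkLogic's tokenizer, which applies language-specific rules.
function sketchTokenize(text) {
  const tokens = [];
  // One alternative per class: a run of whitespace, a run of letters/digits,
  // or a single other character (treated here as punctuation).
  const pattern = /(\s+)|([\p{L}\p{N}]+)|([^\s\p{L}\p{N}])/gu;
  for (const match of text.matchAll(pattern)) {
    if (match[1] !== undefined) {
      tokens.push({ kind: "space", text: match[1] });
    } else if (match[2] !== undefined) {
      tokens.push({ kind: "word", text: match[2] });
    } else {
      tokens.push({ kind: "punctuation", text: match[3] });
    }
  }
  return tokens;
}

const kinds = sketchTokenize("blue, green").map((t) => t.kind);
console.log(kinds.join(" ")); // prints: word punctuation space word
```

Each punctuation character becomes its own token, while adjacent whitespace collapses into one, which mirrors the description above.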

Unlike the standard XQuery function fn:tokenize, cts:tokenize returns words, punctuation, and spaces as different types, so you can branch on the type of each token (with a typeswitch in XQuery, or by comparing type names in JavaScript). For example, you can use cts:tokenize to remove all punctuation from a string, or return different things for different token types, as shown in the first two examples below.
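
One way to organize per-type handling is a small table of handlers keyed by token kind, instead of a chain of if/else tests. The sketch below is plain JavaScript with stand-in tokens so that it runs anywhere. `rewriteTokens`, `FakeToken`, and the `kindOf` callback are hypothetical names for this illustration, not part of the MarkLogic API. Inside MarkLogic, `kindOf` could presumably be derived from `sc.name(sc.type(token))` as in the examples below, but that mapping is an assumption and is not shown here.

```javascript
// Stand-in for a token: carries a kind label and converts to its text.
class FakeToken {
  constructor(kind, text) {
    this.kind = kind;
    this.text = text;
  }
  toString() {
    return this.text;
  }
}

// Apply a per-kind handler to each token and concatenate the results.
// A handler returns replacement text, or null to drop the token.
// Tokens whose kind has no handler pass through as their own text.
function rewriteTokens(tokens, kindOf, handlers) {
  const out = [];
  for (const token of tokens) {
    const handler = handlers[kindOf(token)];
    const text = handler ? handler(token) : String(token);
    if (text !== null) {
      out.push(text);
    }
  }
  return out.join("");
}

const sample = [
  new FakeToken("word", "blue"),
  new FakeToken("punctuation", ","),
  new FakeToken("space", "  "),
  new FakeToken("word", "green"),
];

const result = rewriteTokens(sample, (t) => t.kind, {
  punctuation: () => null, // drop punctuation tokens
  space: () => " ",        // collapse each space token to one space
});
console.log(result); // prints: blue green
```

Keeping the kind lookup in a caller-supplied function means the rewriting logic does not depend on how the token type is obtained.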

You can use xdmp:describe to show how a given string will be tokenized. When run on the results of cts:tokenize, the xdmp:describe function returns the types and the values for each token. For a sample of this pattern, see the third example below.

Example

// Remove all punctuation, normalize space
const string = "The red, blue, green, and orange \
                balloons were launched!";
const wordType = fn.QName("http://marklogic.com/cts", "word");
const noPunctuation = [];
for (const token of cts.tokenize(string)) {
  // Keep only word tokens; punctuation and space tokens are skipped.
  if (fn.deepEqual(sc.name(sc.type(token)), wordType)) {
    noPunctuation.push(token);
  }
}
noPunctuation.join(" ");

=> The red blue green and orange balloons were launched

Example

// Insert the string "XX" before and after
//   all punctuation tokens
const str = "The red, blue, green, and orange \
                 balloons were launched!";
const punctuationType = fn.QName("http://marklogic.com/cts", "punctuation");
const res = [];
for (const x of cts.tokenize(str)) {
  if (fn.deepEqual(sc.name(sc.type(x)), punctuationType)) {
    res.push(fn.concat("XX", x, "XX"));
  } else {
    res.push(x);
  }
}
// Join with no separator: the space tokens are already part of res.
fn.normalizeSpace(res.join(""));

=> The redXX,XX blueXX,XX greenXX,XX and orange balloons were launchedXX!XX

Example

// show the types and tokens for a string
xdmp.describe(cts.tokenize("blue, green"), 20)

=> Sequence("blue", ",", " ", "green")

// the same example, iterating over the Sequence results
const res = [];
for (const x of cts.tokenize("blue, green")) {
  // x is already a token, so inspect its type directly
  res.push(sc.name(sc.type(x)));
}
res;

=> ["cts:word","cts:punctuation","cts:space","cts:word"]

Comments

  • I was looking at the docs for admin:database-tokenizer-override and found that there are four tokenization classes: "word", "space", "punctuation", and "symbol". The cts:tokenize function tokenizes text into three types: words, punctuation, and spaces. Can anyone please explain the absence of symbol in the latter list? I am probably failing to connect some dots here.
  • Hi, can we use a different tokenization method? And can we define a custom analyzer (perhaps only the tokenization step) for the document indexing and search process?
    • Are you looking for custom tokenization? If so, there is a chapter in the <a href="http://docs.marklogic.com/guide/search-dev/custom-tokenization">Search Developer's Guide on custom tokenization</a>.
      • Maybe not. I want to find an appropriate way to tokenize sentences, and the chapter may not help. I want to recognize the entities in a query sentence, so I want to get the entities during tokenization. For now, my solution is to build a <strong>trie tree</strong> and use <strong>maximum forward matching</strong> to query the tree via JavaScript in the <strong>MarkLogic app server</strong>. But you told me that when I store a JavaScript object with <a href="https://docs.marklogic.com/xdmp.setServerField">xdmp.setServerField</a>, it will be converted. So I want to find a proper way to handle this requirement.
        • Most of our customers use an external service to recognize entities. Once the entities are found, you should materialize them in the document, stored as metadata using the <a href="https://mlu.marklogic.com/ondemand/3bc5ff74">Envelope Pattern</a>. If you'd like more detail than that, I suggest asking on <a href="http://stackoverflow.com/questions/ask?tags=marklogic">Stack Overflow</a>, where we'd have better formatting and more people will see your question.