MarkLogic Server 11.0 Product Documentation
cts:cluster

cts:cluster(
   $nodes as node()*,
   [$options as (element()|map:map)?]
) as element(cts:clustering)

Summary

Produces a set of clusters from a sequence of nodes. The nodes can be any set of nodes, and are typically the result of a cts:search operation.

Parameters

$nodes The sequence of nodes to cluster.


$options

An XML representation of the options for defining the clustering parameters. The options node must be in the cts:cluster namespace. The following is a sample options node:



    <options xmlns="cts:cluster">
      <label-max-terms>4</label-max-terms>
      <max-clusters>6</max-clusters>
      <use-db-config>true</use-db-config>
    </options>

The cts:cluster options include:

<hierarchical-levels>

An integer specifying how many hierarchical cluster levels the clusterer should return. The default is 1, which means no hierarchical clusters are returned.

<label-max-terms>

An integer specifying the maximum number of terms to use in constructing a cluster label. The default is 3.

<label-ignore-words>

A space-separated list of words to exclude from the cluster label. The default is to not exclude any words.

<label-ignore-attributes>

A boolean that indicates whether attribute terms should be excluded from the cluster label. The default is to include terms from attributes.

<details>

A boolean that indicates whether additional details on the terms used in label generation are to be included in the output. See the documentation on cts:distinctive-terms for details on the format of the terms returned. The default is false, meaning no such details are given.

<min-clusters>

An integer specifying a minimum number of desired clusters returned (at any hierarchical level). However, if no satisfactory clustering can be produced at a given level, only one cluster will be returned, regardless of this setting. The default is 3.

<max-clusters>

An integer specifying a maximum number of clusters that can be returned (at any hierarchical level). The default is 15.

<overlapping>

A boolean indicating whether it is acceptable for nodes to be assigned to more than one cluster. The default is false.

<max-terms>

An integer value specifying the maximum number of distinct terms to use in calculating the cluster. The default is 200. Increasing the value will increase the cost (in terms of both time and memory) of calculating the clusters, but may improve the quality of the clusters.

<algorithm>

A value indicating which clustering algorithm to use, either k-means or lsi. The default is k-means. The LSI algorithm is significantly more expensive to compute, both in terms of time and space.

<num-tries>

Specifies the number of times to run the clusterer against the specified data. The default is 1. Because of the way the algorithms work, running the clusterer multiple times increases the number of terms and tends to improve the accuracy of the clusters. It does so at the cost of performance, because each run has to do more work.

<use-db-config>

A boolean value indicating whether to use the current DB configuration for determining which terms to use. The default is false, which means that the default set of options, as well as any indexing options you specify in the options node, will be used for calculating the clusters and their labels. When set to true, any indexing options set in the context database configuration (including any field settings) are used, as well as any default settings that you have not explicitly turned off in the options node.

The options element also includes indexing options in the http://marklogic.com/xdmp/database namespace. These control which terms to use. Note that the use of certain options, such as fast-case-sensitive-searches, will not impact final results unless the term vector size is limited with the max-terms option. Other options, such as phrase-throughs, will only generate terms if some other option is also enabled (in this case fast-phrase-searches).

The database options are the same as the database options shown for cts:distinctive-terms.
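
The following is a minimal sketch of how several of the options described above might be combined in one call. The query term, the result window, and every option value are illustrative assumptions, not recommended settings. Raising num-tries and max-terms trades time and memory for potentially better clusters, as noted in the option descriptions.

```xquery
xquery version "1.0-ml";

(: Sketch only: the search term and all option values are illustrative. :)
cts:cluster(
  cts:search(fn:doc(), cts:word-query("apache"))[1 to 50],
  <options xmlns="cts:cluster" xmlns:db="http://marklogic.com/xdmp/database">
    <algorithm>k-means</algorithm>
    <num-tries>3</num-tries>
    <max-terms>400</max-terms>
    <min-clusters>2</min-clusters>
    <max-clusters>8</max-clusters>
    <details>true</details>
    <db:fast-phrase-searches>true</db:fast-phrase-searches>
  </options>)
```

Because use-db-config is left at its default of false here, only the default indexing options plus the db:fast-phrase-searches setting in the options node determine which terms are used.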

Example



cts:cluster(
  cts:search(//MILITARY, cts:word-query("apache"))[1 to 100],
  <options xmlns="cts:cluster" xmlns:db="http://marklogic.com/xdmp/database">
    <hierarchical-levels>2</hierarchical-levels>
    <overlapping>false</overlapping>
    <label-max-terms>3</label-max-terms>
    <max-clusters>100</max-clusters>
    <label-ignore-words>of the on in at a an for from by and</label-ignore-words>
    <db:stemmed-searches>advanced</db:stemmed-searches>
    <db:fast-phrase-searches>true</db:fast-phrase-searches>
    <db:fast-element-word-searches>true</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>true</db:fast-element-phrase-searches>
  </options>)
==>
 <clustering xmlns="http://marklogic.com/cts">
  <cluster id="123456" label="apache helicopters" count="7" nodes="3 34 31 98 34 23 39"/>
  <cluster id="374632" label="apache linux" count="6" nodes="1 378 56 23 93 6"/>
  <cluster id="3452231" label="navajo codetalkers" count="8" nodes="44 87 32 77 50 12 13 15"/>
  ...
  <cluster id="2234" parent-id="123456" label="AH-64" count="2" nodes="3 39"/>
  <cluster id="34321" parent-id="123456" label="air force" count="5" nodes="34 31 98 34 23"/>
  <cluster id="34523" parent-id="374632" label="HTTP" count="3" nodes="1 56 23"/>
  <cluster id="968" parent-id="374632" label="LAMP" count="3" nodes="378 93 6"/>
  <options xmlns="cts:cluster" xmlns:db="http://marklogic.com/xdmp/database">
    <algorithm>k-means</algorithm>
    <db:stemmed-searches>advanced</db:stemmed-searches>
    <db:fast-element-word-searches>true</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>true</db:fast-element-phrase-searches>
    <db:language>en</db:language>
    <max-clusters>100</max-clusters>
    <min-clusters>2</min-clusters>
    <hierarchical-levels>2</hierarchical-levels>
    <initialization>smart</initialization>
    <label-max-terms>3</label-max-terms>
    <num-tries>1</num-tries>
    <score>logtfidf</score>
    <use-db-config>false</use-db-config>
  </options>
</clustering>
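
The output above is a flat list in which child clusters point to their parent through parent-id. The sketch below regroups such a list into a nested view. It runs against a literal sample shaped like that output rather than a live cts:cluster call, and it assumes that a cluster without a parent-id attribute is a top-level cluster, which the sample output suggests but the text does not state. The group and subgroup element names are hypothetical.

```xquery
xquery version "1.0-ml";

(: Literal sample modeled on the example output; values are illustrative. :)
let $clustering :=
  <clustering xmlns="http://marklogic.com/cts">
    <cluster id="123456" label="apache helicopters" count="7"/>
    <cluster id="374632" label="apache linux" count="6"/>
    <cluster id="2234" parent-id="123456" label="AH-64" count="2"/>
    <cluster id="34321" parent-id="123456" label="air force" count="5"/>
    <cluster id="34523" parent-id="374632" label="HTTP" count="3"/>
    <cluster id="968" parent-id="374632" label="LAMP" count="3"/>
  </clustering>
(: Assumption: top-level clusters are those with no parent-id attribute. :)
for $top in $clustering/cts:cluster[fn:empty(@parent-id)]
return
  <group label="{$top/@label}">{
    (: Children are the clusters whose parent-id matches this cluster's id. :)
    for $child in $clustering/cts:cluster[@parent-id = $top/@id]
    return <subgroup label="{$child/@label}" count="{$child/@count}"/>
  }</group>
```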

Example

cts:cluster(
   cts:search(//function, "foo"),
   <options xmlns="cts:cluster">
     <use-db-config>true</use-db-config>
   </options>)
=> The cts:clustering element
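
To work with the returned cts:clustering element, it is often useful to map each cluster back to the nodes it contains. The sketch below assumes that each value in the nodes attribute is a 1-based position in the sequence passed as $nodes; this page does not state that explicitly, so verify it against your own output. The cluster-members and uri element names are hypothetical.

```xquery
xquery version "1.0-ml";

(: Keep the input sequence in a variable so positions can be resolved later. :)
let $results := cts:search(fn:doc(), cts:word-query("apache"))[1 to 100]
let $clustering :=
  cts:cluster($results,
    <options xmlns="cts:cluster">
      <max-clusters>6</max-clusters>
    </options>)
for $cluster in $clustering/cts:cluster
(: Assumption: @nodes holds space-separated 1-based positions in $results. :)
let $positions :=
  for $p in fn:tokenize(fn:normalize-space($cluster/@nodes), " ")
  return xs:integer($p)
return
  <cluster-members label="{$cluster/@label}">{
    for $n in $results[fn:position() = $positions]
    return <uri>{xdmp:node-uri($n)}</uri>
  }</cluster-members>
```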
