cts:cluster(
   $nodes as node()*,
   [$options as element()?]
) as element(cts:clustering)
Summary:
Produces a set of clusters from a sequence of nodes. The nodes can be
any set of nodes, and are typically the result of a cts:search
operation.
|
Parameters:
$nodes:
The sequence of nodes to cluster.
|
$options (optional):
An XML representation of the options for defining the clustering
parameters. The options node must be in the cts:cluster
namespace. The following is a sample options node:
<options xmlns="cts:cluster">
<label-max-terms>4</label-max-terms>
<max-clusters>6</max-clusters>
<use-db-config>true</use-db-config>
</options>
The cts:cluster options include:
<hierarchical-levels>
- An integer specifying how many hierarchical cluster levels the clusterer
should return. The default is
1, which means no hierarchical
clusters are returned.
<label-max-terms>
- An integer specifying the maximum number of terms to use in constructing
a cluster label. The default is
3.
<label-ignore-words>
- A space-separated list of words to exclude from cluster
labels. The default is to not exclude any words.
<details>
- A boolean that indicates whether additional details on the terms
used in label generation are to be included in the output. See the
documentation on cts:distinctive-terms for details on the format of the
terms returned. The default is
false, meaning no such details
are given.
<min-clusters>
- An integer specifying a minimum number of desired clusters returned
(at any hierarchical level).
However, if no satisfactory clustering can be produced at a given level,
only one cluster will be returned, regardless of this setting.
The default is
3.
<max-clusters>
- An integer specifying a maximum number of clusters that can be returned
(at any hierarchical level). The default is
15.
<overlapping>
- A boolean indicating whether it is acceptable for nodes to be
assigned to more than one cluster. The default is
false.
<max-terms>
- An integer value specifying the maximum number of distinct terms to
use in calculating the cluster. The default is
200.
Increasing the value will increase the cost (in terms of both time
and memory) of calculating the clusters, but may improve the quality
of the clusters.
<algorithm>
- A value indicating which clustering algorithm to use, either
k-means or lsi. The default is
k-means. The LSI algorithm is significantly more expensive
to compute, both in terms of time and space.
<num-tries>
- Specifies the number of times to run the clusterer against
the specified data. The default is 1.
Because of the way the algorithms work, running
the clusterer multiple times increases the number of terms and
tends to improve the accuracy of the clusters. It does so at the
cost of performance, because each run has to do more work.
<use-db-config>
- A boolean value indicating whether to use the current database configuration
for determining which terms to use. The default is
false,
which means that the default set of options, as well as any indexing
options you specify in the options node, will be
used for calculating the clusters and their labels. When set to
true, any indexing options set in the context database
configuration (including any field settings) are used, as well as any
default settings that you have not explicitly turned off in the options
node.
The options element also includes indexing options in the
http://marklogic.com/xdmp/database namespace.
These control which terms to use. Note that the use of certain
options, such as fast-case-sensitive-searches, will not
impact final results unless the term vector size is limited with
the max-terms option. Other options, such as
phrase-throughs, will only generate terms if some
other option is also enabled (in this case
fast-phrase-searches).
The database options are the same as the database options shown for
cts:distinctive-terms.
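As an illustration of how several of the options above combine, the following sketch trades extra computation for potentially better clusters by raising num-tries, and allows nodes to appear in more than one cluster. The element name //article, the search term, and the specific option values are hypothetical and chosen only for illustration; adjust them to match your content.

```xquery
xquery version "1.0-ml";

(: Hypothetical input: the first 50 matches for a word query.
   //article and "apache" are placeholders, not part of the API. :)
let $nodes := cts:search(//article, cts:word-query("apache"))[1 to 50]

(: Options documented above: between 2 and 8 clusters, a smaller term
   vector (max-terms) to limit cost, three clustering runs (num-tries),
   overlapping assignment, and label term details in the output. :)
let $options :=
  <options xmlns="cts:cluster" xmlns:db="http://marklogic.com/xdmp/database">
    <algorithm>k-means</algorithm>
    <min-clusters>2</min-clusters>
    <max-clusters>8</max-clusters>
    <max-terms>100</max-terms>
    <num-tries>3</num-tries>
    <overlapping>true</overlapping>
    <details>true</details>
    <db:stemmed-searches>advanced</db:stemmed-searches>
    <db:fast-phrase-searches>true</db:fast-phrase-searches>
  </options>

return cts:cluster($nodes, $options)
```

Because use-db-config is left at its default of false here, only the default options plus the two db: indexing options in the options node determine which terms are used.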
|
|
Example:
cts:cluster(
cts:search(//MILITARY, cts:word-query("apache"))[1 to 100],
<options xmlns="cts:cluster" xmlns:db="http://marklogic.com/xdmp/database">
<hierarchical-levels>2</hierarchical-levels>
<overlapping>false</overlapping>
<label-max-terms>3</label-max-terms>
<max-clusters>100</max-clusters>
<label-ignore-words>of the on in at a an for from by and</label-ignore-words>
<db:stemmed-searches>advanced</db:stemmed-searches>
<db:fast-phrase-searches>true</db:fast-phrase-searches>
<db:fast-element-word-searches>true</db:fast-element-word-searches>
<db:fast-element-phrase-searches>true</db:fast-element-phrase-searches>
</options>)
==>
<clustering xmlns="http://marklogic.com/cts">
<cluster id="123456" label="apache helicopters" count="7" nodes="3 34 31 98 34 23 39"/>
<cluster id="374632" label="apache linux" count="6" nodes="1 378 56 23 93 6"/>
<cluster id="3452231" label="navajo codetalkers" count="8" nodes="44 87 32 77 50 12 13 15"/>
...
<cluster id="2234" parent="123456" label="AH-64" count="2" nodes="3 39"/>
<cluster id="34321" parent="123456" label="air force" count="5" nodes="34 31 98 34 23"/>
<cluster id="34523" parent="374632" label="HTTP" count="3" nodes="1 56 23"/>
<cluster id="968" parent="374632" label="LAMP" count="3" nodes="378 93 6"/>
<options xmlns="cts:cluster" xmlns:db="http://marklogic.com/xdmp/database">
<algorithm>k-means</algorithm>
<db:stemmed-searches>advanced</db:stemmed-searches>
<db:fast-element-word-searches>true</db:fast-element-word-searches>
<db:fast-element-phrase-searches>true</db:fast-element-phrase-searches>
<db:language>en</db:language>
<max-clusters>100</max-clusters>
<min-clusters>2</min-clusters>
<hierarchical-levels>2</hierarchical-levels>
<initialization>smart</initialization>
<label-max-terms>3</label-max-terms>
<num-tries>1</num-tries>
<score>logtfidf</score>
<use-db-config>false</use-db-config>
</options>
</clustering>
|
Example:
cts:cluster(
cts:search(//function, "foo"),
<options xmlns="cts:cluster">
<use-db-config>true</use-db-config>
</options>)
=> The cts:clustering element
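The returned cts:clustering element can be processed like any other XML. The sketch below lists the document URI of each member of each cluster. It assumes that the nodes attribute holds 1-based positions into the input sequence; that interpretation is not stated on this page, so verify it against your own output before relying on it. The cluster-summary and uri element names are hypothetical.

```xquery
xquery version "1.0-ml";

(: Keep the input sequence in a variable so that cluster members can be
   looked up by position afterwards. :)
let $nodes := cts:search(//function, "foo")[1 to 100]
let $clustering :=
  cts:cluster(
    $nodes,
    <options xmlns="cts:cluster">
      <hierarchical-levels>2</hierarchical-levels>
    </options>)

for $cluster in $clustering/cts:cluster
(: Assumption: @nodes is a space-separated list of 1-based positions
   in $nodes. :)
let $positions :=
  for $p in fn:tokenize(fn:normalize-space($cluster/@nodes), " ")
  return xs:integer($p)
return
  <cluster-summary label="{$cluster/@label}"
                   level="{if ($cluster/@parent) then 'child' else 'top'}">
  {
    for $n in $nodes[fn:position() = $positions]
    return <uri>{xdmp:node-uri($n)}</uri>
  }
  </cluster-summary>
```

Clusters that carry a parent attribute belong to a lower hierarchical level, so the level attribute here separates top-level clusters from their children.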
|
|
|