MarkLogic Server 11.0 Product Documentation
cts.cluster

cts.cluster(
   nodes as Array,
   [options as Object?]
) as Object

Summary

Produces a set of clusters from an array of nodes. The nodes can be any set of nodes, and are typically the result of a cts.search operation.

Parameters

nodes

The array of nodes to cluster.


options

An object representation of the options for defining the clustering parameters. The following is a sample options object:



    {
      labelMaxTerms: 4,
      maxClusters: 6,
      useDbConfig: true
    }

The cts.cluster options include:

hierarchicalLevels

An integer specifying how many hierarchical cluster levels the clusterer should return. The default is 1, which means no hierarchical clusters are returned.

labelMaxTerms

An integer specifying the maximum number of terms to use in constructing a cluster label. The default is 3.

labelIgnoreWords

A single word or an array of words to exclude from the cluster label. The default is to not exclude any words.

labelIgnoreAttributes

A boolean that indicates whether attribute terms should be excluded from the cluster label. The default is to include terms from attributes.

details

A boolean that indicates whether additional details on the terms used in label generation are to be included in the output. See the documentation on cts.distinctiveTerms for details on the format of the terms returned. The default is false, meaning no such details are given.

minClusters

An integer specifying the minimum number of clusters to return (at any hierarchical level). However, if no satisfactory clustering can be produced at a given level, only one cluster is returned, regardless of this setting. The default is 3.

maxClusters

An integer specifying a maximum number of clusters that can be returned (at any hierarchical level). The default is 15.

overlapping

A boolean indicating whether it is acceptable for nodes to be assigned to more than one cluster. The default is false.

maxTerms

An integer value specifying the maximum number of distinct terms to use in calculating the cluster. The default is 200. Increasing the value will increase the cost (in terms of both time and memory) of calculating the clusters, but may improve the quality of the clusters.

algorithm

A value indicating which clustering algorithm to use, either k-means or lsi. The default is k-means. The LSI algorithm is significantly more expensive to compute, both in terms of time and space.

numTries

Specifies the number of times to run the clusterer against the specified data. The default is 1. Because of the way the algorithms work, running the clusterer multiple times increases the number of terms and tends to improve the accuracy of the clusters. It does so at the cost of performance, because each run has to do more work.

useDbConfig

A boolean value indicating whether to use the current database configuration for determining which terms to use. The default is false, which means that the default set of options, as well as any indexing options you specify in the options object, is used for calculating the clusters and their labels. When set to true, any indexing options set in the context database configuration (including any field settings) are used, as well as any default settings that you have not explicitly turned off in the options object.

The options object also includes indexing options in the http://marklogic.com/xdmp/database namespace. These control which terms to use. Note that the use of certain options, such as fastCaseSensitiveSearches, will not impact final results unless the term vector size is limited with the maxTerms option. Other options, such as phraseThroughs, will only generate terms if some other option is also enabled (in this case fastPhraseSearches).

The database options are the same as the database options shown for cts.distinctiveTerms.
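
The defaults listed above can be collected into a plain object so that the effective clustering options for a call are easy to inspect. The following is a minimal sketch, not part of the MarkLogic API: `DOCUMENTED_DEFAULTS` and `withDocumentedDefaults` are hypothetical names, and the values are copied from the option descriptions on this page. The server applies its own defaults, so the helper only makes them explicit in calling code.

```javascript
// Hypothetical helper, not part of MarkLogic: the values below are copied
// from the option descriptions on this page.
const DOCUMENTED_DEFAULTS = Object.freeze({
  hierarchicalLevels: 1,
  labelMaxTerms: 3,
  labelIgnoreAttributes: false,
  details: false,
  minClusters: 3,
  maxClusters: 15,
  overlapping: false,
  maxTerms: 200,
  algorithm: "k-means",
  numTries: 1,
  useDbConfig: false
});

// Returns a new object in which caller-supplied options override the
// documented defaults. Keys that are not listed above (for example indexing
// options such as fastPhraseSearches) are passed through unchanged.
function withDocumentedDefaults(options = {}) {
  return Object.assign({}, DOCUMENTED_DEFAULTS, options);
}

const effective = withDocumentedDefaults({
  labelMaxTerms: 4,
  maxClusters: 6,
  useDbConfig: true
});
console.log(effective.maxClusters); // 6
console.log(effective.algorithm);   // "k-means"
```

In a MarkLogic module, the returned object could then be passed as the second argument to `cts.cluster`.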

Example


cts.cluster(
    cts.search(cts.wordQuery("steroids")).toArray(),
    {
        algorithm: "lsi",
        hierarchicalLevels: 3,
        minClusters: 2,
        maxClusters: 12,
        overlapping: false,
        labelIgnoreWords: ["of", "the", "on", "in", "at", "a", "an", "for", "from", "by", "and"],
        stemmedSearches: "advanced",
        fastPhraseSearches: true,
        fastElementWordSearches: true,
        fastElementPhraseSearches: true
    }
);
=>
{
  "clusters":[
    {
      "id":"4904706095739760677",
      "label":"neonate, cortisol, fetal",
      "nodes":[3,4,7,9,14]
    },
    {
      "id":"741204961292539384",
      "label":"fetal, cortisol, being",
      "nodes":[8,15]
    },
    {
      "id":"9998437716377655230",
      "label":"locus, male, fetal",
      "nodes":[6]
    },
      ...
    {
      "id":"7956765932334497548",
      "parentId":"14551791662219883254",
      "label":"normal, endometrium, also",
      "nodes":[17]
    },
    {
      "id":"4427100138446341770",
      "parentId":"14551791662219883254",
      "label":"km, administration, do",
      "nodes":[12]
    }
  ],
  "options":{
    "algorithm":"lsi",
    "language":"en",
    "stemmedSearches":"advanced",
    "fastElementPhraseSearches":true,
    "fastElementWordSearches":true,
    "maxClusters":12,
    "minClusters":2,
    "hierarchicalLevels":3,
    "maxTerms":200,
    "labelMaxTerms":3,
    "labelIgnoreWords":[
      "a","an","and","at","by","for","from","in","of","on","the"],
    "labelIgnoreAttributes":false,
    "numTries":1,
    "score":"logtfidf",
    "useDbConfig":false,
    "details":false,
    "overlapping":false
  }
}
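
Because child clusters carry a `parentId`, the flat `clusters` array can be regrouped by parent when `hierarchicalLevels` is greater than 1. The following sketch runs on a plain object shaped like the output above. `groupByParent` and `groupsBelowMinimum` are hypothetical helper names, the sample data is abbreviated from the example, and the grouping assumes that sibling clusters share a `parentId` while top-level clusters have none.

```javascript
// Hypothetical helpers that operate on an object shaped like the
// cts.cluster output shown above. They do not call MarkLogic.

// Groups clusters by parentId. Top-level clusters (those without a
// parentId) are stored under the key "root".
function groupByParent(clusters) {
  const groups = new Map();
  for (const cluster of clusters) {
    const key = cluster.parentId === undefined ? "root" : cluster.parentId;
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(cluster);
  }
  return groups;
}

// Lists the groups that hold fewer clusters than options.minClusters,
// which the minClusters description says can happen when no satisfactory
// clustering is found at a level.
function groupsBelowMinimum(result) {
  const below = [];
  for (const [parent, siblings] of groupByParent(result.clusters)) {
    if (siblings.length < result.options.minClusters) {
      below.push({ parent: parent, count: siblings.length });
    }
  }
  return below;
}

// Abbreviated from the example output; the elided clusters are not filled in.
const sample = {
  clusters: [
    { id: "4904706095739760677", label: "neonate, cortisol, fetal", nodes: [3, 4, 7, 9, 14] },
    { id: "741204961292539384", label: "fetal, cortisol, being", nodes: [8, 15] },
    { id: "9998437716377655230", label: "locus, male, fetal", nodes: [6] },
    { id: "7956765932334497548", parentId: "14551791662219883254",
      label: "normal, endometrium, also", nodes: [17] },
    { id: "4427100138446341770", parentId: "14551791662219883254",
      label: "km, administration, do", nodes: [12] }
  ],
  options: { minClusters: 2 }
};

const groups = groupByParent(sample.clusters);
console.log(groups.get("root").length);                 // 3
console.log(groups.get("14551791662219883254").length); // 2
console.log(groupsBelowMinimum(sample));                // []
```

Keeping the identifiers as strings, as in the output, avoids precision loss, since the values exceed the range that JavaScript numbers represent exactly.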
