Search Developer's Guide — Chapter 27

Results Clustering Using cts:cluster

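MarkLogic Server includes cts:cluster, which uses statistical algorithms to find and label clusters of search results. This chapter describes cts:cluster and includes the following sections:

  • Understanding cts:cluster
  • Options to cts:cluster
  • Understanding the cts:cluster Output
  • Example that Creates an HTML Report of the Cluster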

For details about the signature, the parameter syntax, and more examples, see cts:cluster in the MarkLogic XQuery and XSLT Function Reference.

Understanding cts:cluster

The cts:cluster function takes a set of nodes, typically from a search result set (although it can be any set of nodes), and provides a report that categorizes the result nodes in clusters. A cluster is a subset of the results that are statistically similar. For each cluster, it generates a label from the most distinctive terms in that cluster.

The output is an XML node, and you can use the output to generate a user interface that displays the results. For sample output, see Understanding the cts:cluster Output.
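
As a minimal sketch of the call described above, the following query clusters the first 100 results of a word search using the default options. The search word "war" and the cutoff of 100 are assumptions chosen only for illustration.

```xquery
xquery version "1.0-ml";

(: Illustrative values: the word "war" and the cutoff of 100 nodes
   are assumptions, not recommendations from this guide. :)
let $nodes := cts:search(fn:doc(), cts:word-query("war"))[1 to 100]

(: With no options node, cts:cluster uses its default settings and
   returns a cts:clustering element. :)
return cts:cluster($nodes)
```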

The clusterer creates clusters by taking the nodes you pass into cts:cluster and running them through the MarkLogic Server indexer. This is very similar to what happens when you load a document into the database, except that the indexing for results clustering is done entirely in memory, whereas database indexes are stored on disk. Indexing produces terms, each with a frequency (the number of times it occurs in the document and in the result set). The set of terms you get depends on which index settings you use. The clusterer takes into account each of the terms, as well as information about the terms (for example, weights and term frequency), to calculate the clusters.

You pass options into cts:cluster that determine the behavior of the clusterer and specify the index settings to use when creating the clusters. For more information about the options, see Options to cts:cluster, as well as the API documentation for cts:cluster in the MarkLogic XQuery and XSLT Function Reference.

When deciding how to use the clusterer, think about what your requirements are. Many settings you choose in the clusterer are trade-offs between performance and the quality of the results clusters. You might need to experiment to find what works well for your application.

Note the following about the clusterer:

  • Every time you cluster, the indexer is run on the supplied nodes to generate the data.
  • The more nodes you send to cts:cluster, the longer it takes. For real-time analysis, more than a few thousand nodes is likely to take longer than a user is willing to wait. Between 100 and 1000 nodes is usually a good balance between performance and good results.
  • You can set <hierarchical-levels> to a value greater than 1 to generate clusters of clusters. The parent-id attribute on each child cluster identifies its parent. You can then iterate through the result set to create a user interface that shows the tree-like hierarchy.
  • The labels might change from run to run. Specifying a higher value of <num-tries> tends to make the labels more consistent from run to run, but increases the time it takes to produce the clusters.
  • The labels come from the most distinctive terms. Some terms (such as element terms) are turned into strings. If you want to see the terms used to create the labels, set the <details>true</details> option.
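
The hierarchy mentioned in the list above can be rendered along the following lines. This is a sketch under assumptions: it assumes that top-level clusters carry no parent-id attribute (as in the sample output later in this chapter), local:render is a hypothetical helper, and the search word and node count are illustrative.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: render one cluster and, recursively, its
   child clusters as nested list items. :)
declare function local:render(
  $cluster as element(cts:cluster),
  $all as element(cts:cluster)*
) as element(li)
{
  <li>{
    fn:concat(fn:string($cluster/@label), " (",
              fn:string($cluster/@count), ")"),
    (: children are the clusters whose parent-id matches this id :)
    let $children := $all[@parent-id = $cluster/@id]
    where fn:exists($children)
    return <ul>{ for $c in $children return local:render($c, $all) }</ul>
  }</li>
};

(: Illustrative input: the search word and cutoff are assumptions. :)
let $nodes := cts:search(fn:doc(), cts:word-query("war"))[1 to 100]
let $options :=
  <options xmlns="cts:cluster">
    <hierarchical-levels>2</hierarchical-levels>
  </options>
let $clusters := cts:cluster($nodes, $options)/cts:cluster
return
  <ul>{
    (: start from clusters that have no parent :)
    for $top in $clusters[fn:empty(@parent-id)]
    return local:render($top, $clusters)
  }</ul>
```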

Options to cts:cluster

You can set options to cts:cluster in an options node. You can set the following types of options:

  • Clustering (cts:cluster) Options
  • Indexing (db:) Options

Each of these types of options is in its own namespace.

Clustering (cts:cluster) Options

The clustering options are in the cts:cluster namespace. These options determine the output and the behavior of the clusterer. Note the following about the clusterer options:

  • When tuning the options, try to balance performance, accuracy, and quality of the results.
  • The <details> option returns the distinctive terms (these are cts terms) used for each cluster. If it makes sense for your application, you can construct your own labels from them: generate a cts:query constructor from each term, then run those queries against some of your data to produce the labels.
  • The <algorithm> option sets the algorithm MarkLogic Server uses to calculate the clusters: k-means or lsi. Both are statistical algorithms and have well-known and published papers describing them (to learn more, you can start here: http://en.wikipedia.org/wiki/K-means_clustering and http://en.wikipedia.org/wiki/Latent_semantic_indexing). The default is k-means, which tends to be slightly faster, but gives slightly less stable results than lsi.
  • You can control the number of clusters using the <min-clusters> and <max-clusters> settings. It is possible for cts:cluster to return fewer clusters than the value of <min-clusters> if your data does not support that many clusters.
  • The <num-tries> option specifies the number of times to run the clusterer against the specified data. The default is 1. Because of the way the algorithms work, running the clusterer multiple times increases the number of terms and tends to improve the accuracy of the clusters. It does so at the cost of performance, because each run adds more work.
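
The options in the list above might be combined as in the following sketch. The option names are the ones described in this section; the values, the search word, and the node count are assumptions chosen for illustration and would need tuning for real data.

```xquery
xquery version "1.0-ml";

(: Illustrative input: the search word and cutoff are assumptions. :)
let $nodes := cts:search(fn:doc(), cts:word-query("war"))[1 to 100]

let $options :=
  <options xmlns="cts:cluster">
    <!-- lsi instead of the default k-means -->
    <algorithm>lsi</algorithm>
    <!-- bounds on the number of clusters; fewer may be returned -->
    <min-clusters>3</min-clusters>
    <max-clusters>8</max-clusters>
    <!-- more tries: steadier labels, longer run time -->
    <num-tries>3</num-tries>
    <!-- include the distinctive terms behind each label -->
    <details>true</details>
  </options>

return cts:cluster($nodes, $options)
```

Comparing xdmp:elapsed-time() across runs with different <num-tries> and <algorithm> values is one way to see the trade-off between speed and cluster quality on your own data.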

Indexing (db:) Options

The indexing options control which terms are created. MarkLogic Server uses these terms to calculate the clusters, based on term frequency, distinctive terms, and other factors relating to relevancy. Note the following about the db options:

  • They are set in the options node, and are in the http://marklogic.com/xdmp/database namespace.
  • The cts:cluster database options are the same as the database options for cts:distinctive-terms.
  • You can construct the options by hand or use the Admin API to construct the options.
  • Fields are a good way of indexing only the words you are interested in, and they allow you to set weights for certain elements. For details on how fields work, see Fields Database Settings in the Administrator's Guide.
  • The <use-db-config> cts:cluster option (in the cts:cluster namespace) takes the combination of the database options set in the context database, the specified database options, and any default values for options. This can be a convenient way to set complicated options.
  • Iterate with different options to get the right mix of performance and term choices.
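
An options node that mixes the two namespaces might look like the following sketch. The two db: element names are assumed to mirror the corresponding database configuration settings (stemmed searches and fast phrase searches); check them, and their allowed values, against the cts:cluster reference before relying on them. The search word and node count are also illustrative.

```xquery
xquery version "1.0-ml";

(: Illustrative input: the search word and cutoff are assumptions. :)
let $nodes := cts:search(fn:doc(), cts:word-query("war"))[1 to 100]

let $options :=
  <options xmlns="cts:cluster"
           xmlns:db="http://marklogic.com/xdmp/database">
    <!-- clustering option, in the cts:cluster namespace -->
    <max-clusters>8</max-clusters>
    <!-- indexing options, in the database namespace; these element
         names are assumptions based on the database settings -->
    <db:stemmed-searches>basic</db:stemmed-searches>
    <db:fast-phrase-searches>true</db:fast-phrase-searches>
  </options>

return cts:cluster($nodes, $options)
```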

Understanding the cts:cluster Output

The following shows sample cts:cluster output:

<clustering xmlns="http://marklogic.com/cts">
  <cluster id="15899142696064772767" label="law, his, hath" count="8" nodes="2 11 22 24 27 30 40 78"/>
  <cluster id="161987570467386344" label="earth, lose, hast" count="1" nodes="28"/>
  <cluster id="14947979602052601851" label="mark, most, talbot" count="91" nodes="1 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 23 25 26 29 31 32 33 34 35 36 37 38 39 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100"/>
  <cluster id="143845517505877166" parent-id="15899142696064772767" label="note, captain, antony" count="4" nodes="2 22 30 40"/>
  <cluster id="12625796822979427066" parent-id="15899142696064772767" label="king, from, so" count="4" nodes="11 24 27 78"/>
  <cluster id="9134217245415181471" parent-id="14947979602052601851" label="talbot, somerset, who" count="4" nodes="62 72 73 74"/>
  <cluster id="1248501351668626361" parent-id="14947979602052601851" label="pompey, wall, cleopatra" count="44" nodes="1 4 5 6 12 13 14 19 33 34 37 39 41 42 45 46 47 48 49 50 51 53 54 55 56 58 60 61 64 65 68 71 75 77 84 87 88 89 92 95 96 97 98 99"/>
  <cluster id="6447791006134911106" parent-id="14947979602052601851" label="our, voice, these" count="10" nodes="17 29 59 69 79 80 91 93 94 100"/>
  <cluster id="7874080124275500326" parent-id="14947979602052601851" label="which, peace, blood" count="33" nodes="3 7 8 9 10 15 16 18 20 21 23 25 26 31 32 35 36 38 43 44 52 57 63 66 67 70 76 81 82 83 85 86 90"/>
  <options xmlns="cts:cluster" xmlns:db="http://marklogic.com/xdmp/database">
    <label-ignore-words>a as of s the when</label-ignore-words>
    ...
  </options>
</clustering>

The output is a cts:clustering element. The output includes each cluster, as well as the options node used to create it. You can use XQuery or XSLT to iterate through the output, creating a report (for example, in HTML) of the results.

The attributes on the <cluster> element describe the cluster. The following table describes the attributes on the <cluster> element:

cluster Attribute    Description
id                   A random number used to identify the cluster.
parent-id            The ID of the parent cluster, when <hierarchical-levels> is set to a value greater than 1.
label                The terms that comprise the label, comma separated. To make your own label, return the <details> and use the terms to generate a label.
count                The number of nodes in the cluster.
nodes                A set of NMTOKEN values, where each value lists the position of the node. The position is ordered by relevance, the first being the most relevant to the cluster and the last being the least relevant. The number refers to the position in the nodes input to cts:cluster. For example, a value of 10 indicates that it is the tenth node in the sequence passed into the first parameter of cts:cluster.
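
The following sketch shows one way to map the node positions back to the input sequence. It reads the nodes attribute as a space-separated string and converts each token to an integer; the search word, the node count, and the cluster-summary output element are assumptions made for illustration.

```xquery
xquery version "1.0-ml";

(: Illustrative input: the search word and cutoff are assumptions. :)
let $nodes := cts:search(fn:doc(), cts:word-query("war"))[1 to 100]

for $cluster in cts:cluster($nodes)/cts:cluster
(: positions into $nodes, most relevant to the cluster first :)
let $positions :=
  for $tok in fn:tokenize(
                fn:normalize-space(fn:string($cluster/@nodes)), " ")
  return xs:integer($tok)
(: the input node that is most relevant to this cluster :)
let $best := $nodes[$positions[1]]
return
  <cluster-summary label="{ fn:string($cluster/@label) }"
                   size="{ fn:string($cluster/@count) }">
    <most-relevant>{
      for $n in $best return xdmp:node-uri($n)
    }</most-relevant>
  </cluster-summary>
```

Converting each token with xs:integer matters here: a numeric predicate selects by position, whereas a string-valued predicate would be treated as a boolean test.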

Example that Creates an HTML Report of the Cluster

The following example creates an HTML report of the cluster. It uses the Shakespeare plays database. To see the results, copy the example and run it against a database that contains the Shakespeare plays (change the directory URI used in the cts:search to the URI of the database directory in which you have loaded the plays).

xquery version "1.0-ml" ;

(: cluster the Shakespeare speeches, disregarding the speaker, 
   and show the results in an html table :)

declare namespace db="http://marklogic.com/xdmp/database" ;
declare namespace cl="cts:cluster" ;
declare namespace dt="cts:distinctive-terms" ;

(: generally we want to cluster the top N results, where N is 
   around 100 to 1,000 (smaller numbers for best performance).
   all speeches = 31,029; 
   speeches that contain "love" = 1,864; 
   "war" = 359; "joy" = 201; 
   "beast" = 94 :)
let $search-term := xdmp:get-request-field("search-term", "aunt")
let $max-terms := xdmp:get-request-field("max-terms", "100")   
let $use-db-config := 
  xdmp:get-request-field("use-db-config", "false")   
let $algorithm := xdmp:get-request-field("algorithm", "k-means") 
let $options-node :=
   <options xmlns="cts:cluster" >
      <label-ignore-words>a of the when s as</label-ignore-words>
      <algorithm>{ $algorithm }</algorithm>
      <max-terms>{ $max-terms }</max-terms>
      <use-db-config>{ $use-db-config }</use-db-config>
      <!-- add database (db:) indexing options and a field definition 
           here to control which terms are created; for example, turn 
           all database-level indexing options OFF and only use field 
           terms (stem and phrase terms, plus fast-element-word-searches 
           and fast-element-phrase-searches if the XML is richer) -->
   </options>

(: build the page :)
let $page :=
<html>
<head><title>Example - clustering - speeches</title></head>
<body>
<table border="1" cellpadding="1" cellspacing="1">
{
let $things-to-cluster := 
  cts:search(
     (: specify the directory in which you have loaded the plays :)
     xdmp:directory( "/shakespeare/plays/" )//SPEECH, 
     cts:word-query( $search-term ) )
(: iterate through the cts:cluster results node :)
for $cluster in 
    cts:cluster( $things-to-cluster, $options-node )/cts:cluster
return (
  (: one header row per cluster: label and node count :)
  <tr>
     <td>{ fn:data( $cluster/@label ) }</td>
     <td>{ fn:data( $cluster/@count ) }</td>
  </tr>,
  (: @nodes lists positions in $things-to-cluster, space separated :)
  for $clustered-node-ref in 
      fn:tokenize( fn:normalize-space( fn:string( $cluster/@nodes ) ), " " )
  return
     <tr><td colspan="2">{ fn:string-join( 
       $things-to-cluster[xs:integer( $clustered-node-ref )]//SPEAKER, 
       ", " ) 
     }</td></tr>
)
}
</table>
</body>
</html>

return ( xdmp:set-response-content-type("text/html"), 
 $page, xdmp:elapsed-time() )