MarkLogic Server includes cts:cluster, which uses statistical algorithms to find and label clusters of search results. This chapter describes cts:cluster.
For details about the signature, the parameter syntax, and more examples, see cts:cluster in the MarkLogic XQuery and XSLT Function Reference.
The cts:cluster function takes a set of nodes, typically from a search result set (although it can be any set of nodes), and provides a report that categorizes the result nodes in clusters. A cluster is a subset of the results that are statistically similar. For each cluster, it generates a label from the most distinctive terms in that cluster.
The output is an XML node, and you can use the output to generate a user interface that displays the results. For sample output, see Understanding the cts:cluster Output.
The clusterer creates clusters by taking the nodes you pass into cts:cluster and running them through the MarkLogic Server indexer. This is very similar to what happens when you load a document into the database, except that the indexing for results clustering is done entirely in memory, whereas database indexes are stored on disk. Indexing produces terms, each with a frequency (the number of times it occurs in the document and in the result set). Different index settings produce different sets of terms. The clusterer takes into account each of the terms, as well as information about the terms (for example, weights and term frequency), to calculate the clusters.
You pass options into cts:cluster that determine the behavior of the cluster as well as specify the index settings to use when creating the clusters. For more information about the options, see Options to cts:cluster, as well as the API documentation for cts:cluster in the MarkLogic XQuery and XSLT Function Reference.
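As a minimal sketch of the call described above, the following clusters the first 100 results of a word search. The search term ("peace"), the use of fn:doc() as the search scope, and the option values are illustrative assumptions, not recommendations; substitute whatever suits your data.

```xquery
xquery version "1.0-ml";

(: Assumption: the database contains documents that match the word "peace".
   The term and the limit of 100 nodes are placeholders. :)
let $nodes :=
  fn:subsequence(cts:search(fn:doc(), cts:word-query("peace")), 1, 100)

(: Clustering options live in the cts:cluster namespace. :)
let $options :=
  <options xmlns="cts:cluster">
    <max-clusters>5</max-clusters>
    <label-max-terms>3</label-max-terms>
  </options>

(: Returns a cts:clustering element describing the clusters. :)
return cts:cluster($nodes, $options)
```

Limiting the input with fn:subsequence keeps the in-memory indexing work bounded, which matters because every node passed in is indexed on each call.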
When deciding how to use the clusterer, think about what your requirements are. Many settings you choose in the clusterer are trade-offs between performance and the quality of the results clusters. You might need to experiment to find what works well for your application.
Note the following about the clusterer:
- Set <hierarchical-levels> to a value greater than 1 to generate clusters of clusters. The parent-id attribute on a cluster tells you which cluster is its parent. You can then iterate through the result set to create a user interface that shows the tree-like hierarchy.
- Increasing <num-tries> tends to make the labels more consistent from run to run, but increases the time it takes to produce the clusters.
- To see the distinctive terms used for each cluster, set the <details>true</details> option.

You can set options to cts:cluster in an options node. You can set the following types of options:
- Clustering options
- Indexing (db) options
Each of these types of options is in its own namespace.
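To illustrate the two namespaces, here is a sketch of an options node that mixes both kinds of options. The option names and values are taken from the sample output later in this chapter; the search term and scope are assumptions for illustration only.

```xquery
xquery version "1.0-ml";

let $options :=
  <options xmlns="cts:cluster"
           xmlns:db="http://marklogic.com/xdmp/database">
    <!-- clustering options: unprefixed, in the cts:cluster namespace -->
    <algorithm>k-means</algorithm>
    <min-clusters>3</min-clusters>
    <max-clusters>10</max-clusters>
    <hierarchical-levels>2</hierarchical-levels>
    <num-tries>1</num-tries>
    <!-- indexing options: db prefix, in the database namespace -->
    <db:word-searches>true</db:word-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
  </options>

(: Assumption: "peace" matches some documents in the context database. :)
let $nodes :=
  fn:subsequence(cts:search(fn:doc(), cts:word-query("peace")), 1, 100)
return cts:cluster($nodes, $options)
```

Declaring cts:cluster as the default namespace on the options element keeps the clustering options unprefixed, so only the indexing options need the db prefix.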
The clustering options are in the cts:cluster namespace. These options determine the output and the behavior of the clusterer. Note the following about the clusterer options:
- The <details> option returns the distinctive terms (these are cts terms) used for each cluster. You can use these terms to construct your own labels by generating cts:query constructors from each term, and then running those queries against some of your data, if that makes sense for your application.
- The <algorithm> option sets the algorithm MarkLogic Server uses to calculate the clusters: k-means or lsi. Both are well-documented statistical algorithms (to learn more, you can start here: http://en.wikipedia.org/wiki/K-means_clustering and http://en.wikipedia.org/wiki/Latent_semantic_indexing). The default is k-means, which tends to be slightly faster but gives slightly less stable results than lsi.
- The number of clusters returned is controlled by the <min-clusters> and <max-clusters> settings. It is possible for cts:cluster to return fewer clusters than the number in <min-clusters> if the most it can calculate based on your data is less than that value.
- The <num-tries> option specifies the number of times to run the clusterer against the specified data. The default is 1. Because of the way the algorithms work, running the clusterer multiple times increases the number of terms and tends to improve the accuracy of the clusters. It does so at the cost of performance, because each run has to do more work.

The indexing options control which terms are created. MarkLogic Server uses these terms to calculate the clusters, based on term frequency, distinctive terms, and other factors relating to relevancy. Note the following about the db options:
- The db options are in the http://marklogic.com/xdmp/database namespace.
- The <use-db-config> cts:cluster option (in the cts:cluster namespace) takes the combination of the database options set in the context database, the specified database options, and any default values for options. This can be a convenient way to set complicated options.

The following shows sample cts:cluster output:
<clustering xmlns="http://marklogic.com/cts">
  <cluster id="15899142696064772767" label="law, his, hath" count="8" nodes="2 11 22 24 27 30 40 78"/>
  <cluster id="161987570467386344" label="earth, lose, hast" count="1" nodes="28"/>
  <cluster id="14947979602052601851" label="mark, most, talbot" count="91" nodes="1 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 23 25 26 29 31 32 33 34 35 36 37 38 39 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100"/>
  <cluster id="143845517505877166" parent-id="15899142696064772767" label="note, captain, antony" count="4" nodes="2 22 30 40"/>
  <cluster id="12625796822979427066" parent-id="15899142696064772767" label="king, from, so" count="4" nodes="11 24 27 78"/>
  <cluster id="9134217245415181471" parent-id="14947979602052601851" label="talbot, somerset, who" count="4" nodes="62 72 73 74"/>
  <cluster id="1248501351668626361" parent-id="14947979602052601851" label="pompey, wall, cleopatra" count="44" nodes="1 4 5 6 12 13 14 19 33 34 37 39 41 42 45 46 47 48 49 50 51 53 54 55 56 58 60 61 64 65 68 71 75 77 84 87 88 89 92 95 96 97 98 99"/>
  <cluster id="6447791006134911106" parent-id="14947979602052601851" label="our, voice, these" count="10" nodes="17 29 59 69 79 80 91 93 94 100"/>
  <cluster id="7874080124275500326" parent-id="14947979602052601851" label="which, peace, blood" count="33" nodes="3 7 8 9 10 15 16 18 20 21 23 25 26 31 32 35 36 38 43 44 52 57 63 66 67 70 76 81 82 83 85 86 90"/>
  <options xmlns="cts:cluster" xmlns:db="http://marklogic.com/xdmp/database">
    <algorithm>k-means</algorithm>
    <db:word-searches>true</db:word-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
    <db:fast-element-word-searches>true</db:fast-element-word-searches>
    <db:language>en</db:language>
    <max-clusters>10</max-clusters>
    <min-clusters>3</min-clusters>
    <hierarchical-levels>2</hierarchical-levels>
    <initialization>smart</initialization>
    <max-terms>200</max-terms>
    <label-max-terms>3</label-max-terms>
    <label-ignore-words>a as of s the when</label-ignore-words>
    <num-tries>1</num-tries>
    <score>logtfidf</score>
    <use-db-config>false</use-db-config>
    <details>false</details>
    <overlapping>false</overlapping>
  </options>
</clustering>
The output is a cts:clustering element. The output includes each cluster, as well as the options node used to create it. You can use XQuery or XSLT to iterate through the output, creating a report (for example, in HTML) of the results.
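One way to iterate through the output is to follow the parent-id attributes recursively and emit a nested HTML list. The sketch below assumes that top-level clusters carry no parent-id attribute, as in the sample output above; the function name local:tree, the search term, and the search scope are hypothetical choices made for illustration.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: renders the clusters whose parent is $parent-id
   (or the top-level clusters when $parent-id is empty) as a nested list. :)
declare function local:tree(
  $clustering as element(cts:clustering),
  $parent-id as xs:string?
) as element(ul)?
{
  let $children :=
    if (fn:empty($parent-id))
    then $clustering/cts:cluster[fn:empty(@parent-id)]
    else $clustering/cts:cluster[fn:string(@parent-id) eq $parent-id]
  where fn:exists($children)
  return
    <ul>{
      for $c in $children
      return
        <li>{
          fn:concat(fn:string($c/@label), " (", fn:string($c/@count), ")"),
          (: recurse into the clusters that name this one as their parent :)
          local:tree($clustering, fn:string($c/@id))
        }</li>
    }</ul>
};

(: Assumption: "peace" matches some documents in the context database. :)
let $nodes :=
  fn:subsequence(cts:search(fn:doc(), cts:word-query("peace")), 1, 100)
let $options :=
  <options xmlns="cts:cluster">
    <hierarchical-levels>2</hierarchical-levels>
  </options>
return local:tree(cts:cluster($nodes, $options), ())
```

Comparing the attribute values as strings avoids depending on how the id and parent-id attributes are typed.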
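The attributes on the <cluster> element describe the cluster. As the sample output shows, the <cluster> element has the following attributes:
- id: An identifier for the cluster.
- parent-id: The id of the parent cluster. This attribute appears only on clusters below the top level of a hierarchy.
- label: The label generated from the most distinctive terms in the cluster.
- count: The number of nodes in the cluster.
- nodes: A space-separated list of the positions, in the sequence passed to cts:cluster, of the nodes in the cluster.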
The following example creates an HTML report of the cluster. It uses the Shakespeare plays database. To see the results, cut and paste the example and run it against a database that contains the Shakespeare plays (modify the URI of the directory used in the cts:search to the URI of the database directory in which you have loaded the Shakespeare plays).
xquery version "1.0-ml";

(: cluster the Shakespeare speeches, disregarding the speaker,
   and show the results in an html table :)

declare namespace db="http://marklogic.com/xdmp/database";
declare namespace cl="cts:cluster";
declare namespace dt="cts:distinctive-terms";

(: generally we want to cluster the top N results, where N is around
   100 to 1,000 (smaller numbers for best performance).
   all speeches = 31,029; speeches that contain "love" = 1,864;
   "war" = 359; "joy" = 201; "beast" = 94; "aunt"=24 :)
let $search-term := xdmp:get-request-field("search-term", "aunt")
let $max-terms := xdmp:get-request-field("max-terms", "100")
let $use-db-config := xdmp:get-request-field("use-db-config", "false")
let $algorithm := xdmp:get-request-field("algorithm", "k-means")
let $options-node :=
  <options xmlns="cts:cluster">
    <hierarchical-levels>5</hierarchical-levels>
    <overlapping>false</overlapping>
    <label-max-terms>1</label-max-terms>
    <label-ignore-words>a of the when s as</label-ignore-words>
    <max-clusters>10</max-clusters>
    <algorithm>{ $algorithm }</algorithm>
    <!-- turn all database-level indexing options OFF - only use field terms -->
    <db:word-searches>false</db:word-searches>
    <db:stemmed-searches>false</db:stemmed-searches>
    <db:fast-case-sensitive-searches>false</db:fast-case-sensitive-searches>
    <db:fast-diacritic-sensitive-searches>false</db:fast-diacritic-sensitive-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
    <db:phrase-throughs/>
    <db:phrase-arounds/>
    <db:fast-element-word-searches>false</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>false</db:fast-element-phrase-searches>
    <db:element-word-query-throughs/>
    <db:fast-element-character-searches>false</db:fast-element-character-searches>
    <db:range-element-indexes/>
    <db:range-element-attribute-indexes/>
    <db:one-character-searches>false</db:one-character-searches>
    <db:two-character-searches>false</db:two-character-searches>
    <db:three-character-searches>false</db:three-character-searches>
    <db:trailing-wildcard-searches>false</db:trailing-wildcard-searches>
    <db:fast-element-trailing-wildcard-searches>false</db:fast-element-trailing-wildcard-searches>
    <db:fields>
      <field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xmlns="http://marklogic.com/xdmp/database">
        <field-name>speeches</field-name>
        <include-root>false</include-root>
        <word-lexicons/>
        <!-- create stem and phrase terms for this field -->
        <!-- if the XML were richer, we would have used
             fast-element-word-searches and
             fast-element-phrase-searches too -->
        <stemmed-searches>advanced</stemmed-searches>
        <db:fast-phrase-searches>true</db:fast-phrase-searches>
        <included-elements>
          <included-element>
            <namespace-uri/>
            <localname>LINE</localname>
            <weight>1.0</weight>
            <attribute-namespace-uri/>
            <attribute-localname/>
            <attribute-value/>
          </included-element>
          <included-element>
            <namespace-uri/>
            <localname>SPEECH</localname>
            <weight>1.0</weight>
            <attribute-namespace-uri/>
            <attribute-localname/>
            <attribute-value/>
          </included-element>
        </included-elements>
        <excluded-elements>
          <excluded-element>
            <namespace-uri/>
            <localname>SPEAKER</localname>
          </excluded-element>
        </excluded-elements>
      </field>
    </db:fields>
  </options>

(: build the page :)
let $page :=
  <html>
    <head><title>Example - clustering - speeches</title></head>
    <body>
      <table border="1" cellpadding="1" cellspacing="1">
        <tr>
          <th>Label</th>
          <th>Count</th>
          <th>Speakers</th>
        </tr>
        {
          let $things-to-cluster :=
            cts:search(
              (: specify the directory in which you have loaded the plays :)
              xdmp:directory( "/shakespeare/plays/" )//SPEECH,
              $search-term )
          (: iterate through the cts:cluster results node :)
          for $cluster in
            cts:cluster( $things-to-cluster, $options-node )/cts:cluster
          return
            <tr>
              <td>{ fn:data( $cluster/@label ) }</td>
              <td>{ fn:data( $cluster/@count ) }</td>
              <td>
                <table>{
                  for $clustered-node-ref in fn:data( $cluster/@nodes )
                  return
                    <tr><td>{
                      fn:string(
                        $things-to-cluster[$clustered-node-ref]//SPEAKER )
                    }</td></tr>
                }</table>
              </td>
            </tr>
        }
      </table>
    </body>
  </html>
return (
  xdmp:set-response-content-type("text/html"),
  $page,
  xdmp:elapsed-time()
)