MarkLogic Server includes cts:similar-query and cts:distinctive-terms. With these search APIs, you can find what is distinctive, from a search perspective, about a set of nodes (typically nodes from search results). This chapter describes cts:similar-query and cts:distinctive-terms.
You can use cts:similar-query to find nodes that are similar, from a search perspective, to the model nodes that you pass into the first parameter. The cts:similar-query constructor is a cts:query constructor, and you can combine it with other cts:query constructors as described in Composing cts:query Expressions.
Other cts:query constructors look in the indexes to find the terms that match the query. Instead, cts:similar-query takes the nodes passed in, runs them through an indexing process, and returns a cts:query that would match the model nodes with a high degree of relevance. You can pass various index and score options into cts:similar-query to influence the cts:query that it produces.
The query that it generates finds distinctive terms of the model nodes based on the other documents in the database.
If you want to find the terms that cts:similar-query uses to generate its cts:query, you can use cts:distinctive-terms. The output of cts:distinctive-terms is a cts:class element with several cts:term children. Each cts:term element contains a cts:query constructor, representing a term. Each cts:term element also contains scores and confidence for that term. MarkLogic Server uses these scores in calculating relevance.
You can pass many different options into cts:distinctive-terms to control which terms it generates. The database options control which terms will be most relevant to the model nodes, and therefore affect the cts:distinctive-terms output. If you take an iterative approach, you can try different indexing options to see which ones give the best results for your model nodes.
The terms generated are distinctive relative to the other documents in the database. Therefore, you will get much better results running this against a sizable database.
The following shows a simple cts:distinctive-terms query with its output:
let $node := doc("/shakespeare/plays/hamlet.xml")
return
cts:distinctive-terms($node,
  <options xmlns="cts:distinctive-terms"
           xmlns:db="http://marklogic.com/xdmp/database">
    <use-db-config>false</use-db-config>
    <max-terms>3</max-terms>
    <db:word-searches>false</db:word-searches>
    <db:stemmed-searches>basic</db:stemmed-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
    <db:fast-element-word-searches>false</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>false</db:fast-element-phrase-searches>
  </options>)
=>
<cts:class name="dterms /shakespeare/plays/hamlet.xml" offset="0"
           xmlns:cts="http://marklogic.com/cts">
  <cts:term id="7783238741996929314" val="981" score="981"
            confidence="0.811494" fitness="1">
    <cts:word-query>
      <cts:text xml:lang="en">guildenstern</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:word-query>
  </cts:term>
  <cts:term id="4731147985682913359" val="956" score="956"
            confidence="0.801087" fitness="1">
    <cts:word-query>
      <cts:text xml:lang="en">polonius</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:word-query>
  </cts:term>
  <cts:term id="1100490632300558572" val="949" score="949"
            confidence="0.798149" fitness="1">
    <cts:word-query>
      <cts:text xml:lang="en">horatio</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:word-query>
  </cts:term>
</cts:class>
The output is a cts:class element, and each child is a cts:term element. The cts:term elements represent terms in a database, identified by a cts:query. Each term has numbers for val, score, confidence, and fitness.
The val and score attributes are values that approximate the score contribution of that term. The confidence attribute represents the cts:confidence value for the term. The fitness attribute represents the cts:fitness value for the term. For details on score, fitness, and confidence, see Relevance Scores: Understanding and Customizing.
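If you process the cts:class output outside of MarkLogic, the attributes can be read with any namespace-aware XML parser. The following is a minimal client-side sketch in Python, not part of the MarkLogic API. It assumes the output has already been serialized to a string; the SAMPLE constant is an abridged copy of the output shown earlier, and the read_terms helper is a hypothetical name used only for illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample cts:distinctive-terms output (two terms,
# with most cts:option children left out for brevity).
SAMPLE = """<cts:class name="dterms /shakespeare/plays/hamlet.xml" offset="0"
    xmlns:cts="http://marklogic.com/cts">
  <cts:term id="7783238741996929314" val="981" score="981"
      confidence="0.811494" fitness="1">
    <cts:word-query>
      <cts:text xml:lang="en">guildenstern</cts:text>
      <cts:option>stemmed</cts:option>
    </cts:word-query>
  </cts:term>
  <cts:term id="4731147985682913359" val="956" score="956"
      confidence="0.801087" fitness="1">
    <cts:word-query>
      <cts:text xml:lang="en">polonius</cts:text>
      <cts:option>stemmed</cts:option>
    </cts:word-query>
  </cts:term>
</cts:class>"""

# Namespace binding used in the output above.
NS = {"cts": "http://marklogic.com/cts"}


def read_terms(xml_text):
    """Return one dict per cts:term that wraps a cts:word-query.

    Hypothetical helper: terms identified by other query types
    (for example cts:element-word-query) are skipped.
    """
    root = ET.fromstring(xml_text)
    terms = []
    for term in root.findall("cts:term", NS):
        text = term.find("cts:word-query/cts:text", NS)
        if text is None:
            continue  # not a plain word-query term
        terms.append({
            "word": text.text,
            "val": int(term.get("val")),
            "score": int(term.get("score")),
            "confidence": float(term.get("confidence")),
            "fitness": float(term.get("fitness")),
        })
    return terms


if __name__ == "__main__":
    for t in read_terms(SAMPLE):
        print(t["word"], t["val"], t["confidence"], t["fitness"])
```

The sketch keeps val and score as integers and confidence and fitness as floats, matching how the values appear in the sample output.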
The previous query only considers word-query terms. You can also have cts:element-word-query terms and cts:near-query terms for terms that are within an element or that are a word pair (a cts:near-query with a distance of 1). To see some of these kinds of terms, try running a query like the following:
let $node := doc("/shakespeare/plays/hamlet.xml")
return
cts:distinctive-terms($node,
  <options xmlns="cts:distinctive-terms"
           xmlns:db="http://marklogic.com/xdmp/database">
    <use-db-config>false</use-db-config>
    <max-terms>100</max-terms>
    <db:word-searches>false</db:word-searches>
    <db:stemmed-searches>basic</db:stemmed-searches>
    <db:fast-phrase-searches>true</db:fast-phrase-searches>
    <db:fast-element-word-searches>true</db:fast-element-word-searches>
    <db:fast-element-phrase-searches>true</db:fast-element-phrase-searches>
  </options>)
This query enables the db:fast-element-word-searches and db:fast-element-phrase-searches options, which cause terms that are constrained to a particular element to appear in the output. Changing the database options passed to cts:distinctive-terms and comparing the output helps you understand two things: how the index options affect which terms are distinctive and, since cts:similar-query can use these same settings, how cts:similar-query decides whether a document is similar to the model nodes.
A tag cloud is a popular visualization that shows various terms, usually relevant to a search, with the more relevant ones in a larger or more colorful font. You can use cts:distinctive-terms to feed the data used to make a tag cloud. The basic design pattern is as follows:
Specify a max-terms size that is equal to the number of terms you want in your tag cloud.
The following sample code is a simplified example of this design pattern:
xquery version "1.0-ml";

let $hits :=
  let $terms :=
    let $node := doc("/shakespeare/plays/hamlet.xml")//LINE
    return
    cts:distinctive-terms($node,
      <options xmlns="cts:distinctive-terms"
               xmlns:db="http://marklogic.com/xdmp/database">
        <use-db-config>false</use-db-config>
        <max-terms>100</max-terms>
        <db:word-searches>false</db:word-searches>
        <db:stemmed-searches>basic</db:stemmed-searches>
        <db:fast-phrase-searches>false</db:fast-phrase-searches>
        <db:fast-element-word-searches>false</db:fast-element-word-searches>
        <db:fast-element-phrase-searches>false</db:fast-element-phrase-searches>
      </options>)//cts:term
  for $wq in $terms
  where $wq/cts:word-query
  return element word {
    attribute score { fn:round(($wq/@val div 20)) },
    $wq/cts:word-query/cts:text/string()
  }
return
<p>{
  for $hit in $hits
  order by $hit/string()
  return (
    <span style="{fn:concat("font-size: ", $hit/@score)}">{$hit/string()} </span>,
    " "
  )
}</p>
The above query returns HTML which, when displayed in a browser, shows the 100 most distinctive terms, with the most relevant terms in a larger font.
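The sizing step of the sample query (divide each val by 20, round, sort the words alphabetically, and emit one span per word) can also be sketched outside of XQuery. The following Python version is an illustration only, assuming the terms are already available as (word, val) pairs. The function names tag_cloud and xquery_round are hypothetical, and the divisor of 20 is taken from the sample query rather than from any recommended setting.

```python
import math
from html import escape


def xquery_round(x):
    """Round half toward positive infinity, as fn:round does.

    Python's built-in round() rounds halves to the nearest even
    integer, so it is not used here.
    """
    return math.floor(x + 0.5)


def tag_cloud(terms, divisor=20):
    """Build a <p> of <span> elements from (word, val) pairs.

    Words are sorted alphabetically and each font size is
    val / divisor, rounded. As in the sample query, the font-size
    value is written without a CSS unit.
    """
    spans = []
    for word, val in sorted(terms):
        size = xquery_round(val / divisor)
        spans.append(
            '<span style="font-size: %d">%s </span>' % (size, escape(word))
        )
    return "<p>" + " ".join(spans) + "</p>"


if __name__ == "__main__":
    # val values copied from the three-term sample output earlier
    # in this chapter.
    print(tag_cloud([("guildenstern", 981),
                     ("polonius", 956),
                     ("horatio", 949)]))
```

If the cloud renders with uniform text in a standards-mode browser, appending a unit such as px to the font-size value is one adjustment to try.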