MarkLogic 10 Product Documentation
cts:distinctive-terms

cts:distinctive-terms(
   $nodes as node()*,
   [$options as element()?]
) as element(cts:class)

Summary

Return the most "relevant" terms in the model nodes (that is, the terms with the highest scores).

Parameters

nodes

Some model nodes.

options

An XML representation of the options for defining which terms to generate and how to evaluate them. The options node must be in the cts:distinctive-terms namespace. The following is a sample options node:

    <options xmlns="cts:distinctive-terms">
      <max-terms>20</max-terms>
    </options>

The cts:distinctive-terms options (which are also valid for cts:similar-query, cts:train, and cts:cluster) include:

<max-terms>

An integer defining the maximum number of distinctive terms to list in the cts:distinctive-terms output. The default is 16.

<min-val>

A double specifying the minimum value a term can have and still be considered a distinctive term. The default is 0.

<min-weight>

A number specifying the minimum weighted term frequency a term can have and still be considered a distinctive term. In general this value will be either 0 (include unweighted terms) or 1 (don't include unweighted terms). The default is 1.

<score>

A string defining which scoring method to use in comparing the values of the terms. The default is logtfidf. See the description of scoring methods in the cts:search function for more details. Possible values are:

logtfidf: Compute scores using the logtfidf method.
logtf: Compute scores using the logtf method.
simple: Compute scores using the simple method.

<complete>

A boolean value indicating whether to return terms even if there is no query associated with them. The default is false.

<use-db-config>

The options below may be used to easily target a small set of terms. <use-db-config> is a boolean value indicating whether to use the currently configured database options as defaults (overriding the built-in ones listed below) when determining which terms to generate. The default is true. When it is false, any option below that is not explicitly specified takes its built-in default as listed, not the value from the database settings. Explicitly specified flags override defaults, whether built-in (listed below) or from the database configuration. Flags specified outside a field apply to all fields, unless a field has its own setting, which takes precedence. In other words, the settings form a hierarchy in which each more specific level overrides the less specific levels before it.

The options element also includes indexing options in the http://marklogic.com/xdmp/database namespace. These control which terms to use.

These database options include the following (shown here with a db prefix to denote the http://marklogic.com/xdmp/database namespace). The default given for each option is the value used if use-db-config is set to false:

<db:word-searches>: Include terms for the words in the node. The default is false.
<db:stemmed-searches>: Define whether to include terms for the stems in the node, and at what level of stemming: off, basic, advanced, or decompounding. The default is basic.
<db:word-positions>: Include terms for word positions in the node. The default is false.
<db:fast-case-sensitive-searches>: Include terms for case-sensitive variations of the words in the node. The default is false.
<db:fast-diacritic-sensitive-searches>: Include terms for diacritic-sensitive variations of the words in the node. The default is false.
<db:fast-phrase-searches>: Include terms for two-word phrases in the node. The default is true.
<db:phrase-throughs>: If phrase terms are included, include terms for phrases that cross the given elements. The default is to have no such elements. Any number can be passed in a single string, separated by spaces.
<db:phrase-arounds>: If phrase terms are included, include terms for phrases that skip over the given elements. The default is to have no such elements. Any number can be passed in a single string, separated by spaces.
<db:fast-element-word-searches>: Include terms for words in particular elements. The default is true.
<db:fast-element-phrase-searches>: Include terms for phrases in particular elements. The default is true.
<db:element-word-positions>: Include terms for element word positions in the node. The default is false.
<db:element-word-query-throughs>: Include terms for words in sub-elements of the given elements. The default is to have no such elements. Any number can be passed in a single string, separated by spaces.
<db:fast-element-character-searches>: Include terms for characters in particular elements. The default is false.
<db:range-element-indexes>: Include terms for data values in specific elements. The default is to have no such indexes.
<db:range-field-indexes>: Include terms for data values in specific fields. The default is to have no such indexes.
<db:range-element-attribute-indexes>: Include terms for data values in specific attributes. The default is to have no such indexes.
<db:one-character-searches>: Include terms for single characters. The default is false.
<db:two-character-searches>: Include terms for two-character sequences. The default is false.
<db:three-character-searches>: Include terms for three-character sequences. The default is false.
<db:trailing-wildcard-searches>: Include terms for trailing wildcards. The default is false.
<db:fast-element-trailing-wildcard-searches>: If trailing wildcard terms are included, include terms for trailing wildcards by element. The default is false.
<db:fields>: Include terms for the defined fields. The default is to have no fields.
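
The interaction between use-db-config and the db options can be illustrated with a short sketch. The particular option values below are illustrative assumptions, not recommendations; any db option not listed is expected to keep the built-in default given above, because use-db-config is false.

```xquery
xquery version "1.0-ml";

(: A minimal sketch: ignore the database configuration and request
   word terms explicitly. The inline <foo> node is sample data. :)
cts:distinctive-terms(
  <foo>hello there you</foo>,
  <options xmlns="cts:distinctive-terms"
           xmlns:db="http://marklogic.com/xdmp/database">
    <max-terms>10</max-terms>
    <score>logtf</score>
    <use-db-config>false</use-db-config>
    <db:word-searches>true</db:word-searches>
    <db:stemmed-searches>off</db:stemmed-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
  </options>)
```

Note that the options element itself is in the cts:distinctive-terms namespace, while the indexing options carry the db prefix bound to http://marklogic.com/xdmp/database.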

Usage Notes

Output Format

The output of the function is a cts:class element containing a sequence of cts:term elements. (This is the same as the weights form of a class for the SVM classifier; see cts:train.) Each cts:term element identifies the term ID as well as a score, confidence, and fitness measure for the term, in addition to a cts:query that corresponds to the term. The correspondence of terms to queries is not precise: queries typically make use of multiple terms, and not all terms correspond to a query. However, a search using the query given for a term will match the model node that gave rise to it.
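
The relationship between terms and queries described above can be sketched as follows. This is a minimal example, assuming a document named book.xml exists in the database (as in the example on this page) and that the serialized query is the first child element of each cts:term; terms with no associated query are skipped.

```xquery
xquery version "1.0-ml";

(: Assumption: "book.xml" is a model document stored in the database. :)
let $model := fn:doc("book.xml")
let $class := cts:distinctive-terms($model,
  <options xmlns="cts:distinctive-terms"><max-terms>5</max-terms></options>)
for $term in $class/cts:term
(: The child element of cts:term, when present, is the XML form of the query. :)
let $query-xml := $term/*[1]
where fn:exists($query-xml)
return
  (: cts:query() converts the XML form back into a cts:query value;
     xdmp:estimate() returns the estimated number of matching fragments. :)
  xdmp:estimate(cts:search(fn:doc(), cts:query($query-xml)))
```

Each returned number is an estimate of how many fragments match the query for one term. Because the model node comes from the database here, each estimate would be expected to be at least 1.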

Example

cts:distinctive-terms( fn:doc("book.xml"),
   <options xmlns="cts:distinctive-terms"><max-terms>3</max-terms></options> )
=>
<cts:class name="dterms book.xml" offset="0" xmlns:cts="http://marklogic.com/cts">
  <cts:term id="1230725848944963443" val="482" score="372" confidence="0.686441" fitness="0.781011">
    <cts:element-word-query>
      <cts:element>title</cts:element>
      <cts:text xml:lang="en">the</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:element-word-query>
  </cts:term>
  <cts:term id="2859044029148442125" val="435" score="662" confidence="0.922555" fitness="0.971371">
    <cts:word-query>
      <cts:text xml:lang="en">text</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:word-query>
  </cts:term>
  <cts:term id="17835615465481541363" val="221" score="237" confidence="0.65647" fitness="0.781263">
    <cts:word-query>
      <cts:text xml:lang="en">of</cts:text>
      <cts:option>case-insensitive</cts:option>
      <cts:option>diacritic-insensitive</cts:option>
      <cts:option>stemmed</cts:option>
      <cts:option>unwildcarded</cts:option>
    </cts:word-query>
  </cts:term>
</cts:class>
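
The attributes on each cts:term in output like the above can be read with ordinary XPath. The following is a minimal sketch, again assuming book.xml exists in the database, that lists the terms from highest to lowest score.

```xquery
xquery version "1.0-ml";

(: Assumption: "book.xml" exists, as in the example above. :)
let $class := cts:distinctive-terms(fn:doc("book.xml"),
  <options xmlns="cts:distinctive-terms"><max-terms>3</max-terms></options>)
for $term in $class/cts:term
(: Attribute values are untyped, so cast before sorting numerically. :)
order by xs:double($term/@score) descending
return
  fn:concat($term/@id, ": score=", $term/@score,
            ", confidence=", $term/@confidence)
```

The cts prefix needs no declaration in the 1.0-ml dialect, since it is predefined there.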

Example

cts:distinctive-terms(//title,
    <options xmlns="cts:distinctive-terms">
      <use-db-config>true</use-db-config>
    </options>)

=> a cts:class element containing up to 16 of the most distinctive query terms (16 is the default for max-terms)

Example

cts:distinctive-terms(<foo>hello there you</foo>,
    <options xmlns="cts:distinctive-terms"
             xmlns:db="http://marklogic.com/xdmp/database">
            <db:word-positions>true</db:word-positions>
    </options>)

=> a cts:class element containing up to 16 of the most distinctive query terms (16 is the default for max-terms)
