MarkLogic 9 Product Documentation
cts:train

cts:train(
   $training-nodes as node()*,
   $labels as element(cts:label)*,
   [$options as (element()|map:map)?]
) as element(cts:classifier)?

Summary

Produces a set of classifiers from a list of labeled training documents.

Parameters

$training-nodes The sequence of training nodes. These are nodes that represent members of the classes.

$labels A sequence of labels for the training nodes, in the order corresponding to the training nodes.

$options

An XML representation of the options for defining the training parameters. The options node must be in the cts:train namespace. The following is a sample options node:

    <options xmlns="cts:train">
      <classifier-type>supports</classifier-type>
      <kernel>geodesic</kernel>
    </options>

The cts:train options include:

<classifier-type>

A string defining the kind of classifier to produce, either weights or supports. The default is weights.

<kernel>

A string defining which function to use for comparing documents. The default is sqrt. Normalization (the values that end in -normalized) brings document vectors into the unit sphere, which may improve the mathematical properties of the calculations. Possible values are:

simple: Model documents as 1 or 0 for presence or absence of each term.
simple-normalized: Like simple, but normalized by the square root of the document length.
sqrt: Model documents using the square root of the term frequencies.
sqrt-normalized: Like sqrt, but normalized by the sum of the term frequencies.
linear-normalized: Model documents as the term frequencies normalized by the square root of the sum of the squares of the term frequencies.
gaussian: Compare documents using the Gaussian of the term frequencies. Requires a classifier-type of supports.
geodesic: Compare documents using the Riemann geodesic distance over term frequencies. Requires a classifier-type of supports.
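
The arithmetic behind two of the document models above can be sketched in plain XQuery. This is only an illustration of the descriptions of sqrt and linear-normalized, not MarkLogic's internal implementation; the local: functions and the toy term frequencies are hypothetical.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper, not a MarkLogic API: the "sqrt" model,
   i.e. the square root of each term frequency. :)
declare function local:sqrt-model($tf as xs:double*) as xs:double*
{
  for $f in $tf return math:sqrt($f)
};

(: Hypothetical helper, not a MarkLogic API: the "linear-normalized" model,
   i.e. each term frequency divided by the square root of the sum of squares. :)
declare function local:linear-normalized($tf as xs:double*) as xs:double*
{
  let $norm := math:sqrt(fn:sum(for $f in $tf return $f * $f))
  return for $f in $tf return $f div $norm
};

(: Toy term frequencies for one document; a real document has many more terms. :)
let $tf := (3, 4)
return (local:sqrt-model($tf), local:linear-normalized($tf))
```

For the frequencies (3, 4) the linear-normalized vector is (0.6, 0.8), which has length 1; this is what "brings document vectors into the unit sphere" refers to.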

<max-terms>

An integer defining the maximum number of terms to use to represent each document. If a positive number M is given, then the M most discriminating terms are used; other terms are dropped. The default is 0 (unlimited), but for larger documents a value in the 500 to 1000 range will produce much better results.

<max-support>

A double specifying the maximum influence a single training node can have. This parameter has a strong influence on performance. The default value of 1.0 should work well in most cases. Larger values mean greater sensitivity and may improve accuracy on small datasets, but give longer running times. Smaller values mean less sensitivity and better resistance to mis-classified documents, and shorter running times.

<min-weight>

A double specifying the minimum weight a term can have and still be considered for inclusion in the term vector. This parameter only applies to the term weight form of the classifier. Smaller values mean longer term vectors and, as a consequence, longer running times and greater memory consumption during classification, but may also improve accuracy. The initial value may be adjusted downwards during training if a class would otherwise have no terms in its output vector. The default is 0.01.

<tolerance>

How close the final solutions to the constraint equations must be. Smaller values lead to a greater number of iterations and longer running times. Larger values lead to less precise classification. The default is 0.01.

<epsilon>

How close a value must be to 0 to be counted as equal to 0. Since double arithmetic is not precise, setting this value to exactly 0 will likely lead to non-convergence of the algorithm. Smaller values lead to a greater number of iterations and longer running times. Larger values lead to less precise classification. The initial value may be adjusted downwards during execution if it is too large to be useful. In general, the higher the dimensionality (larger documents, larger limits on the number of terms), the smaller this should be. The default is 0.01.

<max-iterations>

The maximum number of iterations of the constraint satisfaction algorithm to run. The algorithm usually converges very quickly, so this parameter usually has no effect unless it is set very low. The default is 500.

<thresholds>

A definition of the thresholds to use in classification. This is a complex element with one or more <threshold> children. You can specify both a default value and per-class values (as computed from cts:thresholds). The default value will apply to any classes for which a per-class value is not specified. For example:

    <options xmlns="cts:train">
      <thresholds>
        <threshold>-1.0</threshold>
        <threshold class="Example 1">-2.42</threshold>
      </thresholds>
    </options>

For the initial tuning phase of training your data, leave this parameter at its default value, which is a very large negative number (-1.0e30). This allows you to accurately compute the threshold values when you run cts:thresholds on the initial training data. You can then use the calculated threshold values when you run the secondary pass through the second part of your training data.
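
The tuning pass described above might look like the following sketch. It reuses the /shakespeare/plays/ directory and playtype property from the examples elsewhere on this page; the local:labels helper and the split of the second part at positions 20 to 37 are assumptions made for illustration.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: label each play with the playtype stored in its
   document properties. :)
declare function local:labels($docs as node()*) as element(cts:label)*
{
  for $x in $docs
  return
    <cts:label>
      <cts:class
        name="{xdmp:document-properties(xdmp:node-uri($x))//playtype/fn:string()}"/>
    </cts:label>
};

let $plays      := xdmp:directory("/shakespeare/plays/", "1")
let $firsthalf  := $plays[1 to 19]
let $secondhalf := $plays[20 to 37]   (: assumed size of the second part :)

(: Pass 1: train with the default thresholds (-1.0e30). :)
let $classifier := cts:train($firsthalf, local:labels($firsthalf))

(: Classify the held-out part, then compare against its known labels. :)
let $computed := cts:classify($secondhalf, $classifier)
return cts:thresholds($computed, local:labels($secondhalf))
```

The per-class values reported by cts:thresholds can then be supplied as <threshold class="..."> elements in the options of a later cts:train call.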

<use-db-config>

A boolean value indicating whether to use the current database configuration for determining which terms to use. The default is false, which means that only the indexing options in the options node will be used for calculating the classifier.

The options element also includes indexing options in the http://marklogic.com/xdmp/database namespace. These control which terms to use. Note that the use of certain options, such as fast-case-sensitive-searches, will not impact final results unless the term vector size is limited with the max-terms option. Other options, such as phrase-throughs, will only generate terms if some other option is also enabled (in this case fast-phrase-searches).

The database options are the same as the database options shown for cts:distinctive-terms.
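
As an illustration, an options node that mixes training options with database-namespace indexing options might look like the sketch below. The db:language element appears in the sample output on this page; the value format shown for db:fast-case-sensitive-searches is an assumption, so check it against the cts:distinctive-terms option list before relying on it.

```xquery
(: Limit each document to its 500 most discriminating terms and set
   indexing options explicitly instead of reading the database config. :)
<options xmlns="cts:train" xmlns:db="http://marklogic.com/xdmp/database">
  <classifier-type>weights</classifier-type>
  <kernel>sqrt-normalized</kernel>
  <max-terms>500</max-terms>
  <use-db-config>false</use-db-config>
  <db:language>en</db:language>
  <db:fast-case-sensitive-searches>true</db:fast-case-sensitive-searches>
</options>
```

Because max-terms is set here, an option such as fast-case-sensitive-searches can affect which terms survive the cut, as noted above.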

Usage Notes

The elements in the label sequence should match one for one with the nodes in the training node sequence. The first label element describes the first node in the training node sequence, the second label element describes the second node in the training node sequence, and so on. If there are more labels than training nodes or more training nodes than labels, an error is raised.

The format of each label element is:

  <cts:label name="Node1">
    <cts:class name="Example1"/>
    <cts:class name="Example2" val="-1"/>
        :   :
  </cts:label>

Each class listed indicates whether the corresponding node in the training sequence is in the given class. Examples are taken to be positive examples unless specified otherwise (with a val attribute of -1). The document is assumed to be a negative example of any classes that are not explicitly listed. The name attribute on the label element is an optional name for the labeled node. It is purely for human consumption to help in tuning the classification parameters.
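
A minimal sketch of building labels in this format and checking that they line up with the training nodes before training. The local:label helper, the in-memory sample nodes, and the error name are hypothetical and exist only for illustration.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: build one label with positive classes and
   explicitly negative classes (val="-1"). :)
declare function local:label(
  $name as xs:string,
  $positive as xs:string*,
  $negative as xs:string*
) as element(cts:label)
{
  <cts:label name="{$name}">{
    for $c in $positive return <cts:class name="{$c}"/>,
    for $c in $negative return <cts:class name="{$c}" val="-1"/>
  }</cts:label>
};

(: Toy training nodes; real training sets are usually database documents. :)
let $nodes := (<doc>alpha beta</doc>, <doc>gamma delta</doc>)
let $labels := (
  local:label("Node1", "Example1", "Example2"),
  local:label("Node2", "Example2", ())
)
return
  (: Labels and nodes must correspond one for one. :)
  if (fn:count($nodes) ne fn:count($labels))
  then fn:error(xs:QName("local:LABEL-MISMATCH"),
                "labels and training nodes differ in length")
  else $labels
```

Checking the counts up front gives a clearer failure than waiting for cts:train to raise its own error on mismatched sequences.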

Output Formats

A linear classifier is defined by a weight vector w on terms, and an offset value b. The <weights/> node encodes the weight vector directly. Its children are the classes, and each class includes a list of terms. The term node uses an internal id to identify the term and a term weight:

<weights>
  <class name="Example1" offset="2.04">
    <term id="43587329645324245" val="0.3423432"/>
    <term id="47893427895432534" val="-0.12345556"/>
      :                           :
  </class>
      :
</weights>

The weight vector w is a linear combination of the training documents themselves, and it may be more convenient to express the classifier in terms of those documents (the supports form). For instance, if the number of terms is not limited, the <weights/> node will be extremely large. The weights form cannot be used if the classifier kernel is non-linear, that is, with the gaussian or geodesic kernel.

The support vector representation of the classifier includes a <supports/> node that has <class/> children for each class. Here the class elements contain a list of doc elements, which identify the specific training nodes using an internal key. This internal key is valid across queries only for nodes in the database. It is strongly recommended that the training set for supports classifiers consist of whole documents only. Each doc element has a val attribute encoding the weight of that document and an err attribute showing how well the document fits the classifier. Large positive or negative errors (magnitude greater than about 1.5) indicate potentially mis-classified documents.

<supports>
  <class name="Example1" offset="2.04">
    <doc id="155584958759" name="Node102" val="-0.00334163" err="1.4"/>
    <doc id="594064848864" name="Node57" val="0.025341234" err="-2.3"/>
      :                             :
  </class>
      :
</supports>

Each class is identified by a unique name.
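
The rule of thumb about large errors can be applied mechanically to a supports classifier. The sketch below runs against a hand-written classifier built from the sample above; the local:suspects helper is hypothetical, and it assumes the supports, class, and doc elements are in the cts namespace, as children of the classifier element.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper, not a MarkLogic API: return the doc elements whose
   err attribute exceeds the cutoff in absolute value. :)
declare function local:suspects(
  $classifier as element(cts:classifier),
  $cutoff as xs:double
) as element(cts:doc)*
{
  $classifier/cts:supports/cts:class/cts:doc[fn:abs(xs:double(@err)) gt $cutoff]
};

let $classifier :=
  <classifier xmlns="http://marklogic.com/cts">
    <supports>
      <class name="Example1" offset="2.04">
        <doc id="155584958759" name="Node102" val="-0.00334163" err="1.4"/>
        <doc id="594064848864" name="Node57" val="0.025341234" err="-2.3"/>
      </class>
    </supports>
  </classifier>
for $doc in local:suspects($classifier, 1.5)
return fn:concat($doc/../@name, ": ", $doc/@name, " (err ", $doc/@err, ")")
```

With the sample data only Node57 is reported, since its error of -2.3 exceeds 1.5 in magnitude while Node102's 1.4 does not. Supplying the optional name attribute on each label makes such reports easier to read.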

Example

let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $labels := for $x in $firsthalf
  return
  <cts:label>
    <cts:class
        name="{xdmp:document-properties(xdmp:node-uri($x))//playtype/text()}"/>
  </cts:label>
return
cts:train($firsthalf, $labels,
       <options xmlns="cts:train">
         <classifier-type>supports</classifier-type>
       </options>)

  =>  
<classifier xmlns="http://marklogic.com/cts">
  <options xmlns="cts:train" xmlns:db="http://marklogic.com/xdmp/database">
    <kernel>sqrt</kernel>
    <classifier-type>supports</classifier-type>
    <db:language>en</db:language>
    <min-weight>0.01</min-weight>
    <max-terms>0</max-terms>
    <max-iterations>500</max-iterations>
    <max-support>1</max-support>
    <tolerance>0.01</tolerance>
    <epsilon>0.01</epsilon>
    <thresholds>
      <threshold>-1.0E30</threshold>
    </thresholds>
  </options>
  <supports>
    <class name="HISTORY" offset="1"/>
    <class name="COMEDY" offset="1"/>
    <class name="TRAGEDY" offset="1"/>
  </supports>
</classifier>

Example

xquery version "1.0-ml";

(:
   this is the same as the previous example, except it uses the
   use-db-config option
:)
let $firsthalf :=
  xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $labels := for $x in $firsthalf
  return
  <cts:label>
    <cts:class
        name="{xdmp:document-properties(xdmp:node-uri($x))
                //playtype/fn:string()}"/>
  </cts:label>
return
cts:train($firsthalf, $labels,
       <options xmlns="cts:train">
         <classifier-type>supports</classifier-type>
         <use-db-config>true</use-db-config>
       </options>)

  =>  <cts:classifier>...
