cts.train

cts.train(
   trainingNodes as Array,
   labels as Array,
   [options as Object?]
) as Object

Summary

Produces a set of classifiers from a list of labeled training documents.

Parameters

trainingNodes The array of training nodes. These are nodes that represent members of the classes.

labels An array of labels for the training nodes, in the order corresponding to the training nodes.

options

Options with which to customize this operation. Specify your options as a JavaScript object, with the option names as the object property names. The following is a sample options object:


    {
      "classifierType": "supports",
      "kernel": "geodesic"
    }

This function supports the following options:

classifierType

A string defining the kind of classifier to produce, either weights or supports. The default is weights.

kernel

A string defining which function to use for comparing documents. The default is sqrt. Normalization (the values that end in -normalized) brings document vectors into the unit sphere, which may improve the mathematical properties of the calculations. Possible values are:

simple: Model documents as 1 or 0 for presence or absence of each term.
simple-normalized: Like simple, but normalized by the square root of the document length.
sqrt: Model documents using the square root of the term frequencies.
sqrt-normalized: Like sqrt, but normalized by the sum of the term frequencies.
linear-normalized: Model documents as the term frequencies normalized by the square root of the sum of the squares of the term frequencies.
gaussian: Compare documents using the Gaussian of the term frequencies. Requires a classifierType of supports.
geodesic: Compare documents using the Riemann geodesic distance over term frequencies. Requires a classifierType of supports.
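To make the term-frequency models above concrete, the following sketch computes the `simple`, `sqrt`, and `linear-normalized` document vectors from a plain map of term frequencies. It is an illustration only: the function name `modelDocument` and the input shape are hypothetical, and this is not a description of how MarkLogic builds its internal term vectors.

```javascript
// Hypothetical helper: maps a {term: frequency} object to a document vector
// under three of the kernels described above. Illustration only.
function modelDocument(termFreqs, kernel) {
  const terms = Object.keys(termFreqs);
  const vector = {};
  if (kernel === "simple") {
    // 1 or 0 for presence or absence of each term
    for (const t of terms) vector[t] = termFreqs[t] > 0 ? 1 : 0;
  } else if (kernel === "sqrt") {
    // square root of each term frequency
    for (const t of terms) vector[t] = Math.sqrt(termFreqs[t]);
  } else if (kernel === "linear-normalized") {
    // term frequencies divided by the square root of the sum of their squares
    let sumSquares = 0;
    for (const t of terms) sumSquares += termFreqs[t] * termFreqs[t];
    const norm = Math.sqrt(sumSquares);
    for (const t of terms) vector[t] = norm === 0 ? 0 : termFreqs[t] / norm;
  } else {
    throw new Error("kernel not covered by this sketch: " + kernel);
  }
  return vector;
}

const freqs = { king: 9, crown: 4, jest: 0 };
const simpleVector = modelDocument(freqs, "simple");
const sqrtVector = modelDocument(freqs, "sqrt");
const unitVector = modelDocument(freqs, "linear-normalized");
```

The `linear-normalized` result has Euclidean length 1, which is what "brings document vectors into the unit sphere" refers to.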

maxTerms

An integer defining the maximum number of terms to use to represent each document. If a positive number M is given, then the M most discriminating terms are used; other terms are dropped. The default is 0 (unlimited), but for larger documents a value in the 500 to 1000 range will produce much better results.

maxSupport

A double specifying the maximum influence a single training node can have. This parameter has a strong influence on performance. The default value of 1.0 should work well in most cases. Larger values mean greater sensitivity and may improve accuracy on small datasets, but give longer running times. Smaller values mean less sensitivity, better resistance to misclassified documents, and shorter running times.

minWeight

A double specifying the minimum weight a term can have and still be considered for inclusion in the term vector. This parameter only applies to the term weight form of the classifier. Smaller values mean longer term vectors and, as a consequence, longer running times and greater memory consumption during classification, but may also improve accuracy. The initial value may be adjusted downwards during training if a class would otherwise have no terms in its output vector. The default is 0.01.

tolerance

How close the final solutions to the constraint equations must be. Smaller values lead to a greater number of iterations and longer running times. Larger values lead to less precise classification. The default is 0.01.

epsilon

How close a value must be to 0 to be counted as equal to 0. Since double arithmetic is not precise, setting this value to exactly 0 will likely lead to non-convergence of the algorithm. Smaller values lead to a greater number of iterations and longer running times. Larger values lead to less precise classification. The initial value may be adjusted downwards during execution if it is too large to be useful. In general the higher the dimensionality (larger documents, larger limits on the number of terms), the smaller this should be. The default is 0.01.

maxIterations

The maximum number of iterations of the constraint satisfaction algorithm to run. The algorithm usually converges very quickly, so this parameter usually has no effect unless it is set very low. The default is 500.

defaultThreshold, classThresholds

A definition of the thresholds to use in classification. You can specify both a default value and per-class values (as computed from cts.thresholds). The default value will apply to any classes for which a per-class value is not specified. For example:

    {
        ...
        defaultThreshold: -1.0,
        classThresholds: {"Example 1": -2.42, "Example 2": 0.41}
        ...
    }

For the initial tuning phase of training your data, leave defaultThreshold at its default value, which is a very large negative number (-1.0e30). This allows you to accurately compute the threshold values when you run cts.thresholds on the initial training data. You can then use the calculated threshold values when you run the secondary pass through the second part of your training data.
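The fallback rule described above (a per-class value wins, otherwise the default applies) can be sketched in plain JavaScript. The helper name `thresholdFor` is hypothetical and is not part of the cts API; it only illustrates how the two options relate.

```javascript
// Hypothetical helper: resolves the threshold that applies to a class.
// A value in classThresholds takes precedence; otherwise defaultThreshold
// is used, and -1.0e30 is the documented default for defaultThreshold.
function thresholdFor(className, options) {
  const perClass = options.classThresholds || {};
  if (Object.prototype.hasOwnProperty.call(perClass, className)) {
    return perClass[className];
  }
  return options.defaultThreshold !== undefined ? options.defaultThreshold : -1.0e30;
}

const thresholdOptions = {
  defaultThreshold: -1.0,
  classThresholds: { "Example 1": -2.42, "Example 2": 0.41 }
};

const explicitThreshold = thresholdFor("Example 1", thresholdOptions); // per-class value
const fallbackThreshold = thresholdFor("Example 3", thresholdOptions); // falls back to defaultThreshold
```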

useDbConfig

A boolean value indicating whether to use the current database configuration for determining which terms to use. The default is false, which means that only the indexing options in the options object will be used for calculating the classifier.

The options object also includes database indexing options. These control which terms to use. Note that the use of certain options, such as fastCaseSensitiveSearches, will not impact final results unless the term vector size is limited with the maxTerms option. Other options, such as phraseThroughs, will only generate terms if some other option is also enabled (in this case fastPhraseSearches).

The database options are the same as the database options shown for cts.distinctiveTerms.
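Several of the option descriptions above state constraints that can be checked before calling cts.train. The following is a minimal sketch of such a check; `checkTrainOptions` is a hypothetical helper, not part of the cts API, and it covers only the constraints spelled out in this section.

```javascript
// Hypothetical pre-flight check for an options object, based only on the
// constraints stated in the option descriptions above.
const KERNELS = ["simple", "simple-normalized", "sqrt", "sqrt-normalized",
                 "linear-normalized", "gaussian", "geodesic"];

function checkTrainOptions(options) {
  const problems = [];
  const classifierType = options.classifierType || "weights"; // documented default
  const kernel = options.kernel || "sqrt";                    // documented default

  if (classifierType !== "weights" && classifierType !== "supports") {
    problems.push("classifierType must be 'weights' or 'supports'");
  }
  if (KERNELS.indexOf(kernel) === -1) {
    problems.push("unknown kernel: " + kernel);
  }
  // The gaussian and geodesic kernels require a supports classifier.
  if ((kernel === "gaussian" || kernel === "geodesic") && classifierType !== "supports") {
    problems.push("kernel '" + kernel + "' requires classifierType 'supports'");
  }
  // An epsilon of exactly 0 will likely lead to non-convergence.
  if (options.epsilon === 0) {
    problems.push("epsilon of exactly 0 will likely prevent convergence");
  }
  return problems;
}

const okProblems = checkTrainOptions({ classifierType: "supports", kernel: "geodesic" });
const badProblems = checkTrainOptions({ kernel: "gaussian" });
```

An empty array means none of the checked constraints is violated; it does not guarantee that the server will accept the options.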

Usage Notes

The elements in the labels array should match one for one with the nodes in the training nodes array. The first label describes the first training node, the second label describes the second training node, and so on. If there are more labels than training nodes or more training nodes than labels, an error is raised.

The format of each label object is:

{
    "name": "apple doc",
    "classes": [
        {
            "name": "fruit class",
            "val": 1
        },
        {
            "name": "animal class",
            "val": -1
        }
    ]
}

Each class listed indicates whether the corresponding node in the training nodes array is in the given class. Examples are taken to be positive examples unless specified otherwise (with a val property of -1). The document is assumed to be a negative example of any classes that are not explicitly listed. The name property in the label object is an optional name for the labeled node. It is purely for human consumption, to help in tuning the classification parameters.
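As a sketch of how label objects in this format might be assembled, the following plain JavaScript builds one label and checks the one-for-one rule before training. Both `makeLabel` and `checkLabelCount` are hypothetical helpers introduced here for illustration; cts.train performs its own check and raises its own error.

```javascript
// Hypothetical helper: builds a label object in the format shown above.
// Positive classes get val 1; negative classes get val -1.
function makeLabel(name, positiveClasses, negativeClasses) {
  const classes = [];
  for (const c of positiveClasses) classes.push({ name: c, val: 1 });
  for (const c of negativeClasses || []) classes.push({ name: c, val: -1 });
  return { name: name, classes: classes };
}

// Hypothetical helper: mirrors the documented rule that labels and training
// nodes must match one for one.
function checkLabelCount(trainingNodes, labels) {
  if (trainingNodes.length !== labels.length) {
    throw new Error("expected " + trainingNodes.length +
                    " labels, got " + labels.length);
  }
}

const appleLabel = makeLabel("apple doc", ["fruit class"], ["animal class"]);
```

Because positive is the default, a label could also omit `val` for positive classes; the sketch writes it out to match the format shown above.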

Output Formats

A linear classifier is defined by a weight vector w on terms, and an offset value b. The weights property encodes the weight vector directly. It contains one object per class, and each class object includes a list of terms. Each term object has an id property, which is an internal identifier for the term, and a val property, which is the term weight:

"weights":[
  {
    "name":"animal class",
    "offset":0.9609375,
    "terms":[
      {
        "id":"3701029877487003077",
        "val":-0.132582515478134
      },
      {
        "id":"8051590956710175434",
        "val":0.353553384542465
      },
      :            :
    ]
  },
  :                :
]

The weight vector w is a linear combination of the documents themselves, and it may be more convenient to express the classifier in that form. For instance, if the number of terms is not limited, the weights property will be extremely large. The weight vector form cannot be used if the classifier kernel is non-linear, that is, with the gaussian or geodesic kernel.
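To illustrate what a weight vector and offset amount to, here is a minimal scoring sketch. It assumes the common convention score = (w . x) + b; the source does not state how the offset enters the score, so treat the sign as an assumption. The function `linearScore`, the term ids, and the document vector shape are all hypothetical, and real classification should be left to the server.

```javascript
// Hypothetical scoring sketch. ASSUMPTION: score = (w . x) + b, where b is
// the class "offset". The source does not state the sign convention.
function linearScore(classEntry, docVector) {
  let score = classEntry.offset;
  for (const term of classEntry.terms) {
    const x = docVector[term.id] || 0; // terms absent from the document contribute 0
    score += term.val * x;
  }
  return score;
}

// Made-up class entry in the shape of the weights output shown above.
const animalClass = {
  name: "animal class",
  offset: 0.5,
  terms: [
    { id: "t1", val: -0.25 },
    { id: "t2", val: 0.5 }
  ]
};

const exampleScore = linearScore(animalClass, { t2: 2 }); // 0.5 + 0.5 * 2
```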

The support vector representation of the classifier includes a supports property that contains one object per class. Here the class objects contain a list of doc objects, which identify the specific training nodes using an internal key. This internal key is valid across queries only for nodes in the database. It is strongly recommended that the training set for supports classifiers consist of whole documents only. Each doc object has a val property encoding the weight of that document and an err property showing how well the document fit the classifier. Large positive or negative errors (magnitude greater than about 1.5) indicate potentially misclassified documents.

"supports":[
  {
    "name":"animal",
    "offset":0.9609375,
    "docs":[
      {
        "id":"10529665449293922777",
        "name":"apple doc",
        "val":-0.3125,
        "err":0
      },
      {
        "id":"95824053707766723",
        "name":"banana doc",
        "val":-0.375,
        "err":0.0078125
      },
      :            :
    ]
  },
  :                :
]

Each class is identified by a unique name.
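The guidance on large errors can be turned into a simple review step. The sketch below walks a supports classifier and lists documents whose error magnitude exceeds a cutoff. `findSuspectDocs` is a hypothetical helper, and the third document in the sample data is invented to show a match.

```javascript
// Hypothetical helper: lists support documents whose error magnitude exceeds
// a cutoff. The default of 1.5 follows the "greater than about 1.5" guidance.
function findSuspectDocs(classifier, cutoff) {
  const limit = cutoff === undefined ? 1.5 : cutoff;
  const suspects = [];
  for (const cls of classifier.supports || []) {
    for (const doc of cls.docs) {
      if (Math.abs(doc.err) > limit) {
        suspects.push({ className: cls.name, id: doc.id, name: doc.name, err: doc.err });
      }
    }
  }
  return suspects;
}

const sampleClassifier = {
  supports: [{
    name: "animal",
    offset: 0.9609375,
    docs: [
      { id: "10529665449293922777", name: "apple doc", val: -0.3125, err: 0 },
      { id: "95824053707766723", name: "banana doc", val: -0.375, err: 0.0078125 },
      // Invented entry, not from real output, so the sketch has something to flag.
      { id: "hypothetical-id", name: "hypothetical outlier", val: 0.5, err: -1.75 }
    ]
  }]
};

const suspects = findSuspectDocs(sampleClassifier);
```

Documents flagged this way are candidates for a manual check of their labels, not confirmed errors.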

Example

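// Take the first 19 plays as the training set.
var firsthalf = fn.subsequence(xdmp.directory("/shakespeare/plays/", "1"), 1, 19);
// Keep a second copy of the sequence to pass to cts.train.
var plays = firsthalf.clone();
var labels = [];
// Build one label per play. The class name is read from the playtype
// element in the document's properties.
for (var x of firsthalf) {
  var singleClass = [{"name": fn.head(xdmp.documentProperties(xdmp.nodeUri(x)))
                                .xpath("//playtype/fn:string()")
                     }];
  labels.push({"classes": singleClass});
}
// Train a supports classifier with a smaller epsilon than the default.
cts.train(plays.toArray(), labels, 
          {"classifierType": "supports", 
           "epsilon": 0.00001}
         );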
  =>
{
  "options": {
    "kernel": "sqrt",
    "classifierType": "supports",
    "minWeight": 0.01,
    "maxTerms": 0,
    "maxIterations": 500,
    "maxSupport": 1,
    "tolerance": 0.01,
    "epsilon": 0.00001,
    "defaultThreshold": -1e+30,
    "classThresholds": {}
  },
  "supports": [
    {
      "name": "HISTORY",
      "offset": 0.679854154586792,
      "docs": [
        { "id": "12231438930115319131",
          "val": -0.0000109664215415251,
          "err": 0.00122268195264041
        },
        { "id": "15339507384182411064",
          "val": 0.0000208658457268029,
          "err": -0.00875759869813919
        },
          ...
      ]
    },
    {
      "name": "COMEDY",
      "offset": 0.502409636974335,
      "docs":
      [
        { "id": "12231438930115319131",
          "val": -0.0000158612419909332,
          "err": 0.000878061284311116
        },
        { "id": "17774930858870475928",
          "val": 0.0000244826205744175,
          "err": 0.00316164619289339
        },
          ...
      ]
    },
    {
      "name": "TRAGEDY",
      "offset": -0.179147496819496,
      "docs":
      [
        { "id": "8900580694384751574",
          "val": 0.0000163165386766195,
          "err": 0.00214929808862507
        },
        { "id": "12231438930115319131",
          "val": 0.000026724021154223,
          "err": 0.00388686032965779
        },
          ...
      ]
    }
  ]
}

Example

// This example is the same as the first, except that it uses the 
// useDbConfig option.

var firsthalf = fn.subsequence(xdmp.directory("/shakespeare/plays/", "1"), 1, 19);
var plays = firsthalf.clone();
var labels = [];
for (var x of firsthalf) {
  var singleClass = [{"name": fn.head(xdmp.documentProperties(xdmp.nodeUri(x))).
                      xpath("//playtype/fn:string()")
                     }];
  labels.push({"classes": singleClass});
}
cts.train(plays.toArray(), labels, 
          {"classifierType": "supports",
           "useDbConfig": true,
           "epsilon": 0.00001
          }
         );
=>
{
  "options": {
    "kernel": "sqrt",
    "classifierType": "supports",
    "minWeight": 0.01,
    "maxTerms": 0,
    "maxIterations": 500,
    "maxSupport": 1,
    "tolerance": 0.01,
    "epsilon": 0.00001,
    "defaultThreshold": -1e+30,
    "classThresholds": {
    },
    "useDbConfig": true
  },
  "supports": [
    {
      "name": "HISTORY",
      "offset": 0.616991937160492,
      "docs": [
        { "id": "11719886725627889310",
          "val": 0.000012535679161374,
          "err": 0.00515030510723591
        },
        { "id": "703569506516702025",
          "val": 0.0000126165068650153,
          "err": 3.86468634872017e-13
        },
          ...
      ]
    },
    {
      "name": "COMEDY",
      "offset": 0.444232106208801,
      "docs": [
        { "id": "347003984347788586",
          "val": -0.0000104659011412878,
          "err": -0.00548016233369708
        },
        { "id": "15822004215638450994",
          "val": 0.0000148163953781477,
          "err": -0.00175983365625143
        },
          ...
      ]
    },
    {
      "name": "TRAGEDY",
      "offset": -0.0621433705091477,
      "docs": [
        { "id": "347003984347788586",
          "val": 0.0000174711658473825,
          "err": 0.000306207628455013
        },
        { "id": "15822004215638450994",
          "val": -0.0000100835841294611,
          "err": -0.000551707693375647
        },
          ...
      ]
    }
  ]
}
