
cts.classify

cts.classify(
   $dataNodes as Array,
   $classifier as Object,
   [$options as Object?],
   [$trainingNodes as Array]
) as Array

Summary

Classifies an array of nodes based on training data. The training data is in the form of a classifier specification, which is generated from the output of cts.train. Returns labels for each of the input documents, in the same order as the input documents.

Parameters
$dataNodes The array of nodes to be classified.
$classifier An object containing the classifier specification. This is typically the output of cts.train, either run directly or saved in a JSON document in the database.
$options An options object. The options for classification are passed automatically from cts.train to the cts.classifier specification as part of the classifier object so that they are consistent with the parameters used in training. The following options may be passed separately to cts.classify. These options override the options present in the classifier item-by-item.

defaultThreshold, classThresholds

Definitions of the thresholds to use in classification. classThresholds specifies per-class values (as computed from cts.thresholds). defaultThreshold applies to any class for which a per-class value is not specified. For example:
    {
        ...
        defaultThreshold: -1.0,
        classThresholds: {"Example 1": -2.42, "Example 2": 0.41}
        ...
    }
    
$trainingNodes The array of training nodes used to train the classifier. Required if the supports form of the classifier is used; ignored if the weights form of the classifier is used.
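
As a rough illustration of the threshold options described under $options, the following plain JavaScript sketch mimics the documented lookup rule: a per-class value from classThresholds is used when present, otherwise defaultThreshold applies. The helper name effectiveThreshold is hypothetical and is not part of the MarkLogic API; the sketch does not call the server and says nothing about how the server implements the rule.

```javascript
// Illustrative sketch only; effectiveThreshold is a hypothetical helper,
// not a MarkLogic function.
var options = {
  defaultThreshold: -1.0,
  classThresholds: {"Example 1": -2.42, "Example 2": 0.41}
};

// Return the per-class threshold if one is specified,
// otherwise fall back to the default threshold.
function effectiveThreshold(opts, className) {
  var perClass = opts.classThresholds || {};
  if (Object.prototype.hasOwnProperty.call(perClass, className)) {
    return perClass[className];
  }
  return opts.defaultThreshold;
}

effectiveThreshold(options, "Example 1"); // -2.42 (per-class value)
effectiveThreshold(options, "Example 3"); // -1 (no per-class value, default applies)
```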

Usage Notes

cts.classify classifies an array of nodes using the output from cts.train. The dataNodes and classifier parameters are respectively the nodes to be classified and the specification output from cts.train. cts.classify can use either supports or weights forms of the classifier output from cts.train (see Output Formats). If the supports form is used, the training nodes must be passed as the 4th parameter. The options parameter is an options object.
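
The rule about the 4th parameter can be sketched as a small guard around the call. The helper below is hypothetical (buildClassifyArgs is not a MarkLogic function) and assumes the caller knows which classifierType was passed to cts.train; it only assembles the argument list and does not inspect the classifier object itself.

```javascript
// Hypothetical helper, not part of MarkLogic: assembles the arguments for
// cts.classify based on the classifier form chosen at training time.
function buildClassifyArgs(classifierType, dataNodes, classifier, options, trainingNodes) {
  if (classifierType === "supports") {
    // The supports form needs the original training nodes.
    if (!Array.isArray(trainingNodes) || trainingNodes.length === 0) {
      throw new Error("supports form requires the training nodes as the 4th parameter");
    }
    return [dataNodes, classifier, options || {}, trainingNodes];
  }
  // Weights form: training nodes are ignored, so they are left out.
  return [dataNodes, classifier, options || {}];
}

// Inside MarkLogic, the result might be used along these lines (untested assumption):
//   cts.classify.apply(null, buildClassifyArgs("supports", dataNodes, classifier, {}, trainingNodes));
```

Failing early in the helper is a design choice for the sketch; it makes a missing training array visible before the classification call is attempted.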

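The output is an array of label objects. Each label object has a classes property holding an array of objects with name and val properties, as shown in the example output below.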

Each label corresponds to the data node in the corresponding position in the input array. There will be an object for each class where the document passed the class threshold. The val property gives the class membership value for the data node in the given class. Values greater than zero indicate likely class membership; values less than zero indicate likely non-membership. Adjusting thresholds can give more or less selective classification. Increasing the threshold leads to a more selective classification (that is, decreases the likelihood of classification in the class). Decreasing the threshold gives less selective classification.
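
To make the effect of selectivity concrete, here is a minimal plain JavaScript sketch that filters an already returned label by a cutoff on val. This is client-side post-filtering, not the server's threshold mechanism, and classesAbove is a hypothetical helper name. The sample label reuses the values from the example output on this page.

```javascript
// A label shaped like one entry of the cts.classify output.
var label = {
  "classes": [
    { "name": "HISTORY", "val": 4.29498338699341 },
    { "name": "COMEDY",  "val": 2.83974766731262 },
    { "name": "TRAGEDY", "val": -0.454397678375244 }
  ]
};

// Hypothetical helper: names of the classes whose val exceeds the cutoff.
function classesAbove(lbl, cutoff) {
  return lbl.classes
    .filter(function (c) { return c.val > cutoff; })
    .map(function (c) { return c.name; });
}

classesAbove(label, 0);  // ["HISTORY", "COMEDY"]
classesAbove(label, 3);  // ["HISTORY"] (higher cutoff, more selective)
classesAbove(label, -1); // ["HISTORY", "COMEDY", "TRAGEDY"] (lower cutoff, less selective)
```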

Example

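// Take the first 19 documents in the directory as the training set.
var firsthalf = fn.subsequence(
  xdmp.directory("/shakespeare/plays/", "1"), 1, 19);
// Keep a clone for the toArray() calls below.
var plays1 = firsthalf.clone();
// Take up to 37 documents starting at position 20 as the set to classify.
var secondhalf = fn.subsequence(
  xdmp.directory("/shakespeare/plays/", "1"), 20, 37);
var plays2 = secondhalf.clone();
// Build one label per training node from the playtype value
// stored in the document's properties.
var labels = [];
for (var x of firsthalf) {
  var singleClass = [{"name": fn.head(xdmp.documentProperties(xdmp.nodeUri(x))).
                      xpath("//playtype/fn:string()")
                     }];
  labels.push({"classes": singleClass});
}
// Train a supports-form classifier.
var classifier = cts.train(plays1.toArray(), labels,
                           {"classifierType": "supports",
                            "useDbConfig": true,
                            "epsilon": 0.00001
                           });
// The supports form requires the training nodes as the 4th parameter.
cts.classify(plays2.toArray(), classifier, {}, plays1.toArray());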
=>
[
  {
    "classes": [
      { "name": "HISTORY",
        "val": 4.29498338699341
      },
      { "name": "COMEDY",
        "val": 2.83974766731262
      },
      { "name": "TRAGEDY",
        "val": -0.454397678375244
      }
    ]
  },
  {
    "classes": [
      { "name": "HISTORY",
        "val": 3.70210886001587
      },
      { "name": "COMEDY",
        "val": 2.59831714630127
      },
      { "name": "TRAGEDY",
        "val": -0.404506534337997
      }
    ]
  },
  ...
]
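
Because each label sits at the same position as its input node, results can be paired with their inputs by index. The sketch below does this in plain JavaScript and picks the class with the highest val for each input. The helper name bestClassPerInput and the URIs are hypothetical placeholders, not values from the Shakespeare data set; the label values are copied from the output above.

```javascript
// Hypothetical URIs standing in for the classified documents, in input order.
var uris = ["/shakespeare/plays/play-a.xml", "/shakespeare/plays/play-b.xml"];

// The first two labels from the example output.
var results = [
  { "classes": [
    { "name": "HISTORY", "val": 4.29498338699341 },
    { "name": "COMEDY",  "val": 2.83974766731262 },
    { "name": "TRAGEDY", "val": -0.454397678375244 } ] },
  { "classes": [
    { "name": "HISTORY", "val": 3.70210886001587 },
    { "name": "COMEDY",  "val": 2.59831714630127 },
    { "name": "TRAGEDY", "val": -0.404506534337997 } ] }
];

// Hypothetical helper: pair each input with its highest-valued class.
function bestClassPerInput(inputUris, labels) {
  return labels.map(function (label, i) {
    var best = null;
    label.classes.forEach(function (c) {
      if (best === null || c.val > best.val) { best = c; }
    });
    return { uri: inputUris[i], best: best ? best.name : null };
  });
}

bestClassPerInput(uris, results);
// [ { uri: "/shakespeare/plays/play-a.xml", best: "HISTORY" },
//   { uri: "/shakespeare/plays/play-b.xml", best: "HISTORY" } ]
```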

Comments

  • For the classifier and trainer, which machine learning algorithm is used? Does it use tf-idf to preprocess the training set? Also, how can I get '/shakespeare/plays/'? I want to know the format of the training set. And can I find documents in the database that are similar to a new document? I hope you can reply as soon as possible. Thanks!
    • You can find more information about the classifier, including a link to the Shakespeare data and information about the support vector machine approach, in the Training the Classifier chapter of the Search Developer's Guide (http://docs.marklogic.com/guide/search-dev/classifier).
      • Thank you! Some unrelated questions: could I use MarkLogic to do named entity recognition, and if so, how? Also, cts.tokenize returns the tokens of a sentence; can I build a trie or use another method to recognize these tokens as entities with MarkLogic?
        • MarkLogic doesn't do entity recognition, but we have partners like SmartLogic (http://www.smartlogic.com/) who do.