cts:thresholds

cts:thresholds(
   $computed-labels as element(cts:label)*,
   $known-labels as element(cts:label)*,
   [$recall-weight as xs:double?]
) as element(cts:thresholds)?

Summary

Computes precision, recall, the F measure, and thresholds for the classes computed by the classifier, by comparing the computed labels with the known labels for the same set of documents.

Parameters
$computed-labels A sequence of element nodes containing the labels from classification (the output from cts:classify) for a set of documents.
$known-labels A sequence of element nodes containing the known labels for the same set of documents.
$recall-weight The weight given to recall relative to precision in the calculation of the F measure. The number should be non-negative. A value of 0 means F is just precision, and a value of +INF means F is just recall. The default is 1, which gives the harmonic mean of precision and recall.
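
To make the effect of $recall-weight concrete, here is a minimal sketch in plain XQuery. It assumes the F measure is the conventional weighted harmonic mean of precision and recall, which is consistent with the limits described above but is not confirmed by this page. The function local:f-measure is a hypothetical helper, not part of the cts API, and the input values are illustrative.

```xquery
xquery version "1.0";

(: Hypothetical helper, not part of the cts API. It assumes the F measure
   is the conventional weighted harmonic mean of precision and recall. :)
declare function local:f-measure(
  $precision as xs:double,
  $recall as xs:double,
  $recall-weight as xs:double
) as xs:double
{
  (: An infinite weight would produce INF div INF (NaN) in the formula
     below, so handle the recall-only limit explicitly. :)
  if ($recall-weight eq xs:double("INF")) then $recall
  else
    let $w2 := $recall-weight * $recall-weight
    return ((1 + $w2) * $precision * $recall)
           div ($w2 * $precision + $recall)
};

(: Illustrative inputs: precision 1 and recall 0.666667. :)
for $weight in (0, 0.5, 1, 2, xs:double("INF"))
return local:f-measure(1, 0.666667, $weight)
```

Under this assumption, the first result equals the precision and the last equals the recall; the values in between move from one toward the other as the weight grows.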

Usage Notes

Use the output of cts:thresholds to determine the best threshold values for your data, based on a first pass through the first part of your training data. The output provides precision and recall measurements at the calculated threshold for each class. The attributes of each class element within the thresholds element returned by cts:thresholds are defined as follows:

name

The name of the class.

threshold

The threshold that is computed by the classifier to give the best results. The threshold is used by cts:classify when classifying documents, and is defined to be the positive or negative distance from the hyperplane which represents the edge of the class.

precision

A number which represents the fraction of nodes identified in a class that are actually in that class. As this approaches 1, fewer nodes are wrongly placed in the class; if recall is low at the same time, there is a higher probability that you under-classified.

recall

A number which represents the fraction of nodes in a class that were identified by the classifier as being in that class. As this approaches 1, fewer nodes belonging to the class are missed; if precision is low at the same time, there is a higher probability that you over-classified.

F (the F-measure)

A measure which combines precision and recall at the given threshold into a single number, weighted according to the $recall-weight parameter. A $recall-weight of 1 indicates that precision and recall have equal weight. A $recall-weight of 0.5 indicates that precision is weighted 2x recall. A $recall-weight of 2 indicates that recall is weighted 2x precision. A $recall-weight of 0 indicates that the weighting is precision only, and a $recall-weight of +INF (xs:double('+INF')) indicates that the weighting is recall only.
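
For reference, the conventional definitions behind these attributes can be written as below. This is a sketch that assumes cts:thresholds follows the standard weighted F-measure, with β standing for $recall-weight. The symbols TP, FP, and FN (true positives, false positives, and false negatives for a class) are introduced here for illustration and are not names from the API.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}
```

With β = 1 this reduces to 2PR / (P + R). For the TRAGEDY class in the example output (precision 1, recall 0.666667), that gives approximately 0.8, which agrees with the reported f value.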

Example

let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $secondhalf := xdmp:directory("/shakespeare/plays/", "1")[20 to 37]
let $firstlabels := for $x in $firsthalf
        return
        <cts:label>
          <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                                     //playtype/fn:string()}"/>
        </cts:label>
let $secondlabels := for $x in $secondhalf
        return
        <cts:label>
          <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                                     //playtype/fn:string()}"/>
        </cts:label>
let $classifier :=
    cts:train($firsthalf, $firstlabels,
      <options xmlns="cts:train">
        <classifier-type>supports</classifier-type>
      </options>)
let $classifysecond :=
  cts:classify($secondhalf, $classifier,
        <options xmlns="cts:classify"/>,
        $firsthalf)
return
cts:thresholds($classifysecond, $secondlabels)
(:
   This returns the computed thresholds for the second half of
   the plays in a Shakespeare database, based on a classifier
   trained with the first half of the plays.  For example:

<thresholds xmlns="http://marklogic.com/cts">
  <class name="TRAGEDY" threshold="0.221948" precision="1"
         recall="0.666667" f="0.8" count="3"/>
  <class name="COMEDY" threshold="0.114389" precision="0.916667"
         recall="1" f="0.956522" count="11"/>
  <class name="HISTORY" threshold="0.567648" precision="1"
         recall="1" f="1" count="4"/>
</thresholds>
:)
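
Once you have the thresholds element, ordinary XPath is enough to inspect it. The sketch below is an illustration rather than part of the cts API: it embeds a literal copy of the sample output above in place of a real cts:thresholds call, and the 0.9 cutoff held in $min-f is a hypothetical value chosen only for the example.

```xquery
xquery version "1.0";
declare namespace cts = "http://marklogic.com/cts";

(: Literal copy of the sample output; in practice this would be the
   value returned by cts:thresholds. :)
let $thresholds :=
  <thresholds xmlns="http://marklogic.com/cts">
    <class name="TRAGEDY" threshold="0.221948" precision="1"
           recall="0.666667" f="0.8" count="3"/>
    <class name="COMEDY" threshold="0.114389" precision="0.916667"
           recall="1" f="0.956522" count="11"/>
    <class name="HISTORY" threshold="0.567648" precision="1"
           recall="1" f="1" count="4"/>
  </thresholds>
(: Hypothetical cutoff, chosen for illustration only. :)
let $min-f := 0.9
(: List the classes whose F measure falls below the cutoff, worst first. :)
for $class in $thresholds/cts:class
where xs:double($class/@f) lt $min-f
order by xs:double($class/@f)
return fn:concat($class/@name, ": f=", $class/@f)
```

For the sample data this returns the single string "TRAGEDY: f=0.8", pointing at the class whose threshold may deserve another look.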