MarkLogic Server includes an XML support vector machine (SVM) classifier. This chapter describes the classifier and how to use it on your content, and includes the following sections:
The classifier is a set of APIs that allow you to define classes, or categories of nodes. By running samples of classes through the classifier to train it on what constitutes a given class, you can then run that trained classifier on unknown documents or nodes to determine to which classes each belongs. The process of classification uses the full-text indexing capabilities of MarkLogic Server, as well as its XML-awareness, to perform statistical analysis of terms in the training content to determine class membership. This section describes the concepts behind the classifier and includes the following parts:
There are two basic steps to using the classifier: training and classification. Training is the process of taking content that is known to belong to specified classes and creating a classifier on the basis of that known content. Classification is the process of taking a classifier built with such a training content set and running it on unknown content to determine class membership for the unknown content. Training is an iterative process whereby you build the best classifier possible, and classification is a one-time process designed to run on unknown content.
The MarkLogic Server classifier implements a support vector machine (SVM). An SVM classifier uses a well-known algorithm to determine membership in a given class, based on training data. For background on the mathematics behind support vector machine (SVM) classifiers, try doing a web search for svm classifier, or start by looking at the information on Wikipedia.
The basic idea is that the classifier takes a set of training content representing known examples of classes and, by performing statistical analysis of the training content, uses the knowledge gleaned from the training content to decide to which classes other unknown content belongs. You can use the classifier to gain knowledge about your content based on the statistical analysis performed during training.
Traditional SVM classifiers perform the statistical analysis using term frequency as input to the support vector machine calculations. The MarkLogic XML SVM classifier takes advantage of MarkLogic Server's XML-aware full-text indexing capabilities, so the terms that act as input to the classifier can include content (for example, words), structure information (for example, elements), or a combination of content and structure (for example, element-word relationships). All of the MarkLogic Server index options that affect terms are available as options in the classifier API, so you can use a wide variety of indexing techniques to tune the classifier to work the best for your sample content.
First you define your classes on a set of training content, and then the classifier uses those classes to analyze other content and determine its classification. When the classifier analyzes the content, there are two sometimes conflicting measurements it uses to help determine if the information in the new content belongs in or out of a class:
When you are tuning your classifier, you need to find a balance between high precision and high recall. That balance depends on what your application goals and requirements are. For example, if you are trying to find trends in your content, then high precision is probably more important; you want to ensure that your analysis does not include irrelevant nodes. If you need to identify every instance of some classification, however, you probably need a high recall, as missing any members would go against your application goals. For most applications, you probably need somewhere in between. The process of training your classifier is where you determine the optimal values (based on your training content set) to make the trade-offs that make sense to your application.
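These two measurements are commonly combined into a single F-measure, the harmonic mean of precision and recall; it is the value reported in the f attribute of the cts:thresholds output shown later in this chapter. A minimal sketch of the calculation in XQuery, with illustrative precision and recall values:

```xquery
(: F-measure: harmonic mean of precision and recall :)
let $precision := 0.9  (: fraction of nodes assigned to the class that belong there :)
let $recall := 0.6     (: fraction of true class members that were found :)
return 2 * $precision * $recall div ($precision + $recall)
(: 0.72 -- a single score balancing the two measurements :)
```

A high F-measure requires both measurements to be reasonably high, which is why it is a useful summary when you are tuning the trade-off between them.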
There are two main things that the computations behind the XML SVM classifier do:
There can be any number of classes. A term vector is a representation of all of the terms (as defined by the index options) in a node. Therefore, classes consist of sets of term vectors which have been deemed similar enough to belong to the same class.
Imagine for a moment that each term forms a dimension. It is easy to visualize what a 2-dimensional picture of a class looks like (imagine an x-y graph) or even a 3-dimensional picture (imagine a room with height, width, and length). It becomes difficult, however, to visualize what the picture of these dimensions looks like when there are more than three dimensions. That is where hyper-planes become a useful concept.
Before going deeper into the concept of hyper-planes, consider a content set with two classes, one of squares and one of triangles. In the following figures, each square or triangle represents a term vector that is a member of the square or triangle class, respectively.
Now try to draw a line to separate the triangles from the squares. In this case, you can draw such a line that nicely divides the two classes as follows:
If this were three dimensions, instead of a line between the classes it would be a plane between the classes. When the number of dimensions grows beyond three, the extension of the plane is called a hyper-plane; it is the generalized representation of a boundary of a class (sometimes called the edge of a class).
The previous examples are somewhat simplified; they are set up such that the hyper-planes can be drawn such that one class is completely on one side and the other is completely on the other. For most real-world content, there are members of each class on the other side of the boundaries as follows:
In these cases, you can draw other lines parallel to the boundaries (or in the n-dimensional cases, other hyper-planes). These other lines represent the thresholds for the classes. The distance between the boundary line and the threshold line represents the threshold value, which is a negative number indicating how far the outlier members of the class are from the class boundary. The following figure represents these thresholds.
The dotted lines represent some possible thresholds. Lines closer to the boundary represent thresholds with higher precision (but not complete precision), while lines farther from the boundary represent higher recall. Members of the triangle class that fall on the square side of the boundary are not, strictly speaking, in the class; but if they fall within the threshold you choose, they are still considered part of the class.
One of the classifier APIs (cts:thresholds) helps you find the right thresholds for your training content set so you can get the right balance between precision and recall when you run unknown content against the classifier to determine class membership.
The following figure shows the triangle class boundary, including the precision and recall calculations based on a threshold (the triangle class is below the threshold line):
To find the best thresholds for your content, you need to train the classifier with sample content that represents members of all of the classes. It is very important to find good training samples, as the quality of the training will directly impact the quality of your classification.
The samples for each class should be statistically relevant, and should have samples that include both solid examples of the class (that is, samples that fall well into the positive side of the threshold from the class boundary) and samples that are close to the boundary for the class. The samples close to the boundary are very important, because they help determine the best thresholds for your content. For more details about training sets and setting the threshold, see Creating a Training Set and Methodology For Determining Thresholds For Each Class.
The classifier has three XQuery built-in functions. This section gives an overview and explains some of the features of the API, and includes the following parts:
For details about the syntax and usage of the classifier API, see the MarkLogic XQuery and XSLT Function Reference.
The classifier API includes three XQuery functions:
You use these functions to compute classifiers from training nodes. Creating a classifier specification is an iterative process whereby you create training content, train the classifier (using cts:train) with the training content, test your classifier on some other training content (using cts:classify), compute the thresholds on the training content (using cts:thresholds), and repeat this process until you are satisfied with the results. For details about the syntax and usage of the classifier API, see the MarkLogic XQuery and XSLT Function Reference.
The classifier APIs take nodes and elements, so you can either use XQuery to construct the data for the nodes you are classifying or training, or you can store them in the database (or somewhere else), whichever is more convenient. Because the APIs take nodes as parameters, there is a lot of flexibility in how you store your training and classification data.
There is an exception to this: if you are using the supports form of the classifier, then the training data must reside in the database, and you must pass in the training nodes when you perform classification (that is, when you run cts:classify) on unknown content.
The classifier API has many options, and is therefore extremely tunable. You can choose different index options and kernel types for cts:train, as well as specify limits and thresholds. Changing the kernel type for cts:train affects both the results you get from classification and its performance. Because classification is an iterative process, experimenting with your own content set tends to produce better results from the classifier. You might change some parameters during different iterations and see which gives the better classification for your content.
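As a sketch of such an iteration, you might vary the options passed to cts:train between runs. Here $trainingnodes and $labels stand for the training nodes and label sequence described later in this chapter; the <classifier-type> option appears in the examples that follow, but the kernel and index option names below are assumptions included only to illustrate the pattern, so verify the exact option names and legal values in the MarkLogic XQuery and XSLT Function Reference:

```xquery
cts:train($trainingnodes, $labels,
  <options xmlns="cts:train">
    <classifier-type>supports</classifier-type>
    (: the following option names are assumptions -- check the
       Function Reference for the exact names and legal values :)
    <kernel>gaussian</kernel>
    <stemmed>true</stemmed>
  </options>)
```

Between iterations you would change one option at a time, re-run classification on known content, and compare the resulting precision and recall.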
The following section describes the differences between the supports and weights forms of the classifier. For details on what each option of the classifier does, see the MarkLogic XQuery and XSLT Function Reference.
There are two forms of the classifier:
supports: allows the use of some of the more sophisticated kernels. It encodes the classifier by reference to specific documents in the training set, and is therefore more accurate because the whole training document can be used for classification; however, that means the whole training set must be available during classification, and it must be stored in the database. Furthermore, since constructing a term vector is exactly equivalent to indexing, each time the classifier is invoked it regenerates the index terms for the whole training set. On the other hand, the actual representation of the classifier (the XML returned from cts:train) may be a lot more compact. The other advantage of the supports form of the classifier is that it can give you error estimates for specific training documents, which may be a sign that those documents are misclassified or that other parameters are not set to optimal values.

weights: encodes weights for each of the terms. For mathematical reasons, it cannot be used with the Gaussian or Geodesic kernels, although for many problems those kernels give the best results. Since there will not be a weight for every term in the training set (because of term compression), this form of the classifier is intrinsically less precise. If there are a lot of classes and a lot of terms, the classifier representation itself can get quite large. However, there is no need to have the training set on hand during classification, nor to construct term vectors from it (in essence, to regenerate the index terms), so cts:classify runs much faster with the weights form of the classifier.

Which one you choose depends on your answers to several questions and criteria, such as performance (does the supports form take too much time and resources for your data?), accuracy (are you happy with the results you get with the weights form with your data?), and other factors you might encounter while experimenting with the different forms. In general, the classifier is extremely tunable, and getting the best results for your data is an iterative process, both in what you use for training data and in what options you use in your classification.
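For example, producing the weights form changes only the classifier-type option in the training call. Here $trainingnodes and $labels stand for the training nodes and label sequence described later in this chapter, and the option value weights is inferred from the form's name (supports is the value used in the examples later in this chapter), so verify it in the Function Reference:

```xquery
cts:train($trainingnodes, $labels,
  <options xmlns="cts:train">
    <classifier-type>weights</classifier-type>
  </options>)
(: with the weights form, cts:classify does not need the training
   nodes passed to it -- the classifier specification stands alone :)
```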
You can choose different kernels during the training phase. The kernels are mapping functions used to determine the distance of a term vector from the edge of the class. For a description of each of the kernel mapping functions, see the documentation for cts:train in the MarkLogic XQuery and XSLT Function Reference.
As part of the iterative nature of training to create a classifier specification, one of the overriding goals is to find the best threshold values for your classes and your content set. Ideally, you want to find thresholds that strike a balance between good precision and good recall (for details on precision and recall, see XML SVM Classifier). You use the cts:thresholds function to calculate the thresholds based on a training set. For an overview of the iterative process of finding the right thresholds, see Methodology For Determining Thresholds For Each Class.
Because the classifier operates from an XQuery context, and because it is built into MarkLogic Server, it is intrinsically XML-aware. This has many advantages. You can choose to classify based on a particular element or element hierarchy (or even a more complicated XML construct), and then use that classifier against either other like elements or element hierarchies, or even against a totally different set of element or element hierarchies. You can perform XML-based searches to find the best training data. If you have built XML structure into your content, you can leverage that structure with the classifier.
For example, if you have a set of articles that you want to classify, you can classify against only the <executive-summary> section of the articles, which can help exclude references to other content sections, and which might have a more universal style and language than the more detailed sections of the articles. This approach might result in using terms that are highly relevant to the topic of each article to determine class membership.
This section describes the training content set you use to create a classifier, and includes the following parts:
The quality of your classification can only be as good as the training set you use to run the classifier. It is extremely important to choose sample training nodes that not only represent obvious examples of a class, but also samples which represent edge cases that belong in or out of a class.
Because the process of classification is about determining the edges of the classes, having good samples that are close to this edge is important. You cannot always determine what constitutes an edge sample, though, by examining the training sample. It is therefore good practice to get as many different kinds of samples in the training set as possible.
As part of the process of training the classifier, you might need to add more samples, verify that the samples are actually good samples, or even take some samples away (if they turn out to be poor samples) from some classes. Also, you can specify negative samples for a class. It is an iterative process of finding the right training data and setting the various training options until you end up with a classifier that works well for your data.
The second parameter to cts:train is a label specification, which is a sequence of cts:label elements, each one having a single cts:class child. Each cts:label element represents a node in the training set. The cts:label elements must be in the order corresponding to the specified training nodes, and they each specify to which class the corresponding training node belongs. For example, the following cts:label nodes specify that the first training node is in the class comedy, the second in the class tragedy, and the third in the class history:
<cts:label>
  <cts:class name="comedy"/>
</cts:label>
<cts:label>
  <cts:class name="tragedy"/>
</cts:label>
<cts:label>
  <cts:class name="history"/>
</cts:label>
Because the labels must be in the order corresponding to the training nodes, you might find it convenient to generate the labels from the training nodes. For example, the following code extracts the class name for each label from a property named playtype stored in the properties document corresponding to each training node:
for $play in xdmp:directory("/plays/", "1")
return
  <cts:label>
    <cts:class name="{
      xdmp:document-properties(xdmp:node-uri($play))//playtype/text()
    }"/>
  </cts:label>
If you have training samples that represent negative samples for a class (that is, they are examples of what does not belong in the class), you can label them as such by specifying the val="-1" attribute on the cts:class element as follows:
<cts:class name="comedy" val="-1"/>
Additionally, you can include multiple classes in a label (because membership in one class is independent of membership in another). For example:
<cts:label> <cts:class name="comedy" val="-1"/> <cts:class name="tragedy"/> <cts:class name="history"/> </cts:label>
Use the following methodology to determine appropriate per-class thresholds for classification. Do not specify the thresholds option for the initial run; its default value is a very large negative number, so that run measures the distance from the actual class boundary for each node in the training set. Any time you pass thresholds to cts:train, the thresholds apply to cts:classify. You can pass them either with cts:train or with cts:classify, and the effect is the same.
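A sketch of this round trip follows, with $classifyoutput, $knownlabels, $classifier, and $unknownnodes as hypothetical variables for the pieces described elsewhere in this chapter; the exact shape in which the thresholds element is passed through the options is an assumption to verify against the MarkLogic XQuery and XSLT Function Reference:

```xquery
(: compute per-class thresholds from a classification run over
   known content, then apply them when classifying unknown content :)
let $thresholds := cts:thresholds($classifyoutput, $knownlabels)
return
  cts:classify($unknownnodes, $classifier,
    <options xmlns="cts:classify">
      { $thresholds }  (: assumed: thresholds supplied via options :)
    </options>)
```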
This section describes the steps needed to train the classifier against a content set consisting of the plays of William Shakespeare. This is meant as a simple example to illustrate how to use the classifier, not necessarily as an example of the best results you can get out of the classifier. The steps are divided into the following parts:
When you are creating a classifier, the first step is to choose some training content. In this example, we will use the plays of William Shakespeare as the training set from which to create a classifier.
The Shakespeare plays are available in XML at the following URL (subject to the copyright restrictions stated in the plays):
http://www.oasis-open.org/cover/bosakShakespeare200.html
This example assumes the plays are loaded into a MarkLogic Server database under the directory /shakespeare/plays/
. There are 37 plays.
After deciding on the training set, the next step is to choose classes in which you divide the set, as well as choosing labels for those classes. For Shakespeare, the classes are COMEDY
, TRAGEDY
, and HISTORY
. You must decide which plays belong to each class. To determine which Shakespeare plays are comedies, tragedies, and histories, consult your favorite Shakespeare scholars (there is reasonable, but not complete agreement about which plays belong in which classes).
For convenience, we will store the classes in the properties document at each play URI. To create the properties for each document, perform something similar to the following for each play (inserting the appropriate class as the property value):
xdmp:document-set-properties("/shakespeare/plays/hamlet.xml", <playtype>TRAGEDY</playtype>)
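Rather than issuing that call by hand 37 times, you can drive it from a small mapping built from your class assignments; the URIs and classes below (other than hamlet.xml, which appears above) are a hypothetical fragment for illustration:

```xquery
let $classes :=
  <plays>
    <play uri="/shakespeare/plays/hamlet.xml" type="TRAGEDY"/>
    <play uri="/shakespeare/plays/othello.xml" type="TRAGEDY"/>
    <play uri="/shakespeare/plays/much_ado.xml" type="COMEDY"/>
  </plays>
(: set the playtype property on each play's properties document :)
for $p in $classes/play
return
  xdmp:document-set-properties(string($p/@uri),
    <playtype>{string($p/@type)}</playtype>)
```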
For details on properties in MarkLogic Server, see Properties Documents and Directories in the Application Developer's Guide.
Next, we will divide the training set into two parts, where we know the class of each node in both parts. We will use the first part to train the classifier and the second part to validate the classifier built from the first part. The split should be statistically random; for simplicity, we will just take the first half in the order that the documents are returned from the xdmp:directory call. You can choose a more sophisticated randomization technique if you like.
As we are taking the first half of the plays as the training content, we need labels for each node (in this example, we use the document node of each play as the training node). To create the labels on the first half of the content, run a query statement similar to the following:
for $x in xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
return
  <cts:label>
    <cts:class name="{
      xdmp:document-properties(xdmp:node-uri($x))//playtype/text()
    }"/>
  </cts:label>
For simplicity, this example uses the first 19 items of the content set as the training nodes. The samples you use should use a statistically random sample of the content for the training set, so you might want to use a slightly more complicated method (that is, one that ensures randomness) for choosing the training set.
Next, you run cts:train with your training content and labels. The following code constructs the labels and runs cts:train to generate a classifier specification:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $labels :=
  for $x in $firsthalf
  return
    <cts:label>
      <cts:class name="{
        xdmp:document-properties(xdmp:node-uri($x))//playtype/text()
      }"/>
    </cts:label>
return
  cts:train($firsthalf, $labels,
    <options xmlns="cts:train">
      <classifier-type>supports</classifier-type>
    </options>)
You can either save the generated classifier specification in a document in the database or run this code dynamically in the next step.
Next, you take the classifier specification created with the first half of the training set and run cts:classify on the second half of the content set, as follows:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $secondhalf := xdmp:directory("/shakespeare/plays/", "1")[20 to 37]
let $classifier :=
  let $labels :=
    for $x in $firsthalf
    return
      <cts:label>
        <cts:class name="{
          xdmp:document-properties(xdmp:node-uri($x))//playtype/text()
        }"/>
      </cts:label>
  return
    cts:train($firsthalf, $labels,
      <options xmlns="cts:train">
        <classifier-type>supports</classifier-type>
      </options>)
return
  cts:classify($secondhalf, $classifier,
    <options xmlns="cts:classify"/>, $firsthalf)
Next, calculate cts:label elements for the second half of the content and use them to compute the thresholds to use with the classifier. The following code runs cts:train and cts:classify again for clarity, although the output of each could be stored in a document:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $secondhalf := xdmp:directory("/shakespeare/plays/", "1")[20 to 37]
let $firstlabels :=
  for $x in $firsthalf
  return
    <cts:label>
      <cts:class name="{
        xdmp:document-properties(xdmp:node-uri($x))//playtype/text()
      }"/>
    </cts:label>
let $secondlabels :=
  for $x in $secondhalf
  return
    <cts:label>
      <cts:class name="{
        xdmp:document-properties(xdmp:node-uri($x))//playtype/text()
      }"/>
    </cts:label>
let $classifier :=
  cts:train($firsthalf, $firstlabels,
    <options xmlns="cts:train">
      <classifier-type>supports</classifier-type>
    </options>)
let $classifysecond :=
  cts:classify($secondhalf, $classifier,
    <options xmlns="cts:classify"/>, $firsthalf)
return cts:thresholds($classifysecond, $secondlabels)
This produces output similar to the following:
<thresholds xmlns="http://marklogic.com/cts">
  <class name="TRAGEDY" threshold="-0.00215207" precision="1"
         recall="0.666667" f="0.8" count="3"/>
  <class name="COMEDY" threshold="0.216902" precision="0.916667"
         recall="1" f="0.956522" count="11"/>
  <class name="HISTORY" threshold="0.567648" precision="1"
         recall="1" f="1" count="4"/>
</thresholds>
Finally, you can analyze the results from cts:thresholds. Ideally, the thresholds would be zero; in practice, a negative number relatively close to zero makes a good threshold. The threshold for TRAGEDY above is quite good, but the thresholds for the other classes are not as good. If you want better thresholds, you can try running everything again with different parameters for the kernel, for the indexing options, and so on. You can also change your training data (to try to find better examples of comedy, for example).
Once you are satisfied with your classifier, you can run it on other content. For example, you can try running it on SPEECH elements in the Shakespeare plays, or try it on plays by other playwrights.
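Such a run might look like the following sketch, where $classifier and $firsthalf are the classifier specification and training nodes built in the earlier steps (the supports form requires the training nodes at classification time), and the /other-plays/ directory is a hypothetical location for the unknown content:

```xquery
(: classify unknown content with the trained supports-form classifier :)
let $unknown := xdmp:directory("/other-plays/", "1")
return
  cts:classify($unknown, $classifier,
    <options xmlns="cts:classify"/>, $firsthalf)
```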