Search Developer's Guide — Chapter 5

Relevance Scores: Understanding and Customizing

Search results in MarkLogic Server are returned in relevance order; that is, the result that is most relevant to the cts:query expression in the search is the first item in the search return sequence, and the least relevant is the last. There are several tools available to control the relevance score associated with a search result item. This chapter describes the different methods available to calculate relevance.

Understanding How Scores and Relevance are Calculated

When you perform a cts:search operation, MarkLogic Server produces a result set that includes items matching the cts:query expression and, for each matching item, a score. The score is a number that is calculated based on statistical information, including the number of documents in a database, the frequency with which the search terms appear in the database, and the frequency with which the search term appears in the document. The relevance of a returned search item is determined based on its score compared with other scores in the result set, where items with higher scores are deemed to be more relevant to the search. By default, search results are returned in relevance order, so changing the scores can change the order in which search results are returned.

As part of a cts:search expression, you can specify one of several methods for calculating the score, each of which uses a different formula in its score calculation. The sections that follow describe each method.

log(tf)*idf Calculation

The logtfidf method of relevance calculation is the default relevance calculation, and it corresponds to the score-logtfidf option of cts:search. The logtfidf method takes into account term frequency (how often a term occurs in a single fragment) and document frequency (how many documents the term occurs in) when calculating the score. Most search engines use a relevance formula that is derived from some computation that takes into account term frequency and document frequency.

The logtfidf method (the default scoring method) uses the following formula to calculate relevance:

log(term frequency) * (inverse document frequency)

The term frequency is a normalized number representing how many times the term occurs in a document. The term frequency is normalized to take into account the size of the document, so that a word that occurs 10 times in a 100 word document will get a higher score than a word that occurs 100 times in a 1,000 word document.

The inverse document frequency is defined as:

log(1/df)

where df (document frequency) is the number of documents in which the term occurs.

For most search-engine style relevance calculations, the score-logtfidf method provides the most meaningful relevance scores. Inverse document frequency (IDF) provides a measurement of how 'information rich' a term is. For example, a search for 'the' or 'dog' would probably put more emphasis on the occurrences of the term 'dog' than of the term 'the'.

log(tf) Calculation

The option score-logtf for cts:search computes scores using the logtf method, which does not take into account how many documents have the term. The logtf method uses the following formula to calculate scores:

log(term frequency)

where the term frequency is a normalized number representing how many times the term occurs in a document. The term frequency is normalized to take into account the size of the document, so that a word that occurs 10 times in a 100 word document will get a higher score than a word that occurs 100 times in a 1,000 word document.

When you use the logtf method, scores are based entirely on how many times a document matches the search term, and do not take into account the 'information richness' of the search terms.
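
As a rough illustration of the difference between the two methods, the following sketch runs the same query once with score-logtfidf and once with score-logtf, and prints the score of the first few matches. The search terms are the ones used in the example above; the scores you see depend entirely on your data, so no output is shown here.

```xquery
xquery version "1.0-ml";
(: run one query under two scoring methods and compare the scores :)
let $query := cts:or-query((cts:word-query("the"), cts:word-query("dog")))
for $method in ("score-logtfidf", "score-logtf")
return
  element method { attribute name { $method },
    for $x in cts:search(fn:doc(), $query, $method)[1 to 5]
    return
      element hit { attribute uri { xdmp:node-uri($x) },
                    attribute score { cts:score($x) } }
  }
```

If the ordering of the hits differs between the two methods, the difference comes from document frequency, which only score-logtfidf takes into account.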

Simple Term Match Calculation

The option score-simple on cts:search performs a simple term-match calculation to compute the scores. The score-simple method gives a score of 8*weight for each matching term in the cts:query expression, and then scales the score up by multiplying by 256. It does not matter how many times a given term matches (that is, the term frequency does not matter); each match contributes 8*weight to the score. For example, the following query (assume the default weight of 1) would give a score of 8*256=2048 for any fragment with one or more matches for 'hello', a score of 16*256=4096 for any fragment that also has one or more matches for 'goodbye', or a score of zero for fragments that have no matches for either term:

cts:or-query(("hello", "goodbye"))

Use this option if you want the scores to only reflect whether a document matches terms in the query, and you do not want the score to be relative to frequency or 'information-richness' of the term.
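
The following sketch shows one way to run the query above with score-simple and inspect the resulting scores. It spells out each term as a cts:word-query, which is an illustrative choice rather than a requirement.

```xquery
xquery version "1.0-ml";
(: score depends only on which terms match, not on how often :)
for $x in cts:search(fn:doc(),
            cts:or-query((cts:word-query("hello"),
                          cts:word-query("goodbye"))),
            "score-simple")
return
  element hit { attribute uri { xdmp:node-uri($x) },
                attribute score { cts:score($x) } }
```

Based on the calculation described above, and assuming the default weight of 1, fragments that match only one of the terms should report 2048 and fragments that match both should report 4096.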

Random Score Calculation

The option score-random on cts:search computes a randomly-generated score for each search match. You can use this to randomly choose fragments matching a query. If you perform the same search multiple times using the score-random option, you will get different ordering each time (because the scores are randomly generated at runtime for each search).
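
For example, a minimal sketch of random sampling might take the first few items of a randomly scored search. The search term and the sample size of three are arbitrary choices for illustration.

```xquery
xquery version "1.0-ml";
(: pick up to three matching documents at random :)
cts:search(fn:doc(), cts:word-query("dog"), "score-random")[1 to 3]
```

Running this repeatedly should return a different selection each time, provided more than three fragments match.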

Term Frequency Normalization

The scoring methods that take into account term frequency (score-logtfidf and score-logtf) will, by default, normalize the term frequency (how many search term matches there are for a document) based on the size of the document. The idea of this normalization is to take into account how frequently a term occurs in the document, relative to the other documents in the database. You can think of this as the density of terms in a document, as opposed to simply the frequency of the terms. The term frequency normalization makes a document that has, for example, 10 occurrences of the word "dog" in a 10,000,000-word document have a lower relevance than a document that has 10 occurrences of the word "dog" in a 100-word document. With the default term frequency normalization of scaled-log, the smaller document would have a higher score (and therefore be more relevant to the search), because it has a greater 'term density' of the word "dog". For most search applications, this behavior is desirable.

If you would like to change that behavior, you can set the tf normalization option on the database configuration to lessen or eliminate the effects of the size of the matching document in the score calculation, which in turn would strengthen the effect of its term frequency (the number of matches in that document). The unscaled-log option does no scaling based on document size, and the scaled-log option (the default) does the maximum scaling of the document based on document size. Additionally, there are four intermediate settings, weakest-scaled-log, weakly-scaled-log, moderately-scaled-log, and strongly-scaled-log, which have increasing degrees of scaling in between none and the most scaling. If you change this setting in the database and reindexer enable is set to true, then the database will begin reindexing.
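
You can change this setting in the Admin Interface or script it. The following is a sketch that uses the Admin API function admin:database-set-tf-normalization; the database name "Documents" and the chosen value are assumptions for illustration. Keep in mind that saving this change can trigger reindexing, as noted above.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name :)
let $config := admin:get-configuration()
let $db-id := xdmp:database("Documents")
let $config :=
  admin:database-set-tf-normalization($config, $db-id, "weakly-scaled-log")
return admin:save-configuration($config)
```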

How Fragmentation and Index Options Influence Scores

Scores are calculated based on index data, and therefore based on unfiltered searches. That has several implications for scores:

  • Scores are fragment-based, so term frequency and document frequency are calculated based on term frequency per fragment and fragment frequency respectively.
  • Scores are based on unfiltered searches, so they include false-positive results.

Because scores are based on fragments and unfiltered searches, index options will affect scores, and in some cases will make the scores more 'accurate'; that is, base the scores on searches that return fewer false-positive results. For example, if you have word positions enabled in the database configuration, searches for phrases of three or more terms will have fewer false-positive matches, thereby improving the accuracy of the scores.

For details on unfiltered searches and how you can tell if there are false-positive matches, see 'Using Unfiltered Searches for Fast Pagination' in the Query Performance and Tuning Guide.
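
One quick way to check whether a query produces false positives is to compare the unfiltered estimate with the filtered count, as in the following sketch. The three-word phrase is an arbitrary example.

```xquery
xquery version "1.0-ml";
(: xdmp:estimate is resolved from the indexes only (unfiltered);
   fn:count runs the filtered search :)
let $query := cts:word-query("very large dog")
return
  element counts {
    attribute unfiltered { xdmp:estimate(cts:search(fn:doc(), $query)) },
    attribute filtered { fn:count(cts:search(fn:doc(), $query)) }
  }
```

If the unfiltered number is larger than the filtered number, the index resolution includes false-positive matches, and those matches also feed into the score calculation. Note that fn:count on a filtered search can be expensive on a large database, so treat this as a diagnostic rather than something to run in production code.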

Using Weights to Influence Scores

Use a weight in a query sub-expression to either boost or lower the sub-expression contribution to the relevance score.

For example, you can specify weights for leaf-level cts:query constructors, such as cts:word-query and cts:element-value-query; for details, see the XQuery and XSLT Reference Guide. You can also specify weights in the equivalent Search API abstractions, such as the structured query constructs value-query and word-constraint-query, or when defining a word or value constraint in query options.

The default weight is 1.0. Use the following guidelines for choosing custom weights:

  • To boost the score contribution, set the weight higher than 1.0.
  • To lower the score contribution, set the weight between 0 and 1.0.
  • To contribute nothing to the score, set the weight to 0.
  • To make the score contribution negative, set the weight to a negative number.

Scores are normalized, so a weight is not an absolute multiplier on the score. Instead, weights indicate how much terms from a given query sub-expression are weighted in comparison to other sub-expressions in the same expression. A weight of 2.0 doubles the contribution to the score for terms that match that query. Similarly, a weight of 0.5 halves the contribution to the score for terms that match that query. In some cases, the score reaches a maximum, so a weight of 2.0 and a weight of 20,000 can yield the same contribution to the score.

Adding weights is particularly useful if you have several components in a query expression, and you want matches for some parts of the expression to be weighted more heavily than other parts. For an example of this, see Increase the Score for some Terms, Decrease for Others.
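
As a further sketch, weights can make a match in one element count for more than the same match in another element. The element names title and body are hypothetical; substitute names from your own content.

```xquery
xquery version "1.0-ml";
(: a match in title contributes more to the score than a match in body :)
cts:search(fn:doc(),
  cts:or-query((
    cts:element-word-query(xs:QName("title"), "dog", (), 2.0),
    cts:element-word-query(xs:QName("body"), "dog", (), 0.5) )) )
```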

Proximity Boosting With the distance-weight Option

If you have the word positions indexing option enabled in your database, you can use the distance-weight option to the leaf-level cts:query constructors, and then all of the terms passed into that cts:query constructor will consider the proximity of the terms to each other for the purposes of scoring. This proximity boosting will make documents with matches close together have higher scores. Because search results are sorted by score, it will have the effect of making documents having the search terms close together have higher relevance ranking. This section provides some examples that use the distance-weight option, along with explanations of the examples.

Example of Simple Proximity Boosting

The distance weight is only applied to the matches for cts:query constructors in which the distance-weight occurs. For example, consider the following cts:query constructor:

cts:word-query(("cat", "dog"), "distance-weight=3")

If one document has an instance of "cat" very near "dog", and another document has the same number of "cat" and "dog" terms, but they are not very near, then the one with the "cat" near "dog" will have a higher score.

For example, consider the following:

xquery version "1.0-ml";
(: make sure word positions are enabled in the database :)
(: 
   create 3 documents, then run two searches, one with
   distance-weight and one without, printing out the scores
:)
xdmp:document-insert("/2.xml", 
  <p>The cat is pretty near a dog.</p>) ;

xdmp:document-insert("/1.xml", 
  <p>The cat dog is very near.</p>) ;

xdmp:document-insert("/3.xml", 
  <p>The cat is not very near the very large dog.</p>) ;

for $x in (cts:search(fn:doc(), cts:word-query(("cat", "dog") , 
                               "distance-weight=3" ) ),
           cts:search(fn:doc(), cts:word-query(("cat", "dog") ) ) )
return
element hit{attribute uri {xdmp:node-uri($x)}, 
            attribute score {cts:score($x)},
            attribute text{fn:string($x/p)}}

This returns the following results:

<hit uri="/1.xml" score="146" text="The cat dog is very near."/>
<hit uri="/2.xml" score="140" text="The cat is pretty near a dog."/>
<hit uri="/3.xml" score="135" 
     text="The cat is not very near the very large dog."/>
<hit uri="/3.xml" score="72" 
     text="The cat is not very near the very large dog."/>
<hit uri="/2.xml" score="72" text="The cat is pretty near a dog."/>
<hit uri="/1.xml" score="72" text="The cat dog is very near."/>

Notice that the first three hits use the distance-weight, and the ones with the terms closer together have higher scores, and thus rank higher in the search. The last three hits have the same score because they all have the same number of each term in the cts:query and there is no proximity taken into account in the scores.

Using Proximity Boosting With cts:and-query Semantics

Because the distance-weight option applies to the terms in individual cts:query constructors, the terms are combined as an or-query (that is, any term match is a match for the query). Therefore, the example above would also return results for documents that contain "cat" and not "dog" and vice versa. If you want to have and-query semantics (that is, all terms must match for the query to match) and also have proximity boosting, you will have to construct a cts:query that does an and of all of the terms in addition to the cts:query with the distance-weight option.

For example:

xquery version "1.0-ml";
cts:search(fn:doc(), cts:and-query((
                       cts:word-query("cat"),
                       cts:word-query("dog"),
                       cts:word-query(("cat", "dog") , 
                               "distance-weight=3" ) )) )

The difference between this query and the previous one is that the previous one would return a document that contained "cat" but not "dog" (or vice versa), and this one will only return documents containing both "cat" and "dog".

If you have a large corpus of documents and you expect to have many matches for your searches, then you might find you do not need to use the cts:and-query approach. A large corpus has this effect because document frequency is taken into account in the relevance calculation, as described in Understanding How Scores and Relevance are Calculated. You might find that the most relevant documents still float to the top of your search even without the cts:and-query. What you do will depend on your application requirements, your preferences, and your data.

Using cts:near-query to Achieve Proximity Boosting

Another technique that makes results with closer proximity have higher scores is to use cts:near-query. Searches that use the cts:near-query constructor will take proximity into account when calculating scores, as long as the word positions index option is enabled in the database. Additionally, you can use the distance-weight parameter to further boost the effect of proximity on scoring.

Because cts:near-query takes a distance argument, you have to think about how near you want results to be in order for them to match. With the distance parameter to cts:near-query, there is a tradeoff between the size of the distance and performance. The higher the number for the distance, the more work MarkLogic Server does to resolve the query. For many queries, this amount of work might be very small, but for some complex queries it can be noticeable.

To construct a query that uses cts:near-query for proximity boosting, pass the cts:query for your search as the first parameter to a cts:near-query, and optionally add a distance-weight parameter to further boost the proximity. The cts:near-query matches will always take distance into account, but setting a distance-weight will further boost the proximity weight. For example, consider how the following query, which uses the same data as the above examples, produces similar results:

xquery version "1.0-ml";
cts:search(fn:doc(), 
   cts:near-query(
     cts:and-query((
        cts:word-query("cat"),
        cts:word-query("dog") )), 
     1000, (), 3) )

This query uses a distance of 1,000, so documents in which "cat" and "dog" are more than 1,000 words apart are not included in its result. The size you use is dependent on your data and the performance characteristics of your searches. If you are more concerned about missing documents where the matches are more than 1,000 words apart, then you should raise that number; if you are seeing performance issues and want faster performance, and you are OK with missing results that are above the distance threshold (which are probably not relevant anyway), then you should make the number smaller. For databases with a large number of documents, keep in mind that not returning the documents with words that are far apart from each other will probably result in very similar search results, especially for the most relevant hits (because the results with the matches far apart have low relevance scores compared to the ones that have matches close together).

Interaction of Score and Quality

Each document has a quality value, which is set either at load time or with xdmp:document-set-quality. You can use the optional $QualityWeight parameter to cts:search to control how much impact document quality has on scores. The scores are then determined by the following formula:

Score = Score + (QualityWeight * Quality)

The default of QualityWeight is 1.0 and the default quality on a document is 0, so by default, documents without any quality set have no quality impact on score. Documents that do have quality set, however, will have impact on the scores by default (because the default QualityWeight is 1, effectively boosting the score by the document quality).

If you want quality to have a smaller impact on the score, set the QualityWeight between 0 and 1.0. If you want the quality to have no impact on the score, set the QualityWeight to 0. If you want the quality to have a larger impact on raising the score, set the QualityWeight to a number greater than 1.0. If you want the quality to have a negative effect on scores, set the QualityWeight to a negative number or set document quality to a negative number.

If you set document quality to a negative number and if you set QualityWeight to a negative number, it will boost the score with a positive number.
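
The following sketch sets a quality on one document and then searches with an explicit QualityWeight so you can observe the effect on the scores. It assumes the document /1.xml from the proximity boosting example earlier in this chapter still exists; the quality value of 10 and the QualityWeight of 2.0 are arbitrary.

```xquery
xquery version "1.0-ml";
(: assumes /1.xml exists, for example from the proximity examples :)
xdmp:document-set-quality("/1.xml", 10) ;

(: the fourth parameter to cts:search is the QualityWeight :)
for $x in cts:search(fn:doc(), cts:word-query("dog"), (), 2.0)
return
  element hit {
    attribute uri { xdmp:node-uri($x) },
    attribute quality { xdmp:document-get-quality(xdmp:node-uri($x)) },
    attribute score { cts:score($x) } }
```

Rerunning the search with a QualityWeight of 0 should show the scores without any contribution from quality, which makes the two runs easy to compare.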

Using cts:score, cts:confidence, and cts:fitness

You can get the score for a result node by calling cts:score on that node. The score is a number, where higher numbers indicate higher relevance for that particular result set.

Similarly, you can get the confidence by calling cts:confidence on a result node. The confidence is a number (of type xs:float) between 0.0 and 1.0. The confidence number does not include any quality settings that might be on the document. Confidence scores are calculated by first bounding the scores between 0 and 1.0, and then taking the square root of the bounded number.

As an alternative to cts:confidence, you can get the fitness by calling cts:fitness on a result node. The fitness is a number (of type xs:float) between 0.0 and 1.0. The fitness number does not include any quality settings that might be on the document, and it does not use document frequency in the calculation. Therefore, cts:fitness returns a number indicating how well the returned node satisfies the query issued, which is subtly different from relevance, because it does not take into account other documents in the database.
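
To compare the three measures side by side, you can call all of them on each result node, as in this sketch. The search term is an arbitrary example.

```xquery
xquery version "1.0-ml";
(: report score, confidence, and fitness for each match :)
for $x in cts:search(fn:doc(), cts:word-query("dog"))
return
  element hit { attribute uri { xdmp:node-uri($x) },
                attribute score { cts:score($x) },
                attribute confidence { cts:confidence($x) },
                attribute fitness { cts:fitness($x) } }
```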

Relevance Order in cts:search Versus Document Order in XPath

To understand the order in which an expression returns its results, there are two main rules to consider:

  • cts:search expressions always return in relevance order (the most relevant to the least relevant).
  • XPath expressions always return in document order.

A subtlety to note about these rules is that if a cts:search expression is followed by some XPath steps, it turns the expression into an XPath expression and the results are therefore returned in document order. For example, consider the following query:

cts:search(fn:doc(), "my search phrase")

This returns a relevance-ordered sequence of document nodes that contain the specified phrase. You can get the scores of each node by using cts:score. Things will change if you then add an XPath step to the expression as follows:

cts:search(fn:doc(), "my search phrase")//TITLE

This will now return a document-ordered sequence of TITLE elements. Also, in order to compute the answer to this query, MarkLogic Server must first perform the search, and then reorder the search results into document order to resolve the XPath expression. If you need to perform this type of query, it is usually more efficient (and often much more efficient) to use cts:contains in an XPath predicate as follows:

fn:doc()[cts:contains(., "my search phrase")]//TITLE

In most cases, this form of the query (written entirely as an XPath expression) will be much more efficient than the previous form (with the XPath step after the cts:search expression). There might be some cases, however, where it is less efficient, especially if the query is highly selective (does not match many fragments).

When you write queries as XPath expressions, MarkLogic Server does not compute scores, so if you need scores, you will need to use a cts:search expression. Also, if you need a query like the above examples but need the results in relevance order, then you can put the search in a FLWOR expression as follows:

for $x in cts:search(fn:doc(), "my search phrase")
return
$x//TITLE

This is more efficient than the cts:search with an XPath step following it, and returns relevance-ranked and scored results.

Sample cts:search Expressions

This section lists several cts:search expressions that include weight and/or quality parameters.

Magnify the Score Boost for Documents With Quality

The following search will make any documents that have a quality set (set either at load time or with xdmp:document-set-quality) give much higher scores than documents with no quality set.

cts:search(fn:doc(), cts:word-query("my phrase"), (), 3.0)

For any documents that have a quality set to a negative number less than -1.0, this search will have the effect of lowering the score drastically for matches on those documents.

Increase the Score for some Terms, Decrease for Others

The following search will boost the scores for documents that satisfy one query while decreasing the scores for documents that satisfy another query.

cts:search(fn:doc(), cts:and-query((
  cts:word-query("alfa", (), 2.0), cts:word-query("lada", (), 0.5)
  )) )

This search will boost the scores for documents that contain the word alfa while lowering the scores for documents that contain the word lada. For documents that contain both terms, the component of the score from the word alfa is boosted while the component of the score from the word lada is lowered.
