Query Performance and Tuning Guide — Chapter 2

Fast Pagination and Unfiltered Searches

MarkLogic Server provides a powerful XML-enabled search capability through the cts:search XQuery built-in function. As part of the search capability, MarkLogic Server allows you to issue cts:search expressions that return results directly from the indexes, without performing the filtering necessary to ensure there are no false-positive results. This chapter describes the search process, including unfiltered searches, and includes the following sections:

Understanding the Search Process
Understanding Unfiltered Searches
Using Unfiltered Searches for Fast Pagination
Example: Determining the Number of False-Positive Matches

Understanding the Search Process

When evaluating cts:search expressions (and also when resolving XPath expressions within XQuery code), MarkLogic Server performs a two-step process.

A list of candidate fragment IDs is generated directly from the indexes, based on the index-resolvable criteria incorporated in the various parameters passed to cts:search. Fragment IDs are ordered according to relevance criteria. This step is called index resolution.
The candidate fragment IDs are used to load fragments from disk. Each fragment is then examined in order, using the complete criteria incorporated in the various parameters passed to cts:search, to determine if the fragment contains zero, one, or more than one result that matches the given cts:search expression. This step is called filtering.

The purpose of index resolution is to narrow the set of candidate fragments to as small a set as possible, without missing any. In some circumstances, the index resolution step can yield a precisely correct set of candidate fragments, rendering the filtering step redundant. In other circumstances, index resolution can reduce the set of candidate fragments somewhat, but in the candidate fragment list there are still false-positive results (that is, candidate fragments that in fact contain no matching results). In still other circumstances, the candidate fragment list can contain fragments that contain more than one matching result.

To better understand false-positive results, imagine a database configuration which has not enabled fast case-sensitive indexes (fast case sensitive searches on the database configuration page). This means that the full-text indexes only maintain direct lookups for words independent of their case. In this scenario, if you are searching for "Dog", the indexes can only tell you what fragments contain the word "dog", in any case-combination of text (for example, "dog", "DOG", "Dog", "doG", and so on). So when index resolution generates a candidate fragment list for "Dog", that list could include a fragment that has the word "dog" but not the word "Dog". This is where filtering comes in--by loading that fragment and examining it prior to returning it as a match, MarkLogic Server is able to determine that it is not in fact a match for the specified query, and rule it out. The fragment is a false-positive result, and should not be returned to the query.

To understand how a candidate fragment can contain more than one match, consider a single-fragment document that contains multiple <author> elements as follows:

<author>Bruce Smith</author>
<author>Betty Smith</author>
<author>Gordon Blair</author>

Now consider the following query:

cts:search(//author, "Smith")

During index resolution, this query generates the fragment for that document as a single candidate fragment. In fact, that single document should generate two results--one for each of Bruce and Betty Smith. During the filtering step, MarkLogic Server identifies that there is more than one element in this document that matches the search requirements, and returns both of the first two <author> elements as results.

As you can see from these examples, the combination of index resolution and filtering combine to provide both performance and accuracy. The algorithm is designed to allow you to write complex queries, and have MarkLogic Server determine the most efficient path providing accurate results.

There are disadvantages to the algorithm, however. Sometimes, you might know better than the search engine, and through careful design of your XML and your fragmentation, you might know that filtering is simply unnecessary. In this case, filtering takes unneeded cycles. In another situation, if you want to jump deep into a result set (for example, retrieving the 1,000,000th result from a really large result set), the emphasis MarkLogic Server has on accuracy through filtering might provide an impractical constraint for your application, because filtering the first 999,999 results will take far too long. Furthermore, it might not matter to your application if, when you jump to the 1,000,000th result, you actually end up at approximately that result (even if in reality it is the 949,237th result).

Consequently, MarkLogic Server provides you with tools to influence the evaluation of cts:search expressions, indicating whether or not filtering is required.

Understanding Unfiltered Searches

An unfiltered search omits the filtering step, which validates whether each candidate fragment result actually meets the search criteria. Unfiltered searches, therefore, are guaranteed to be fast, while filtered searches are guaranteed to be accurate. By default, searches are filtered; you must specify the "unfiltered" option to cts:search to return an unfiltered search. Unfiltered searches have the following characteristics:

They determine the results directly from the indexes, without filtering for validation. This makes unfiltered results most comparable to traditional search-engine style results.
They include false-positive results. False-positive results can originate from a number of situations, including phrase searches containing 3 or more words, certain wildcard searches, punctuation-sensitive, diacritic-sensitive, and/or case-sensitive searches.
They will be significantly affected by fragmentation policy.

The following are some useful guidelines for when to use unfiltered searches:

You should only perform unfiltered searches on top-level nodes or on fragment roots, otherwise you might get unexpected answers. This is because, for queries below the fragment level, there is no guarantee that a particular unfiltered search even matches the query (that is, there is a match somewhere in the fragment, but not necessarily a match in the node you are searching).
If you choose to specify an XPath other than a top-level node or a fragment root, your XPath expression will give correct results if there is only one possible node to match in each fragment (or if the only possible match is in the first node specified). This is because unfiltered searches stop after encountering the first node per fragment that matches the specified XPath expression. If you are sure that the node you specify only has one instance per fragment, then it will not miss any results (although it might get false-positive results). An example of this is ABSTRACT in MEDLINE citation, where ABSTRACT is below the fragment root, but there is only one ABSTRACT node per fragment. If you specify below a fragment root and there are multiple nodes in the fragment, the search may miss results (it will only find results if they are in the first fragment).

Finally, it is useful to understand that cts:contains implements the filtering step of the two-step search process:

unfiltered cts:search + cts:contains = normal (filtered) cts:search

Breaking a cts:search operation into an unfiltered search and a cts:contains allows you to do the search so it is always fast, but then only do the false-positive result removal if you want or need to. This is true as long as the first parameter to cts:search is at a fragment root node. If it is below a root node, it is only true if you know that the first node is the only possible hit for the search (for example, if there is only one node, as in the ABSTRACT example above).

Using Unfiltered Searches for Fast Pagination

There are many useful applications for unfiltered searches. Applications of unfiltered searches tend to have one or more of the following characteristics:

Your content and search terms are such that you know the unfiltered searches are also accurate (for example, the searches are all performed at document or fragment roots, they are single-term queries, and are not wildcard, punctuation-sensitive, diacritic-sensitive, and/or capitalization-sensitive searches).
You do not mind if there are some false-positive results because the results are an estimate (that is, they need to be fast, but are not required to be exact).
Your searches return a large number of results and you want efficient ways to jump to a particular portion of those results.

The last point describes the situation for fast pagination. Fast pagination is a way to get an approximate count of the total number of result hits and then provide efficient ways to jump deep to an arbitrary point in the result sequence. Such pagination is common in search engine-style results, where a particular result might return 1 million hits and the search interface returns them 10 at a time. There is usually some sort of counter that says something like displaying 1-10 of 1,000,000 results, and then there are links to go to the next page of results or to go to the tenth, twentieth, or any page of the results. Often it is not critical that going to the twentieth page actually gets you to the 200th hit; it is OK if there were some false-positive results, and when you click that link you actually get to the 176th result.

When you implement a fast pagination application, you will need to jump into a position in an unfiltered search. To maximize the efficiency of this search, you must immediately follow the unfiltered cts:search expression with the position predicate, with no XPath steps in between. For example, to jump into the 1,000,001st hit of an unfiltered search for the phrase one two three, the search might look like the following:

cts:search(fn:doc(), "one two three", "unfiltered")[1000001 to 1000010]

This search will skip directly to the 1,000,001st unfiltered hit and return the 10 fragments specified in the position predicate; it will not need to fetch any other fragments.

Example: Determining the Number of False-Positive Matches

The following code sample prints out the number of false-positive matches from a search.

xquery version "1.0-ml";
declare boundary-space preserve;
declare namespace qm="http://marklogic.com/xdmp/query-meters";

let $trueCounter := 0
let $falseCounter := 0
let $searchTerms := "one! two three"
let $x := 
  for $result in cts:search(fn:doc(), $searchTerms, "unfiltered")
  return
  (
  if ( cts:contains($result, $searchTerms) )
  then ( xdmp:set($trueCounter, $trueCounter + 1) )
  else ( xdmp:set($falseCounter, $falseCounter + 1) )
  )
return
<results>
  <resultTotal>{$trueCounter}</resultTotal>
  <false-positiveTotal>{$falseCounter}</false-positiveTotal>
  <elapsed-time>{xdmp:query-meters()/qm:elapsed-time/text()}
  </elapsed-time>
</results>

Because the specified $searchTerms contains punctuation in the middle of the phrase, any document that has the phrase one two three will prove to be a false-positive result. If you substitute in your query terms for the $searchTerms variable, you can see if your own unfiltered search yields false-positive results.

The above code uses the xdmp:set function to keep track of the number of matches and the number of false-positive results. It runs the unfiltered search and then uses cts:contains to check if each result is actually a match. If it is a match, then increment the $trueCounter variable, otherwise increment the $falseCounter variable.