Search Developer's Guide (PDF)

Search Developer's Guide — Chapter 17

« Previous chapter
Next chapter »

Using fn:count vs. xdmp:estimate

This chapter descibes some of the differences between the fn:count and xdmp:estimate functions, and includes the following sections:

fn:count is Accurate, xdmp:estimate is Fast

The XQuery language provides general support for counting the number of items in a sequence through the use of the fn:count function. However, the general-purpose nature of fn:count makes it difficult to optimize. Sequences to be counted can include arbitrarily complex combinations of sequences stored in the database, constructed dynamically, filtered after retrieval or construction, etc. In most cases, MarkLogic Server must process the sequence in order to count it. This can have significant I/O requirements that would impact performance.

MarkLogic Server provides the xdmp:estimate XQuery built-in as an efficient way to approximate fn:count. Unlike fn:count, which frequently must process its answer by inspecting the data directly (hence the heavy I/O loads), xdmp:estimate computes its answer directly from indexes. In certain situations, the index-derived value will be identical to the value returned by fn:count. In others, the values differ to a varying degree depending on the specified sequence and the data. In instances where xdmp:estimate is not able to return a fast estimate, it will throw an error. Hence, you can depend on xdmp:estimate to be fast, just as you can depend on fn:count to be accurate.

Effectively, xdmp:estimate puts the decision to optimize counting through use of the indexes in the hands of the developer.

The xdmp:estimate Built-In Function

xdmp:estimate accepts searchable XPath expressions as its parameter and returns an approximation of the number of items in the sequence:

xdmp:estimate(/book)
xdmp:estimate(//titlepage[cts:contains(., "primer")])
xdmp:estimate(cts:search(//titlepage, cts:word-query("primer")))
xdmp:estimate(/object[.//id = "57483"])

xdmp:estimate does not always return the same value as fn:count. The fn:count function returns the exact number of items in the sequence that is provided as a parameter. In contrast, xdmp:estimate provides an answer based on the following rules:

  1. If the parameter passed to xdmp:estimate is a searchable XPath expression, xdmp:estimate returns the number of fragments that it will select from the database for post-filtering. This number is computed directly from the indexes at extremely high performance. It may, however, differ from the actual fn:count of the sequence specified if either (a) there are multiple matching items within a single fragment or (b) there are fragments provisionally selected by the indexes that do not actually contain a matching item.
  2. If the parameter passed to xdmp:estimate is not a searchable XPath expression (that is, it is not an XPath rooted at a doc, collection(), or input() function, or a / or // step), xdmp:estimate will throw an error.
xdmp:estimate is defined in this way to ensure a sharp contrast against the fn:count function. xdmp:estimate will always execute quickly. fn:count will always return the correct answer. Over time, as MarkLogic improves the server's underlying optimization capability, there will be an increasing number of scenarios in which fn:count is both correct and fast. But for the moment, we put the decision about which approach to take in the developer's hands.

Using cts:remainder to Estimate the Size of a Search

When you need to retrieve both search results and an estimate of the number of matching fragments as part of the same query statement, use the cts:remainder function. Running cts:remainder on a node or nodes returned by a search is more efficient that running xdmp:estimate on the sequence of nodes returned by cts:search. If you just need the estimate, but not the search results, then xdmp:estimate is more efficient.

cts:remainder returns the number of nodes remaining from a particular node of a search result set. When you run it on the first node, it returns the same result as xdmp:estimate on the search. cts:remainder also has the flexibility to return the estimated results of a search starting with any item in the search (for example, how many results remain after the 500th search item), and it does this in an efficient way.

Like xdmp:estimate, cts:remainder uses the indexes to find the approximate results based on unfiltered results. For an explanation of unfiltered results, see Using Unfiltered Searches for Fast Pagination in the Query Performance and Tuning Guide. For the syntax and examples of cts:remainder, see the MarkLogic XQuery and XSLT Function Reference.

When to Use xdmp:estimate

MarkLogic Server uses its indexes to approximate the identification of XML fragments that may contain constructs that matches the specified XPath. This set of fragments is then filtered to determine the exact nodes to return for further processing.

For searchable XPath expressions, xdmp:estimate returns the number of fragments selected in the first approximation step described above. Because this operation is carried out directly from indexes, the operation is virtually instantaneous. However, there are two scenarios in which this approximation will not match the results that would be returned by fn:count:

  1. If a fragment contains more than one matching item for the XPath specified, xdmp:estimate will undercount these items as a single item whereas fn:count would count them individually.
  2. In addition, it is possible to overcount. Index optimization sometimes must over-select in order to ensure that no matching item is missed. During general query processing, these over-selected fragments are discarded in the second-stage filtering process. But xdmp:estimate will count these fragments as matching items whereas fn:count would exclude them.
Consider the sample query outlined below. The first step in the optimization algorithm outlined above is illustrated by the xdmp:query-trace output shown after the query:

Query:

/MedlineCitationSet/MedlineCitation//Author[LastName="Smith"])

Query trace output:

2004-04-06 17:49:39 Info: eval line 5: Analyzing path: fn:doc()/child::MedlineCitationSet/child::MedlineCitation/
descendant::Author[child::LastName = "Smith"]
2004-04-06 17:49:39 Info: eval line 5: Step 1 is searchable: fn:doc()
2004-04-06 17:49:39 Info: eval line 4: Step 2 axis does not use indexes:child
2004-04-06 17:49:39 Info: eval line 4: Step 2 test is searchable: MedlineCitationSet
2004-04-06 17:49:39 Info: eval line 5: Step 2 is searchable: child::MedlineCitationSet
2004-04-06 17:49:39 Info: eval line 4: Step 3 axis does not use indexes:child
2004-04-06 17:49:39 Info: eval line 4: Step 3 test is searchable: MedlineCitation
2004-04-06 17:49:39 Info: eval line 5: Step 3 is searchable: child::MedlineCitation
2004-04-06 17:49:39 Info: eval line 5: Step 4 axis does not use indexes:descendant
2004-04-06 17:49:39 Info: eval line 5: Step 4 test is searchable: Author
2004-04-06 17:49:39 Info: eval line 5: Step 4 predicate 1 is searchable:
child::LastName = "Smith"
2004-04-06 17:49:39 Info: eval line 5: Step 4 is searchable: descendant::Author[child::LastName = "Smith"]
2004-04-06 17:49:39 Info: eval line 5: Path is searchable.
2004-04-06 17:49:39 Info: eval line 5: Gathering constraints.
2004-04-06 17:49:39 Info: eval line 4: Step 2 test contributed 1 constraint: MedlineCitationSet
2004-04-06 17:49:39 Info: eval line 4: Step 3 test contributed 2 constraints: MedlineCitation
2004-04-06 17:49:39 Info: eval line 5: Step 4 test contributed 1 constraint: Author
2004-04-06 17:49:39 Info: eval line 4: Comparison contributed hash value constraint: LastName = "Smith"
2004-04-06 17:49:39 Info: eval line 5: Step 4 predicate 1 contributed 1 constraint: child::LastName = "Smith"
2004-04-06 17:49:39 Info: eval line 5: Executing search.
2004-04-06 17:49:39 Info: eval line 5: Selected 263 fragments to filter

In this scenario, applying fn:count to the XPath provided would tell us that there are 271 authors with a last name of "Smith" in the database. Using xdmp:estimate yields an answer of 263. In this example, xdmp:estimate undercounted because there are fragments with multiple authors named "Smith" in the database, and xdmp:estimate only counts the number of fragments.

Understanding when these situations will occur with a given database and dataset requires an in-depth understanding of the optimizer. Given that the optimizer evolves with every release of the server, this is a daunting task.

The following three sets of guidelines will help you know when and how to use xdmp:estimate:

When Estimates Are Good Enough

In some situations, an estimate of the correct answer is good enough. Many search engines use this approach today, only estimating the total number of "hits" when displaying the first twenty results to the user. In scenarios in which the exact count is not important, it makes sense to use xdmp:estimate.

When XPaths Meet The Right Criteria

If you need to get the precise answer rather than just an approximation, there are some simple criteria to keep in mind if you want to use xdmp:estimate for its performance benefits:

  1. Counting nodes that are either fragment or document roots will always return the correct result.

    Examples:

    xdmp:estimate(/node-name) is equivalent to count(/node-name)xdmp:estimate(//MedlineCitation) is equivalent to count(//MedlineCitation)

    if MedlineCitation is a fragment-root. For example, this constraint is how the sample Medline application is configured in the sample code on http://support.marklogic.com.

  2. If a single fragment can contain more than one element that matches a predicate, you have the potential for undercounting. Assume that the sample data below resides in a single fragment:
    <authors>
      <author>
        <last-name>Smith</last-name>
        <first-name>Alison</first-name>
      </author>
      <author>
        <last-name>Smith</last-name>
        <first-name>James</first-name>
      </author>
      <author>
        <last-name>Peterson</last-name>
        <first-name>David</first-name>
      </author>
    </authors>

    In this case, an XPath which specifies fn:doc()//author[last-name = "Smith"] will undercount, counting only one item for the two matches in the above sample data.

  3. If the XPath contains multiple predicates, you have the potential of overcounting. Using the sample data above, an XPath which specifies fn:doc()//author[last-name = "Smith"][first-name = "David"] will not have any matches. However, since the above fragment contains author elements that satisfy the predicates [last-name = "Smith"] and [first-name = "David"] individually, it will be selected for post-filtering. In this case, xdmp:estimate will consider the above fragment a match and overcount.

When Empirical Tests Demonstrate Correctness

As a last step, you can use two techniques to understand the value that will be returned by xdmp:estimate:

  1. At development time, use xdmp:estimate and fn:count to count the same sequence and see if the results are different for datasets which exhibit all the structural variation you expect in your production dataset.
  2. Turn on xdmp:query-trace, evaluate the XPath sequence that you wish to use with xdmp:estimate, and inspect the query-trace output in the log file. This output will tell you how much of the XPath was searchable, how many fragments were selected (this is the answer that xdmp:estimate will provide), and how many ultimately matched (this is the answer that fn:count will provide).
« Previous chapter
Next chapter »
Powered by MarkLogic Server | Terms of Use | Privacy Policy