This chapter describes some of the differences between the fn:count and xdmp:estimate functions.
The XQuery language provides general support for counting the number of items in a sequence through the fn:count function. However, the general-purpose nature of fn:count makes it difficult to optimize. Sequences to be counted can include arbitrarily complex combinations of sequences stored in the database, constructed dynamically, or filtered after retrieval or construction. In most cases, MarkLogic Server must process the sequence in order to count it, which can require significant I/O and hurt performance.
MarkLogic Server provides the xdmp:estimate XQuery built-in as an efficient way to approximate fn:count. Unlike fn:count, which frequently must compute its answer by inspecting the data directly (hence the heavy I/O loads), xdmp:estimate computes its answer directly from indexes. In certain situations, the index-derived value will be identical to the value returned by fn:count. In others, the values differ to a varying degree depending on the specified sequence and the data. In instances where xdmp:estimate is not able to return a fast estimate, it will throw an error. Hence, you can depend on xdmp:estimate to be fast, just as you can depend on fn:count to be accurate.
Effectively, xdmp:estimate puts the decision to optimize counting through use of the indexes in the hands of the developer.
xdmp:estimate accepts searchable XPath expressions as its parameter and returns an approximation of the number of items in the sequence:
xdmp:estimate(/book)
xdmp:estimate(//titlepage[cts:contains(., "primer")])
xdmp:estimate(cts:search(//titlepage, cts:word-query("primer")))
xdmp:estimate(/object[.//id = "57483"])
xdmp:estimate does not always return the same value as fn:count. The fn:count function returns the exact number of items in the sequence that is provided as a parameter. In contrast, xdmp:estimate provides an answer based on the following rules:
If the parameter passed to xdmp:estimate is a searchable XPath expression, xdmp:estimate returns the number of fragments that the indexes select for filtering. This number can be larger than, smaller than, or equal to the value fn:count returns for the same expression.
If the parameter is not a searchable XPath expression (that is, it is not an XPath rooted at a doc, collection(), or input() function, or a / or // step), xdmp:estimate will throw an error.

xdmp:estimate is defined in this way to ensure a sharp contrast with the fn:count function. xdmp:estimate will always execute quickly. fn:count will always return the correct answer. Over time, as MarkLogic improves the server's underlying optimization capability, there will be an increasing number of scenarios in which fn:count is both correct and fast. But for the moment, we put the decision about which approach to take in the developer's hands.

When you need to retrieve both search results and an estimate of the number of matching fragments as part of the same query statement, use the cts:remainder function. Running cts:remainder on a node or nodes returned by a search is more efficient than running xdmp:estimate on the sequence of nodes returned by cts:search. If you just need the estimate, but not the search results, then xdmp:estimate is more efficient.
cts:remainder returns the number of nodes remaining from a particular node of a search result set. When you run it on the first node, it returns the same result as xdmp:estimate on the search. cts:remainder also has the flexibility to return the estimated results of a search starting with any item in the search (for example, how many results remain after the 500th search item), and it does this in an efficient way.
Like xdmp:estimate, cts:remainder uses the indexes to find the approximate results based on unfiltered results. For an explanation of unfiltered results, see Using Unfiltered Searches for Fast Pagination in the Query Performance and Tuning Guide. For the syntax and examples of cts:remainder, see the MarkLogic XQuery and XSLT Function Reference.
MarkLogic Server uses its indexes to identify, approximately, the XML fragments that may contain constructs that match the specified XPath. This set of fragments is then filtered to determine the exact nodes to return for further processing.
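For searchable XPath expressions, xdmp:estimate returns the number of fragments selected in the first approximation step described above. Because this operation is carried out directly from indexes, the operation is virtually instantaneous. However, there are two scenarios in which this approximation will not match the results that would be returned by fn:count: a fragment that contains more than one matching item is counted only once (an undercount), and a fragment that the indexes select but that filtering would reject is still counted (an overcount).

As an example of the first scenario, consider the following query and the xdmp:query-trace output shown after it:

/MedlineCitationSet/MedlineCitation//Author[LastName="Smith"]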
2004-04-06 17:49:39 Info: eval line 5: Analyzing path: fn:doc()/child::MedlineCitationSet/child::MedlineCitation/descendant::Author[child::LastName = "Smith"]
2004-04-06 17:49:39 Info: eval line 5: Step 1 is searchable: fn:doc()
2004-04-06 17:49:39 Info: eval line 4: Step 2 axis does not use indexes: child
2004-04-06 17:49:39 Info: eval line 4: Step 2 test is searchable: MedlineCitationSet
2004-04-06 17:49:39 Info: eval line 5: Step 2 is searchable: child::MedlineCitationSet
2004-04-06 17:49:39 Info: eval line 4: Step 3 axis does not use indexes: child
2004-04-06 17:49:39 Info: eval line 4: Step 3 test is searchable: MedlineCitation
2004-04-06 17:49:39 Info: eval line 5: Step 3 is searchable: child::MedlineCitation
2004-04-06 17:49:39 Info: eval line 5: Step 4 axis does not use indexes: descendant
2004-04-06 17:49:39 Info: eval line 5: Step 4 test is searchable: Author
2004-04-06 17:49:39 Info: eval line 5: Step 4 predicate 1 is searchable: child::LastName = "Smith"
2004-04-06 17:49:39 Info: eval line 5: Step 4 is searchable: descendant::Author[child::LastName = "Smith"]
2004-04-06 17:49:39 Info: eval line 5: Path is searchable.
2004-04-06 17:49:39 Info: eval line 5: Gathering constraints.
2004-04-06 17:49:39 Info: eval line 4: Step 2 test contributed 1 constraint: MedlineCitationSet
2004-04-06 17:49:39 Info: eval line 4: Step 3 test contributed 2 constraints: MedlineCitation
2004-04-06 17:49:39 Info: eval line 5: Step 4 test contributed 1 constraint: Author
2004-04-06 17:49:39 Info: eval line 4: Comparison contributed hash value constraint: LastName = "Smith"
2004-04-06 17:49:39 Info: eval line 5: Step 4 predicate 1 contributed 1 constraint: child::LastName = "Smith"
2004-04-06 17:49:39 Info: eval line 5: Executing search.
2004-04-06 17:49:39 Info: eval line 5: Selected 263 fragments to filter
In this scenario, applying fn:count to the XPath provided would tell us that there are 271 authors with a last name of "Smith" in the database. Using xdmp:estimate yields an answer of 263. In this example, xdmp:estimate undercounted because there are fragments with multiple authors named "Smith" in the database, and xdmp:estimate only counts the number of fragments.
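The difference between counting fragments and counting items can be sketched with a toy model. The Python below is only an illustration under stated assumptions: the `fragments` list, the `estimate` and `exact_count` helpers, and the sample names are all hypothetical, and nothing here reflects how MarkLogic Server actually stores or indexes data. It mirrors just the counting rule described above, where the index-derived answer is one per selected fragment.

```python
# Toy model only: each "fragment" is a list of author last names.
# This is not MarkLogic's storage format; it only mirrors the counting rule.
fragments = [
    ["Smith", "Jones"],         # one matching author
    ["Smith", "Smith", "Lee"],  # two matching authors in a single fragment
    ["Garcia"],                 # no match
    ["Smith"],                  # one matching author
]


def estimate(fragments, last_name):
    """Count fragments that contain at least one match (index-style answer)."""
    return sum(1 for authors in fragments if last_name in authors)


def exact_count(fragments, last_name):
    """Count every matching author, as filtering each fragment would."""
    return sum(authors.count(last_name) for authors in fragments)


print(estimate(fragments, "Smith"))     # 3
print(exact_count(fragments, "Smith"))  # 4
```

The gap between the two numbers comes entirely from the second fragment, which holds two matches but is counted once by the fragment-level estimate.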
Understanding when these situations will occur with a given database and dataset requires an in-depth understanding of the optimizer. Given that the optimizer evolves with every release of the server, this is a daunting task.
The following three sets of guidelines will help you know when and how to use xdmp:estimate:
In some situations, an estimate of the correct answer is good enough. Many search engines use this approach today, only estimating the total number of "hits" when displaying the first twenty results to the user. In scenarios in which the exact count is not important, it makes sense to use xdmp:estimate.
If you need to get the precise answer rather than just an approximation, there are some simple criteria to keep in mind if you want to use xdmp:estimate for its performance benefits:
xdmp:estimate(/node-name) is equivalent to count(/node-name).
xdmp:estimate(//MedlineCitation) is equivalent to count(//MedlineCitation) if MedlineCitation is a fragment-root. For example, this is how the sample Medline application is configured in the sample code on http://support.marklogic.com.
<authors>
  <author>
    <last-name>Smith</last-name>
    <first-name>Alison</first-name>
  </author>
  <author>
    <last-name>Smith</last-name>
    <first-name>James</first-name>
  </author>
  <author>
    <last-name>Peterson</last-name>
    <first-name>David</first-name>
  </author>
</authors>
In this case, an XPath which specifies fn:doc()//author[last-name = "Smith"] will undercount, counting only one item for the two matches in the above sample data.
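The sample fragment can be modeled in a few lines of Python to see this undercount, and also the overcount that arises when two predicates are satisfied by different author elements in the same fragment. This is a hedged sketch, not the server's algorithm: the `fragment` list and the `exact_count` and `fragment_estimate` helpers are hypothetical names, and the model assumes the index can confirm that each value occurs somewhere in the fragment but not that the values occur in the same author element.

```python
# Toy model of the sample <authors> fragment; all names are illustrative.
fragment = [
    {"last-name": "Smith", "first-name": "Alison"},
    {"last-name": "Smith", "first-name": "James"},
    {"last-name": "Peterson", "first-name": "David"},
]


def exact_count(authors, predicates):
    """Count author elements that satisfy every predicate at once."""
    return sum(
        1 for a in authors
        if all(a.get(k) == v for k, v in predicates.items())
    )


def fragment_estimate(authors, predicates):
    """Return 1 if each predicate is met somewhere in the fragment, else 0.

    Assumption: each predicate is checked independently per fragment,
    so matches may come from different author elements.
    """
    hit = all(
        any(a.get(k) == v for a in authors)
        for k, v in predicates.items()
    )
    return 1 if hit else 0


one = {"last-name": "Smith"}
two = {"last-name": "Smith", "first-name": "David"}

print(exact_count(fragment, one), fragment_estimate(fragment, one))  # 2 1
print(exact_count(fragment, two), fragment_estimate(fragment, two))  # 0 1
```

With one predicate the fragment-level answer is lower than the exact count; with two predicates it is higher, because no single author is both "Smith" and "David".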
In the above sample data, the XPath fn:doc()//author[last-name = "Smith"][first-name = "David"] will not have any matches. However, since the above fragment contains author elements that satisfy the predicates [last-name = "Smith"] and [first-name = "David"] individually, it will be selected for post-filtering. In this case, xdmp:estimate will consider the above fragment a match and overcount.

As a last step, you can use two techniques to understand the value that will be returned by xdmp:estimate: