MarkLogic Server provides a powerful XML-enabled search capability through the cts:search XQuery built-in function. As part of the search capability, MarkLogic Server allows you to issue cts:search expressions that return results directly from the indexes, without performing the filtering necessary to ensure there are no false-positive results. This chapter describes the search process, including unfiltered searches, and includes the following sections:
When evaluating cts:search expressions (and also when resolving XPath expressions within XQuery code), MarkLogic Server performs a two-step process.
The purpose of index resolution is to narrow the set of candidate fragments to as small a set as possible, without missing any. In some circumstances, the index resolution step can yield a precisely correct set of candidate fragments, rendering the filtering step redundant. In other circumstances, index resolution can reduce the set of candidate fragments somewhat, but in the candidate fragment list there are still false-positive results (that is, candidate fragments that in fact contain no matching results). In still other circumstances, the candidate fragment list can contain fragments that contain more than one matching result.
To better understand false-positive results, imagine a database configuration which has not enabled fast case-sensitive indexes (fast case sensitive searches
on the database configuration page). This means that the full-text indexes only maintain direct lookups for words independent of their case. In this scenario, if you are searching for "Dog"
, the indexes can only tell you what fragments contain the word "dog"
, in any case-combination of text (for example, "dog"
, "DOG"
, "Dog"
, "doG"
, and so on). So when index resolution generates a candidate fragment list for "Dog"
, that list could include a fragment that has the word "dog"
but not the word "Dog"
. This is where filtering comes in--by loading that fragment and examining it prior to returning it as a match, MarkLogic Server is able to determine that it is not in fact a match for the specified query, and rule it out. The fragment is a false-positive result, and should not be returned to the query.
To understand how a candidate fragment can contain more than one match, consider a single-fragment document that contains multiple <author>
elements as follows:
<author>Bruce Smith</author> <author>Betty Smith</author> <author>Gordon Blair</author>
Now consider the following query:
cts:search(//author, "Smith")
During index resolution, this query generates the fragment for that document as a single candidate fragment. In fact, that single document should generate two results--one for each of Bruce and Betty Smith. During the filtering step, MarkLogic Server identifies that there is more than one element in this document that matches the search requirements, and returns both of the first two <author>
elements as results.
As you can see from these examples, the combination of index resolution and filtering combine to provide both performance and accuracy. The algorithm is designed to allow you to write complex queries, and have MarkLogic Server determine the most efficient path providing accurate results.
There are disadvantages to the algorithm, however. Sometimes, you might know better than the search engine, and through careful design of your XML and your fragmentation, you might know that filtering is simply unnecessary. In this case, filtering takes unneeded cycles. In another situation, if you want to jump deep into a result set (for example, retrieving the 1,000,000th result from a really large result set), the emphasis MarkLogic Server has on accuracy through filtering might provide an impractical constraint for your application, because filtering the first 999,999 results will take far too long. Furthermore, it might not matter to your application if, when you jump to the 1,000,000th result, you actually end up at approximately that result (even if in reality it is the 949,237th result).
Consequently, MarkLogic Server provides you with tools to influence the evaluation of cts:search expressions, indicating whether or not filtering is required.
An unfiltered search omits the filtering step, which validates whether each candidate fragment result actually meets the search criteria. Unfiltered searches, therefore, are guaranteed to be fast, while filtered searches are guaranteed to be accurate. By default, searches are filtered; you must specify the "unfiltered"
option to cts:search to return an unfiltered search. Unfiltered searches have the following characteristics:
The following are some useful guidelines for when to use unfiltered searches:
ABSTRACT
in MEDLINE citatation, where ABSTRACT
is below the fragment root, but there is only one ABSTRACT
node per fragment. If you specify below a fragment root and there are multiple nodes in the fragment, the search may miss results (it will only find results if they are in the first fragment).Finally, it is useful to understand that cts:contains implements the filtering step of the two-step search process:
unfiltered cts:search +cts:contains
=normal (filtered) cts:search
Breaking a cts:search operation into an unfiltered search and a cts:contains allows you to do the search so it is always fast, but then only do the false-positive result removal if you want or need to. This is true as long as the first parameter to cts:search is at a fragment root node. If it is below a root node, it is only true if you know that the first node is the only possible hit for the search (for example, if there is only one node, as in the ABSTRACT
example above).
There are many useful applications for unfiltered searches. Applications of unfiltered searches tend to have one or more of the following characteristics:
The last point describes the situation for fast pagination. Fast pagination is a way to get an approximate count of the total number of result hits and then provide efficient ways to jump deep to an arbitrary point in the result sequence. Such pagination is common in search engine-style results, where a particular result might return 1 million hits and the search interface returns them 10 at a time. There is usually some sort of counter that says something like displaying 1-10 of 1,000,000 results, and then there are links to go to the next page of results or to go to the tenth, twentieth, or any page of the results. Often it is not critical that going to the twentieth page actually gets you to the 200th hit; it is OK if there were some false-positive results, and when you click that link you actually get to the 176th result.
When you implement a fast pagination application, you will need to jump into a position in an unfiltered search. To maximize the efficiency of this search, you must immediately follow the unfiltered cts:search
expression with the position predicate, with no XPath steps in between. For example, to jump into the 1,000,001st hit of an unfiltered search for the phrase one two three, the search might look like the following:
cts:search(fn:doc(), "one two three", "unfiltered")[1000001 to 1000010]
This search will skip directly to the 1,000,001st unfiltered hit and return the 10 fragments specified in the position predicate; it will not need to fetch any other fragments.
The following code sample prints out the number of false-positive matches from a search.
xquery version "1.0-ml"; declare boundary-space preserve; declare namespace qm="http://marklogic.com/xdmp/query-meters"; let $trueCounter := 0 let $falseCounter := 0 let $searchTerms := "one! two three" let $x := for $result in cts:search(fn:doc(), $searchTerms, "unfiltered") return ( if ( cts:contains($result, $searchTerms) ) then ( xdmp:set($trueCounter, $trueCounter + 1) ) else ( xdmp:set($falseCounter, $falseCounter + 1) ) ) return <results> <resultTotal>{$trueCounter}</resultTotal> <false-positiveTotal>{$falseCounter}</false-positiveTotal> <elapsed-time>{xdmp:query-meters()/qm:elapsed-time/text()} </elapsed-time> </results>
Because the specified $searchTerms
contains punctuation in the middle of the phrase, any document that has the phrase one two three will prove to be a false-positive result. If you substitute in your query terms for the $searchTerms
variable, you can see if your own unfiltered search yields false-positive results.
The above code uses the xdmp:set function to keep track of the number of matches and the number of false-positive results. It runs the unfiltered search and then uses cts:contains to check if each result is actually a match. If it is a match, then increment the $trueCounter
variable, otherwise increment the $falseCounter
variable.