Search Developer's Guide — Chapter 4

Composing cts:query Expressions

Searches in MarkLogic Server use expressions that have a cts:query type. This chapter describes how to create various types of cts:query expressions and how you can register some complex expressions to improve performance of future queries that use the registered cts:query expressions.

MarkLogic Server includes many built-in XQuery functions for composing cts:query expressions. The signatures and descriptions of these APIs are documented in the MarkLogic XQuery and XSLT Function Reference.

This chapter includes the following sections:

Understanding cts:query

The second parameter of cts:search is of type cts:query. The contents of the cts:query expression determine the conditions under which a search returns a document or node. This section describes cts:query and includes the following parts:

cts:query Hierarchy

The cts:query type forms a hierarchy, allowing you to construct complex cts:query expressions by combining multiple expressions. The hierarchy includes composable and leaf-level (non-composable) cts:query constructors. A composable constructor is one that is used to combine multiple cts:query constructors. A leaf-level constructor is one that cannot combine other cts:query constructors (although it can itself be combined using a composable constructor). The following diagram shows the leaf-level cts:query constructors, which are not composable, and the composable cts:query constructors, which you can use to combine both leaf-level and other composable cts:query constructors. For more details on combining cts:query constructors, see the remainder of this chapter.

Use cts:word-query to Narrow the Search

The core search cts:query API is cts:word-query. The cts:word-query function returns true for words or phrases that match its $text parameter, thus narrowing the search to fragments containing terms that match the query. If needed, you can use other cts:query APIs to combine a cts:word-query expression into a more complex expression. Similarly, you can use the other leaf-level cts:query constructors to narrow the results of a search.
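As a minimal sketch of the simplest case, the following search passes a single cts:word-query to cts:search. The term "castle" and the "case-insensitive" option are illustrative choices only, and the example assumes that the database contains documents with that word.

```xquery
xquery version "1.0-ml";
(: Return documents whose fragments contain the word "castle".
   The term and the option are illustrative assumptions. :)
cts:search(fn:doc(),
  cts:word-query("castle", "case-insensitive"))
```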

Understanding cts:element-query

The cts:element-query function searches through a specified element and all of its children. It is used to narrow the field of search to the specified element hierarchy, exploiting the XML structure in the data. Also, it is composable with other cts:element-query functions, allowing you to specify complex hierarchical conditions in the cts:query expressions.

For example, the following search against a Shakespeare database returns the title of any play that has SCENE elements that have SPEECH elements containing both the words 'room' and 'castle':

for $x in cts:search(fn:doc(), 
   cts:element-query(xs:QName("SCENE"), 
       cts:element-query(xs:QName("SPEECH"), 
           cts:and-query(("room", "castle")) ) ) ) 
return
($x//TITLE)[1]

This query returns the first TITLE element of the play. The TITLE element is used for both play and scene titles, and the first one in a play is the title of the play.

If you use cts:element-query and have both the word positions and element word positions indexes enabled in the Admin Interface, many queries that have multiple term queries (for example, "the long sly fox") run faster, because the indexes eliminate some false-positive results.

Understanding cts:element-word-query

While cts:element-query searches through an element and all of its children, cts:element-word-query searches only the immediate text node children of the specified element. For example, consider the following XML structure:

<root>
  <a>hello
    <b>goodbye</b>
  </a>
</root>

The following query returns false, because "goodbye" is not an immediate text node of the element named a:

cts:element-word-query(xs:QName("a"), "goodbye")
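One way to try this difference out is with cts:contains against an in-memory copy of the structure. The following is only a sketch. Based on the description above, the first test should be false, while the second and third should be true, because "hello" is an immediate text node child of a and cts:element-query also searches the descendants of a.

```xquery
xquery version "1.0-ml";
let $x :=
  <root>
    <a>hello
      <b>goodbye</b>
    </a>
  </root>
return (
  (: "goodbye" is not an immediate text node child of a :)
  cts:contains($x, cts:element-word-query(xs:QName("a"), "goodbye")),
  (: "hello" is an immediate text node child of a :)
  cts:contains($x, cts:element-word-query(xs:QName("a"), "hello")),
  (: cts:element-query also searches the children of a :)
  cts:contains($x,
    cts:element-query(xs:QName("a"), cts:word-query("goodbye")))
)
```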

Understanding Field Word and Value Query Constructors

The cts:field-word-query and cts:field-value-query constructors search in fields for either words or values. A field value is defined as all of the text within a field, with a single space between text that comes from different elements. For example, consider the following XML structure:

<name>
  <first>Raymond</first>
  <middle>Clevie</middle>
  <last>Carver</last>
</name>

If you want to normalize names in the form firstname lastname, then you can create a field on this structure. The field might include the name element and exclude the middle element. The value of this instance of the field would then be Raymond Carver, with a space between the text from the two different element values from first and last. If your document contained other name elements with the same structure, their values would be derived similarly. If the field is named my-field, then a cts:field-value-query("my-field", "Raymond Carver") returns true for documents containing this XML. Similarly, a cts:field-word-query("my-field", "Raymond Carver") returns true.
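The following sketch shows how these constructors might be used in a search. It assumes that a field named my-field has already been configured on the database as described above. Without that field, the queries cannot be evaluated.

```xquery
xquery version "1.0-ml";
(: Assumes a field named my-field is configured as described above. :)
(
  (: match the entire value of the field :)
  cts:search(fn:doc(),
    cts:field-value-query("my-field", "Raymond Carver")),
  (: match a single word anywhere in the field :)
  cts:search(fn:doc(),
    cts:field-word-query("my-field", "Carver"))
)
```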

For more information about fields, see Fields Database Settings in the Administrator's Guide. For information on lexicons on fields, see Field Value Lexicons.

Understanding the Range Query Constructors

The cts:element-range-query, cts:element-attribute-range-query, cts:path-range-query, and cts:field-range-query constructors allow you to specify constraints on a value in a cts:query expression. The range query constructors require a range index on the specified element or attribute. For details on range queries, see Using Range Queries in cts:query Expressions.
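As an illustration only, the following sketch constrains a search to a range of values. The element name price and its xs:decimal range index are hypothetical; the query works only if a matching range index is configured on the database.

```xquery
xquery version "1.0-ml";
(: Assumes a hypothetical element named price with an element range
   index of type xs:decimal configured on it. :)
cts:search(fn:doc(),
  cts:and-query((
    cts:element-range-query(xs:QName("price"), ">=", xs:decimal(10)),
    cts:element-range-query(xs:QName("price"), "<", xs:decimal(50))
  )))
```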

Understanding the Reverse Query Constructor

The cts:reverse-query constructor allows you to match queries stored in a database to nodes that would match those queries. Reverse queries are used as the basis for alert applications. For details, see Creating Alerting Applications.

Understanding the Geospatial Query Constructors

The geospatial query constructors are used to constrain cts:query expressions on geospatial data. Geospatial searches are used with documents that have been marked up with latitude and longitude data, and can be used to answer queries like 'show me all of the documents that mention places within 100 miles of New York City.' For details on geospatial searches, see Geospatial Search Applications.
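A query like the one quoted above might be sketched as follows. The element name location, the point format of its content, and the coordinates used for New York City are all assumptions made for this example; adapt them to the markup in your own documents.

```xquery
xquery version "1.0-ml";
(: Assumes a hypothetical element named location whose text holds
   a point as "latitude,longitude", for example
   <location>40.73,-73.99</location>. :)
let $nyc := cts:point(40.7128, -74.0060)
return
  cts:search(fn:doc(),
    cts:element-geospatial-query(
      xs:QName("location"),
      (: circle with a radius of 100 (miles by default) around $nyc :)
      cts:circle(100, $nyc)))
```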

Specifying the Language in a cts:query

All leaf-level cts:query constructors are language-aware; you can either explicitly specify a language value as an option, or it will default to the database default language. The language option specifies the language in which the query is tokenized and, for stemmed searches, the language of the content to be searched.

To specify the language option in a cts:query, use the lang=language_code option, where language_code is the two- or three-character ISO 639-1 or ISO 639-2 language code (http://www.loc.gov/standards/iso639-2/php/code_list.php). For example, the following query:

let $x := 
<root>
 <el xml:lang="en">hello</el>
 <el xml:lang="fr">hello</el>
</root>
return
$x//el[cts:contains(., 
         cts:word-query("hello", ("stemmed", "lang=fr")))]

returns only the French-language node:

<el xml:lang="fr">hello</el>

Depending on the language of the cts:query and on the language of the content, a string will tokenize differently, which will affect the search results. For details on how languages and the xml:lang attribute affect tokenization and searches, see Language Support in MarkLogic Server.

Combining multiple cts:query Expressions

Because cts:query expressions are composable, you can combine multiple expressions to form a single expression. There is no limit to how complex you can make a cts:query expression. Any API that has a return type of cts:* (for example, cts:query, cts:and-query, and so on) can be composed with another cts:query expression to form another expression. This section has the following parts:

Using cts:and-query and cts:or-query

You can construct arbitrarily complex boolean logic by combining cts:and-query and cts:or-query constructors in a single cts:query expression.

For example, the following search with a relatively simple nested cts:query expression will return all fragments that contain either the word alfa or the word maserati, and also contain either the word saab or the word volvo.

cts:search(fn:doc(),
  cts:and-query( ( cts:or-query(("alfa", "maserati")), 
                   cts:or-query(("saab", "volvo") )
  ) )
)

Additionally, you can use cts:and-not-query and cts:not-query to add negation to your boolean logic.
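For example, the following sketch extends the search above with negation. It is intended to return fragments that contain either alfa or maserati, but that do not contain saab. The choice of terms is illustrative.

```xquery
xquery version "1.0-ml";
(: First parameter: the positive query.
   Second parameter: the negative query. :)
cts:search(fn:doc(),
  cts:and-not-query(
    cts:or-query(("alfa", "maserati")),
    cts:word-query("saab")))
```

A similar condition can be written by placing cts:not-query(cts:word-query("saab")) inside a cts:and-query alongside the cts:or-query.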

Proximity Queries using cts:near-query

You can add tests for proximity to a cts:query expression using cts:near-query. Proximity queries use the word positions index in the database and, if you are using cts:element-query, the element word positions index. Proximity queries will still work without these indexes, but the indexes will speed performance of queries that use cts:near-query.

Proximity queries return true if the query matches occur within the specified distance from each other. For more details, see the MarkLogic XQuery and XSLT Function Reference for cts:near-query.
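The following is a minimal sketch of a proximity query. The two terms and the distance of 10 are arbitrary values chosen for illustration.

```xquery
xquery version "1.0-ml";
(: Matches where "room" and "castle" occur within a distance
   of 10 words of each other. :)
cts:search(fn:doc(),
  cts:near-query(
    (cts:word-query("room"), cts:word-query("castle")),
    10))
```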

Using Bounded cts:query Expressions

The cts:document-query, cts:directory-query, and cts:collection-query constructors allow you to bound a cts:query expression to one or more documents, a directory, or one or more collections.

These bounding constructors allow you to narrow a set of search results as part of the second parameter to cts:search. Bounding the query in the cts:query expression is much more efficient than filtering results in a where clause, and is often more convenient than modifying the XPath in the first cts:search parameter. To combine a bounded cts:query constructor with another constructor, use a cts:and-query or a cts:or-query constructor.

For example, the following constrains a search to a particular directory, returning the URI of the document(s) that match the cts:query.

for $x in cts:search(fn:doc(), 
   cts:and-query((
     cts:directory-query("/shakespeare/plays/", "infinity"), 
         "all's well that"))
)
return xdmp:node-uri($x)

This query returns the URIs of all documents under the specified directory that satisfy the query "all's well that".

In this query, the query "all's well that" is equivalent to a cts:word-query("all's well that").
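The same pattern can be sketched with cts:collection-query and cts:document-query to bound a search to collections or to specific documents. The collection name and document URIs below are hypothetical.

```xquery
xquery version "1.0-ml";
(: The collection name and document URIs are hypothetical. :)
(
  (: bound the search to one collection :)
  cts:search(fn:doc(),
    cts:and-query((
      cts:collection-query("shakespeare"),
      cts:word-query("castle")))),
  (: bound the search to two specific documents :)
  cts:search(fn:doc(),
    cts:and-query((
      cts:document-query(("/shakespeare/plays/hamlet.xml",
                          "/shakespeare/plays/macbeth.xml")),
      cts:word-query("castle"))))
)
```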

Matching Nothing and Matching Everything

An empty cts:word-query will always match no fragments, and an empty cts:and-query will always match all fragments. Therefore the following are true:

cts:search(fn:doc(), cts:word-query("") )
=> returns the empty sequence
cts:search(fn:doc(), "" )
=> returns the empty sequence
cts:search(fn:doc(), cts:and-query( () ) )
=> returns every fragment in the database

One use for an empty cts:word-query is when you have a search box in which an end user enters search terms. If the user enters nothing and hits the submit button, then the corresponding cts:search will return no hits.

An empty cts:and-query is sometimes useful as a placeholder: when your code must supply a cts:query but you do not want that query to constrain the results, the empty cts:and-query matches everything.
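As a sketch of this idea, the hypothetical helper function below (local:text-query is not part of MarkLogic Server) returns an empty cts:and-query when the user enters no terms, so that only the other constraints in the search apply. The directory is the one used in the earlier example.

```xquery
xquery version "1.0-ml";
(: Hypothetical helper: build the text part of a query from a search box. :)
declare function local:text-query($terms as xs:string?) as cts:query {
  if (fn:normalize-space($terms) eq "")
  then cts:and-query(())          (: no terms: match everything :)
  else cts:word-query($terms)
};

(: With no terms entered, only the directory constraint applies. :)
cts:search(fn:doc(),
  cts:and-query((
    cts:directory-query("/shakespeare/plays/", "infinity"),
    local:text-query("") )))
```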

Joining Documents and Properties with cts:properties-query or cts:document-fragment-query

You can use a cts:properties-query to match content in a properties document. If you are searching over a document, then a cts:properties-query will search in the properties document at the URI of the document. The cts:properties-query joins the properties document with its corresponding document. The cts:properties-query takes a cts:query as a parameter, and that query is used to match against the properties document. A cts:properties-query is composable, so you can combine it with other cts:query constructors to create arbitrarily complex queries.

Using a cts:properties-query in a cts:search, you can easily create a query that returns results that join content in a document with content in the corresponding properties document. For example, consider a document that represents a chapter in a book, and the document has properties containing the publisher of the book. You can then write a search that returns documents that match a cts:query where the document has a specific publisher, as in the following example:

cts:search(collection(), cts:and-query((
  cts:properties-query(
    cts:element-value-query(xs:QName("publisher"), "My Press") ),
  cts:word-query("a small good thing") )) )

This query returns all documents with the phrase a small good thing and that have a value of My Press in the publisher element in their corresponding properties document.

Similarly, you can use cts:document-fragment-query to join documents against properties when searching over properties.

Registering cts:query Expressions to Speed Search Performance

If you use the same complex cts:query expressions repeatedly, and if you are using them as an unfiltered cts:query constructor, you can register the cts:query expressions for later use. Registering a cts:query expression stores a pre-evaluated version of the expression, making it faster for subsequent queries to use the same expression. Unfiltered constructors return results directly from the indexes and return all candidate fragments for a search, but do not perform post-filtering to validate that each fragment perfectly meets the search criteria. For details on unfiltered searches, see 'Using Unfiltered Searches for Fast Pagination' in the Query Performance and Tuning Guide.

This section describes registered queries and provides some examples of how to use them. It includes the following topics:

Registered Query APIs

To register and reuse unfiltered searches for cts:query expressions, use the cts:register, cts:deregister, and cts:registered-query XQuery APIs.

For the syntax of these functions, see the MarkLogic XQuery and XSLT Function Reference.

Must Be Used Unfiltered

You can only use registered queries on unfiltered constructors; using a registered query as a filtered constructor throws the XDMP-REGFLT exception. To specify an unfiltered constructor, use the "unfiltered" option to cts:registered-query. For details about unfiltered searches, see 'Using Unfiltered Searches for Fast Pagination' in the Query Performance and Tuning Guide.

Registration Does Not Survive System Restart

Registered queries are only stored in the memory cache, and if the cache grows too big, some registered queries might be aged out of the cache. Also, if MarkLogic Server stops or restarts, any queries that were registered are lost and must be re-registered.

If you attempt to call cts:registered-query in a cts:search and the query is not currently registered, it throws an XDMP-UNREGISTERED exception. Because registered queries are not guaranteed to be registered every time they are used, it is good practice to use a try/catch around calls to cts:registered-query, and to re-register the query in the catch if it throws an XDMP-UNREGISTERED exception.

For example, the following sample code shows a cts:registered-query call used with a try/catch expression in XQuery:

(: wrap the registered query in a try/catch :)
try {
  xdmp:estimate(cts:search(fn:doc(), 
    cts:registered-query(995175721241192518, "unfiltered")))
}
catch ($e) {
  let $registered := 'cts:register(
      cts:word-query("hello*world", "wildcarded"))'
  return
    if ( fn:contains($e/*:code/text(), "XDMP-UNREGISTERED") )
    then ( "retry this query with the following registered query ID: ",
           xdmp:eval($registered) )
    else ( $e ) 
}

This code is somewhat simplified: it catches the XDMP-UNREGISTERED exception and simply reports what the new registered query ID is. In an application that uses registered queries, you probably would want to re-run the query with the new registered ID. Also, this example performs the try/catch in XQuery. If you are using XCC to issue queries against MarkLogic Server, you can instead perform the try/catch in the middleware Java or .NET layer.
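A variation that re-runs the query with the new ID might look like the following sketch. It is not a definitive implementation: it assumes that calling cts:register directly in the catch block (rather than through xdmp:eval) is acceptable for your application, and it reuses the placeholder ID from the example above.

```xquery
xquery version "1.0-ml";
(: The ID below is the placeholder value from the example above. :)
let $query := cts:word-query("hello*world", "wildcarded")
return
  try {
    xdmp:estimate(cts:search(fn:doc(),
      cts:registered-query(995175721241192518, "unfiltered")))
  }
  catch ($e) {
    if ( fn:contains($e/*:code/text(), "XDMP-UNREGISTERED") )
    then
      (: re-register, then re-run with the ID that cts:register returns :)
      let $id := cts:register($query)
      return xdmp:estimate(cts:search(fn:doc(),
        cts:registered-query($id, "unfiltered")))
    else xdmp:rethrow()   (: pass any other error on to the caller :)
  }
```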

Storing Registered Query IDs

When you register a cts:query expression, the cts:register function returns an integer, which is the ID for the registered query. After the cts:register call returns, there is no way to query the system to find the registered query IDs. Therefore, you might need to store the IDs somewhere. You can either store them in the middleware layer (if you are using XCC to issue queries against MarkLogic Server) or you can store them in a document in MarkLogic Server.

The registered query ID is generated based on a hash of the actual query, so registering the same query multiple times results in the same ID. The registered query ID is valid for all queries against the database across the entire cluster.
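A quick way to observe this behavior is to register the same query twice and compare the returned IDs, as in the following sketch. Based on the description above, the comparison should return true.

```xquery
xquery version "1.0-ml";
let $query := cts:word-query("hello*world", "wildcarded")
let $first := cts:register($query)
let $second := cts:register($query)
(: the same query is expected to produce the same registered query ID :)
return $first eq $second
```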

Registered Queries and Relevance Calculations

Searches that use registered queries will generate results having different scores from the equivalent searches using non-registered queries. This is because registered queries are treated as a single term in the relevance calculation. For details on relevance calculations, see Relevance Scores: Understanding and Customizing.

Example: Registering and Using a cts:query Expression

To run a registered query, you first register the query and then run the registered query, specifying it by ID. This section describes some example steps for registering a query and then running the registered query.

  1. First register the cts:query expression you want to run, as in the following example:
    cts:register(cts:word-query("hello*world", "wildcarded"))
  2. The first step returns an integer. Keep track of the integer value (for example, store it in a document).
  3. Use the integer value to run a search with the registered query (with the "unfiltered" option) as follows:
    cts:search(fn:doc(), 
              cts:registered-query(987654321012345678, "unfiltered") ) 
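The steps above can be sketched as two statements, separated by a semicolon so that each runs as its own transaction. The document URI and the registered-query and id element names are hypothetical; any storage layout that lets you retrieve the integer later would work.

```xquery
xquery version "1.0-ml";
(: Statement 1: register the query and store the returned ID.
   The URI and element names are hypothetical. :)
xdmp:document-insert("/registered-queries/hello-world.xml",
  <registered-query>
    <id>{cts:register(cts:word-query("hello*world", "wildcarded"))}</id>
  </registered-query>);

xquery version "1.0-ml";
(: Statement 2: read the stored ID and run the registered query. :)
let $id := xs:unsignedLong(
  fn:doc("/registered-queries/hello-world.xml")/registered-query/id)
return
  cts:search(fn:doc(), cts:registered-query($id, "unfiltered"))
```

Because a stored ID can outlive the registration itself, an application would typically still wrap the second statement in a try/catch that handles XDMP-UNREGISTERED.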

Adding Relevance Information to cts:query Expressions

The leaf-level cts:query APIs (cts:word-query, cts:element-word-query, and so on) have a weight parameter, which allows you to add a multiplication factor to the scores produced by matches from a query. You can use this to increase or decrease the weight factor for a particular query. For details about score, weight, and relevance calculations, see Relevance Scores: Understanding and Customizing.
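For instance, the following sketch gives matches on one term more influence on the score than matches on another. The terms and the weights 2.0 and 0.5 are arbitrary illustrative values.

```xquery
xquery version "1.0-ml";
(: The third parameter of cts:word-query is the weight.
   Matches on "castle" contribute more to the score than
   matches on "room". :)
cts:search(fn:doc(),
  cts:or-query((
    cts:word-query("castle", (), 2.0),
    cts:word-query("room", (), 0.5)
  )))
```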

XML Serializations of cts:query Constructors

You can create an XML serialization of a cts:query. The XML serialization is used by alerting applications that use a cts:reverse-query constructor and is also useful for performing various programmatic tasks on a cts:query. Alerting applications (see Creating Alerting Applications) find queries that would match nodes, and then perform some action for the query matches. This section describes the serialized XML and includes the following parts:

Serializing a cts:query to XML

A serialized cts:query has XML that conforms to the <marklogic-dir>/Config/cts.xsd schema, which is in the http://marklogic.com/cts namespace, which is bound to the cts prefix. You can either construct the XML directly or, if you use any cts:query expression within the context of an element, MarkLogic Server will automatically serialize that cts:query to XML. Consider the following example:

<some-element>{cts:word-query("hello world")}</some-element>

When you run the above expression, it serializes to the following XML:

<some-element>
  <cts:word-query xmlns:cts="http://marklogic.com/cts">
    <cts:text xml:lang="en">hello world</cts:text>
  </cts:word-query>
</some-element>

If you are using an alerting application, you might choose to store this XML in the database so you can match searches that include cts:reverse-query constructors. For details on alerts, see Creating Alerting Applications.

Add Arbitrary Annotations With cts:annotation

You can annotate your cts:query XML with cts:annotation elements. A cts:annotation element can be a child of any element in the cts:query XML, and it can contain any valid XML content (for example, a single text node, a single element, multiple elements, complex elements, and so on). MarkLogic Server ignores these annotations when processing the query XML, but such annotations are often useful to the application. For example, you can store information about where the query came from, information about parts of the query to use or not in certain parts of the application, and so on. The following is some sample XML with cts:annotation elements:

<cts:and-query xmlns:cts="http://marklogic.com/cts">
  <cts:directory-query>
    <cts:annotation>private</cts:annotation>
    <cts:uri>/myprivate-dir/</cts:uri>
  </cts:directory-query>
  <cts:and-query>
    <cts:word-query><cts:text>hello</cts:text></cts:word-query>
    <cts:word-query><cts:text>world</cts:text></cts:word-query>
  </cts:and-query>
  <cts:annotation>
    <useful>something useful to the application here</useful>
  </cts:annotation>
</cts:and-query>

For another example that uses cts:annotation to store the original query string in a function that generates a cts:query from a string, see the last part of the example in Example: Creating a cts:query Parser.
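An application might read an annotation back with ordinary XPath and still pass the same XML to the cts:query function, as in the following sketch. The source element and its content are hypothetical application data.

```xquery
xquery version "1.0-ml";
let $xml :=
  <cts:and-query xmlns:cts="http://marklogic.com/cts">
    <cts:word-query><cts:text>hello</cts:text></cts:word-query>
    <cts:word-query><cts:text>world</cts:text></cts:word-query>
    <cts:annotation>
      <source>search-box</source>
    </cts:annotation>
  </cts:and-query>
return (
  (: the application reads its own annotation :)
  fn:string($xml/cts:annotation/source),
  (: the annotation is ignored when the query is constructed :)
  cts:query($xml)
)
```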

Function to Construct a cts:query From XML

You can turn an XML serialization of a cts:query back into an un-serialized cts:query with the cts:query function. For example, you can turn a serialized cts:query back into a cts:query as follows:

cts:query(
  <cts:word-query xmlns:cts="http://marklogic.com/cts">
    <cts:text>word</cts:text>
  </cts:word-query>
)
(: returns: cts:word-query("word", ("lang=en"), 1) :)
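Putting serialization and the cts:query function together, a round trip might be sketched as follows. The wrapper element name is arbitrary; it is only there to provide the element context that triggers serialization.

```xquery
xquery version "1.0-ml";
let $original := cts:and-query((
  cts:word-query("hello"), cts:word-query("world")))
(: Placing the cts:query inside an element serializes it to XML. :)
let $xml := <wrapper>{$original}</wrapper>/*
(: cts:query turns the serialized form back into a cts:query. :)
let $restored := cts:query($xml)
return cts:search(fn:doc(), $restored)
```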

Example: Creating a cts:query Parser

The following sample code shows a simple query string parser that treats text enclosed in double-quotation marks as a phrase, and treats anything else that is separated by one or more spaces as a single term. If needed, you can use the same design pattern to add other logic to do more complex parsing (for example, OR processing or NOT processing).

xquery version "1.0-ml";
declare function local:get-query-tokens($input as xs:string?) 
  as element() {
(: This parses double-quotes to be exact matches. :)
<tokens>{
let $newInput := fn:string-join(
(: check if there is more than one double-quotation mark.  If there is, 
   tokenize on the double-quotation mark ("), then change the spaces
   in the even tokens to the string "!+!".  This will then allow later
   tokenization on spaces, so you can preserve quoted phrases as phrase
   searches (after re-replacing the "!+!" strings with spaces).  :)
    if ( fn:count(fn:tokenize($input, '"')) > 2 )
    then ( for $i at $count in fn:tokenize($input, '"')
           return
             if ($count mod 2 = 0)
             then fn:replace($i, "\s+", "!+!")
             else $i )
    else ( $input ) , " ")
let $tokenInput := fn:tokenize($newInput, "\s+")

return (
for $x in $tokenInput
where $x ne ""
return
<token>{fn:replace($x, "!\+!", " ")}</token>)
}</tokens>
} ;

let $input := 'this is a "really big" test'
return
local:get-query-tokens($input)

This returns the following:

<tokens>
  <token>this</token>
  <token>is</token>
  <token>a</token>
  <token>really big</token>
  <token>test</token>
</tokens>

Now you can derive a cts:query expression from the tokenized XML produced above, which composes all of the terms with a cts:and-query, as follows (assuming the local:get-query-tokens function above is available to this function):

xquery version "1.0-ml";
declare function local:get-query($input as xs:string) 
{
let $tokens := local:get-query-tokens($input)
return
 cts:and-query( (cts:and-query(
        for $token in $tokens//token
        return 
        cts:word-query($token/text()) ) ))
} ;

let $input := 'this is a "really big" test'
return
local:get-query($input)

This returns the following (spacing and line breaks added for readability):

cts:and-query(
  cts:and-query((
    cts:word-query("this", (), 1), 
    cts:word-query("is", (), 1), 
    cts:word-query("a", (), 1), 
    cts:word-query("really big", (), 1), 
    cts:word-query("test", (), 1)
    ), ()) ,
  () )

You can now take the generated cts:query expression and add it to a cts:search.

Similarly, you can generate a serialized cts:query as follows (assuming the local:get-query-tokens function is available):

xquery version "1.0-ml";
declare function local:get-query-xml($input as xs:string) 
{
let $tokens := local:get-query-tokens($input)
return
 element cts:and-query { 
       element cts:and-query { 
           for $token in $tokens//token
           return 
           element cts:word-query { $token/text() } },
           element cts:annotation {$input} }
} ;

let $input := 'this is a "really big" test'
return
local:get-query-xml($input)

This returns the following XML serialization:

<cts:and-query xmlns:cts="http://marklogic.com/cts">
  <cts:and-query>
    <cts:word-query>this</cts:word-query>
    <cts:word-query>is</cts:word-query>
    <cts:word-query>a</cts:word-query>
    <cts:word-query>really big</cts:word-query>
    <cts:word-query>test</cts:word-query>
  </cts:and-query>
  <cts:annotation>this is a "really big" test</cts:annotation>
</cts:and-query>