Search Developer's Guide — Chapter 3

Searching Using String Queries

This chapter describes how to perform searches using simple string queries with the Search API.

This chapter provides background, design patterns, and examples of using string queries. For the function signatures and descriptions, see the Search documentation under XQuery Library Modules in the MarkLogic XQuery and XSLT Function Reference.

String Query Overview

A string query is a plain text search string composed of terms, phrases, and operators that can be easily composed by end users typing into an application search box. For example, cat AND dog is a string query for finding documents that contain both the term cat and the term dog.

For historical reasons, MarkLogic supports two similar string query grammars. The XQuery Search API and the REST, Java, and Node.js Client APIs support the grammar discussed in this chapter. The XQuery cts:parse function, the JavaScript cts.parse function, and the JavaScript jsearch API support a similar grammar; for details, see Creating a Query From Search Text With cts:parse. The two grammars share the same basic set of operators, but differ in how you define constraints and in the degree of customizability.

The syntax of a string query is determined by a configurable grammar. A powerful default grammar is pre-defined. You can modify or extend the grammar through the grammar search option. For details, see The Default String Query Grammar and Modifying and Extending the String Query Grammar.

The default grammar provides a robust ability to generate complex queries. The following are some examples of queries that use the default grammar:

  • (cat OR dog) NEAR vet

    at least one of the terms cat or dog within 10 terms (the default distance for cts:near-query) of the word vet

  • dog NEAR/30 vet

    the word dog within 30 terms of the word vet

  • cat -dog

    the word cat where there is no word dog

You can use string queries to search contents and metadata with the MarkLogic Server APIs that support this grammar: the XQuery Search API and the REST, Java, and Node.js Client APIs.

The Default String Query Grammar

The Search API has a built-in default grammar for interpreting string queries such as cat AND dog. The default grammar enables you to write applications that perform complex queries against a database based on simple search strings. You can also modify the default grammar or define a custom grammar; for details, see Modifying and Extending the String Query Grammar.

Query Components and Operators

Use the following components and operators to form string queries with the default search grammar. Where a row lists several examples, they are separated by semicolons.

| Query | Example | Description |
|---|---|---|
| any terms | dog; dog cat | Match one or more terms, as with a cts:and-query. Adjacent terms and phrases are implicitly joined with AND. For example, dog cat is the same as dog AND cat. |
| " " | "dog tail"; "dog tail" "cat whisker"; dog "cat whisker" | Terms in double quotes are treated as a phrase. Adjacent terms and phrases are implicitly joined with AND. For example, dog "cat whisker" matches documents containing both the term dog and the phrase cat whisker. |
| ( ) | (cat OR dog) zebra | Parentheses indicate grouping. The example matches documents containing at least one of the terms cat or dog and also containing the term zebra. |
| -query | -dog; -(dog OR cat); cat -dog | A NOT operation, as with a cts:not-query. For example, cat -dog matches documents that contain the term cat but do not contain the term dog. |
| query1 AND query2 | dog AND cat; (cat OR dog) AND zebra | Match two query expressions, as with a cts:and-query. For example, dog AND cat matches documents containing both the term dog and the term cat. AND is the default way to combine terms and phrases, so the previous example is equivalent to dog cat. |
| query1 OR query2 | dog OR cat | Match either of two queries, as with a cts:or-query. The example matches documents containing at least one of the terms cat or dog. |
| query1 NOT_IN query2 | dog NOT_IN "dog house" | Match one query when the match does not overlap with another, as with cts:not-in-query. The example matches occurrences of dog when it is not in the phrase dog house. |
| query1 NEAR query2 | dog NEAR cat; (cat food) NEAR mouse | Find documents containing matches to the queries on either side of the NEAR operator when the matches occur within 10 terms of each other, as with a cts:near-query. For example, dog NEAR cat matches documents containing dog within 10 terms of cat. |
| query1 NEAR/N query2 | dog NEAR/2 cat | Find documents containing matches to the queries on either side of the NEAR operator when the matches occur within N terms of each other, as with a cts:near-query. The example matches documents where the term dog occurs within 2 terms of the term cat. |
| constraint:value | color:red; decade:1980s; birthday:1999-12-31 | Find documents that match the named constraint with the given value, as with a cts:element-range-query or other range query. For details, see Using Relational Operators on Constraints. |
| operator:state | sort:relevance; sort:date | Apply a runtime configuration operator such as sort order, defined by an operator XML element or JSON property in the search options. For details, see Operator Options. |
| constraint LT value | color LT red; birthday LT 1999-12-31 | Find documents that match the named range constraint with a value less than value. For details, see Using Relational Operators on Constraints. |
| constraint LE value | color LE red; birthday LE 1999-12-31 | Find documents that match the named range constraint with a value less than or equal to value. For details, see Using Relational Operators on Constraints. |
| constraint GT value | color GT red; birthday GT 1999-12-31 | Find documents that match the named range constraint with a value greater than value. For details, see Using Relational Operators on Constraints. |
| constraint GE value | color GE red; birthday GE 1999-12-31 | Find documents that match the named range constraint with a value greater than or equal to value. For details, see Using Relational Operators on Constraints. |
| constraint NE value | color NE red; birthday NE 1999-12-31 | Find documents that match the named range constraint with a value that is not equal to value. For details, see Using Relational Operators on Constraints. |
| query1 BOOST query2 | george BOOST washington | Find documents that match query1. Boost the relevance score of documents that also match query2. The example returns all matches for the term george, with matches in documents that also contain washington having a higher relevance score. For more details, see cts:boost-query. |

Operator Precedence

The precedence of operators in the default grammar, from highest to lowest, is shown in the following table. Each row in the table represents a precedence level. Where multiple operators have the same precedence, evaluation occurs from left to right. Query sub-expressions using operators higher in the table are evaluated before sub-expressions using operators lower in the table.

Operator
:, LT, LE, GT, GE, NE
-
NOT_IN
BOOST
( ), NEAR, NEAR/N
AND
OR

For example, AND has higher precedence than OR, so the following queries:

A AND B OR C
A OR B AND C

Evaluate as if written as follows:

(A AND B) OR C
A OR (B AND C)
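
The grouping above can be reproduced with a small precedence-climbing parser. The following Python sketch is a toy model and not MarkLogic's parser: it handles only single-word terms joined by AND and OR, and it borrows the strength values 20 and 10 that the default grammar assigns to those joiners. The names STRENGTH, parse, and group are hypothetical.

```python
# Toy model of joiner precedence; this is not MarkLogic's actual parser.
# Strength values are taken from the default grammar (AND=20, OR=10).
STRENGTH = {"AND": 20, "OR": 10}


def parse(tokens, min_strength=0):
    """Precedence climbing over a list of whitespace-separated tokens."""
    left = tokens.pop(0)  # a single-word term
    # Anything that is not a known joiner ends the loop (strength -1).
    while tokens and STRENGTH.get(tokens[0], -1) >= min_strength:
        op = tokens.pop(0)
        # Only stronger joiners may bind inside the right operand, so
        # joiners of equal strength associate left to right.
        right = parse(tokens, STRENGTH[op] + 1)
        left = f"({left} {op} {right})"
    return left


def group(query):
    """Return the query with explicit parentheses showing evaluation order."""
    return parse(query.split())


print(group("A AND B OR C"))  # ((A AND B) OR C)
print(group("A OR B AND C"))  # (A OR (B AND C))
```

Raising the strength of OR above 20 in this sketch would flip both groupings, which is the same lever the strength attribute provides in a grammar definition.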

Using Relational Operators on Constraints

The relational query operators :, LT, LE, GT, GE, and NE accept a constraint name on the left hand side and a value on the right hand side. That is, queries using these operators are of the following form:

constraint op value

These relational operators match fragments that meet the named constraint with a value that matches the relationship defined by the operator (equals, less than, greater than, etc.). For example, if your query options define an element word constraint named color, then color:red matches documents that contain elements meeting the color constraint with a value of red. For details and more examples, see Constraint Options.

The constraint name must be the name of a <constraint/> XML element or "constraint" JSON object defined by the query options governing the search. The constraint can be a word, value, range, or geospatial constraint. There must be a range index associated with the constraint.

If the constraint is unbucketed, the value on the right hand side of the operator must be convertible to the type of the constraint. For example, if the range index behind the constraint has type xs:date, then the value to match must represent an xs:date.

If the constraint is bucketed, then the value must be the name of a bucket defined by the constraint. For example, if searching using the decade bucketed constraint defined in Bucketed Range Constraint Example, then the value on the right hand side must be a bucket name such as 1920s or 2000s, as in decade:1920s.
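
As a rough client-side illustration of the rules above, the following Python sketch splits a `constraint op value` expression into its parts and checks whether a value could stand for a date. It is an assumption-laden model, not part of any MarkLogic API: parse_constraint and as_date are hypothetical helpers, the expression is assumed to contain a single constraint, and xs:date values are assumed to use the plain YYYY-MM-DD form without a timezone.

```python
from datetime import date

# Relational joiners from the default grammar; ":" means equality.
OPERATORS = {":", "LT", "LE", "GT", "GE", "NE"}


def parse_constraint(expr):
    """Split 'constraint op value' (or 'constraint:value') into three parts."""
    if ":" in expr and " " not in expr:
        name, value = expr.split(":", 1)
        return name, ":", value
    name, op, value = expr.split(None, 2)
    if op not in OPERATORS:
        raise ValueError(f"unknown operator: {op}")
    return name, op, value


def as_date(value):
    """Return a date, or None if the value is not a plain YYYY-MM-DD date."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


print(parse_constraint("birthday LT 1999-12-31"))  # ('birthday', 'LT', '1999-12-31')
print(as_date("1999-12-31"))                       # 1999-12-31
print(as_date("1920s"))                            # None
```

A value such as 1920s fails the date check, which is consistent with it being a bucket name rather than a value of the index type.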

String Query Examples

The default grammar provides a robust ability to generate complex queries. The following are some examples of queries that use the default grammar:

  • (cat OR dog) NEAR vet

    at least one of the terms cat or dog within 10 terms (the default distance for cts:near-query) of the word vet

  • dog NEAR/30 vet

    the word dog within 30 terms of the word vet

  • cat -dog

    the word cat where there is no word dog
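
String queries like these contain spaces, parentheses, and slashes, so they need URL encoding when sent over HTTP. The following Python sketch shows one way to do that with the standard library. It assumes the REST Client API's /v1/search endpoint and its q parameter; the host and port are placeholders, and search_url is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/v1/search"  # placeholder host and port


def search_url(string_query):
    """Build a search URL with the string query URL-encoded in 'q'."""
    return BASE_URL + "?" + urlencode({"q": string_query})


for q in ["(cat OR dog) NEAR vet", "dog NEAR/30 vet", "cat -dog"]:
    print(search_url(q))
```

The query text itself is passed through unchanged; only its transport encoding differs, so the grammar rules described in this chapter still apply to the decoded string.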

Modifying and Extending the String Query Grammar

Search API string query grammar customization is deprecated as of MarkLogic 9. If you require a custom string query grammar, use a third-party library instead. For details, see Search API Grammar Customization Deprecated in the Release Notes.

You can customize the grammar used for constructing string queries by specifying a custom grammar XML element or JSON object in the query options used with a search. A grammar is defined by the following components:

  • starter
  • joiner
  • quotation
  • implicit

A grammar must contain at least one starter, joiner, or implicit element. If a grammar element is present in your query options, but it is empty, the search is parsed according to the term-option settings.

The following is the default string query grammar that implements the syntax and semantics described in The Default String Query Grammar. You can retrieve the default grammar by retrieving the default query options; for details, see Getting the Default Query Options.

<grammar xmlns="http://marklogic.com/appservices/search">
  <quotation>"</quotation>
  <implicit>
    <cts:and-query strength="20" xmlns:cts="http://marklogic.com/cts"/>
  </implicit>
  <starter strength="30" apply="grouping" delimiter=")">(</starter>
  <starter strength="40" apply="prefix" element="cts:not-query">-</starter>
  <joiner strength="10" apply="infix" element="cts:or-query"
     tokenize="word">OR</joiner>
  <joiner strength="20" apply="infix" element="cts:and-query"
     tokenize="word">AND</joiner>
  <joiner strength="30" apply="infix" element="cts:near-query"
     tokenize="word">NEAR</joiner>
  <joiner strength="30" apply="near2" consume="2"
     element="cts:near-query">NEAR/</joiner>
  <joiner strength="32" apply="boost" element="cts:boost-query"
     tokenize="word">BOOST</joiner>
  <joiner strength="35" apply="not-in" element="cts:not-in-query"
     tokenize="word">NOT_IN</joiner>
  <joiner strength="50" apply="constraint">:</joiner>
  <joiner strength="50" apply="constraint" compare="LT"
     tokenize="word">LT</joiner>
  <joiner strength="50" apply="constraint" compare="LE"
     tokenize="word">LE</joiner>
  <joiner strength="50" apply="constraint" compare="GT"
     tokenize="word">GT</joiner>
  <joiner strength="50" apply="constraint" compare="GE"
     tokenize="word">GE</joiner>
  <joiner strength="50" apply="constraint" compare="NE"
     tokenize="word">NE</joiner>
</grammar>
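
One way to see how the strength attributes line up with the operator precedence table is to read them out of the grammar. The following Python sketch does so with the standard library's ElementTree. The embedded grammar is an abbreviated copy of the listing above (the NEAR/ joiner and the LT through NE joiners are omitted), and tokens_by_strength is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"s": "http://marklogic.com/appservices/search"}

# Abbreviated copy of the default grammar shown above.
GRAMMAR = """
<grammar xmlns="http://marklogic.com/appservices/search">
  <starter strength="30" apply="grouping" delimiter=")">(</starter>
  <starter strength="40" apply="prefix" element="cts:not-query">-</starter>
  <joiner strength="10" apply="infix" element="cts:or-query" tokenize="word">OR</joiner>
  <joiner strength="20" apply="infix" element="cts:and-query" tokenize="word">AND</joiner>
  <joiner strength="30" apply="infix" element="cts:near-query" tokenize="word">NEAR</joiner>
  <joiner strength="32" apply="boost" element="cts:boost-query" tokenize="word">BOOST</joiner>
  <joiner strength="35" apply="not-in" element="cts:not-in-query" tokenize="word">NOT_IN</joiner>
  <joiner strength="50" apply="constraint">:</joiner>
</grammar>
"""


def tokens_by_strength(grammar_xml):
    """Return (strength, token) pairs, strongest first."""
    root = ET.fromstring(grammar_xml)
    rules = root.findall("s:starter", NS) + root.findall("s:joiner", NS)
    pairs = [(int(rule.get("strength")), rule.text) for rule in rules]
    # sorted() is stable, so tokens of equal strength keep document order.
    return sorted(pairs, key=lambda pair: -pair[0])


for strength, token in tokens_by_strength(GRAMMAR):
    print(strength, token)
```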

The following table describes the concepts used in the search grammar:

Concept Description
implicit The implicit grammar element specifies the cts:query to use by default to join two search terms together. By default, the Search API uses a cts:and-query, but you can change it to any cts:query with the implicit grammar option.
starter A starter is a string that appears before a term to denote special parsing for the term, for example, the minus sign ( - ) for negation. Additionally, when used with the delimiter attribute, a starter specifies starting and ending strings that separate terms for grouping things together, and allows the grammar to set an order of precedence for terms when parsing a string.
joiner A joiner is a string that combines two terms together. For example, AND and OR function as joiners in these queries using the default grammar:
cat AND dog
cat OR dog
The default grammar also uses joiners for the string that separates a constraint or operator from its value, as described in Constraint Options and Operator Options. If joiner/@tokenize is set to "word", then the terms and the joiner must be whitespace-separated; otherwise the parser looks for the joiner string anywhere in the query text.
quotation The quotation string specifies the string to use to indicate the start and end of a phrase. For example, in the default grammar, the following is parsed as a phrase (instead of a sequence of terms combined with an AND):
"this is a phrase"
strength The strength attribute provides the parser with information on which tokens are processed first. Higher strength tokens or groups are processed before lower strength tokens or groups.

The starter elements define how to parse portions of the grammar. The apply attributes specify the functions to which the starter and the delimiter apply.

The joiner elements define how to parse various operators, constraints, and other operations, and specify the functions that define each joiner's behavior. For example, to change the OR joiner above, which joins tokens with a cts:or-query, so that it uses the pipe character ( | ) instead, substitute the following joiner element for the one above:

  <search:joiner strength="10" apply="infix" element="cts:or-query"
       tokenize="word">|</search:joiner>

Setting @tokenize to word specifies that a token must have whitespace immediately before and after it in order to be recognized. Without that attribute, if OR was the joiner, then a search for CORN would result in a search for C OR N (cts:or-query(("C"), ("N"))). With joiners used in constraints (for example, the colon character :), you probably do not want that, so the tokenize attribute is omitted, thus allowing searches like decade:1990s to parse as a constraint.
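
The difference between the two tokenizing modes can be imitated with regular expressions. The following Python sketch is only an illustration of the behavior described above, not how the Search API tokenizes internally; split_on_joiner is a hypothetical helper.

```python
import re


def split_on_joiner(text, joiner, tokenize_word):
    """Split query text on a joiner string.

    With tokenize_word=True the joiner must be surrounded by whitespace;
    otherwise it is matched anywhere in the text.
    """
    if tokenize_word:
        pattern = r"\s+" + re.escape(joiner) + r"\s+"
    else:
        pattern = re.escape(joiner)
    return re.split(pattern, text)


print(split_on_joiner("CORN", "OR", tokenize_word=False))         # ['C', 'N']
print(split_on_joiner("CORN", "OR", tokenize_word=True))          # ['CORN']
print(split_on_joiner("cat OR dog", "OR", tokenize_word=True))    # ['cat', 'dog']
print(split_on_joiner("decade:1990s", ":", tokenize_word=False))  # ['decade', '1990s']
```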

You can add a joiner string to specify the composable cts:query elements that take a sequence of queries (cts:or-query, cts:and-query, or cts:near-query) by specifying the element in the element attribute on an apply="infix" joiner. For example, the following search:joiner element specifies a joiner for cts:near-query, which would combine the surrounding terms with a cts:near-query (and would use the default distance of 10) using the joiner string CLOSETO:

<search:joiner strength="10" apply="infix" element="cts:near-query"
       tokenize="word">CLOSETO</search:joiner>

Using the above joiner specification, the following query text bicycle CLOSETO shop would return matches that have bicycle and shop within 10 words of each other.

By default, the search grammar is very powerful, and implements a grammar similar to the Google grammar. With the customization, you can make it even more powerful and customize it to your specific needs. To add custom parsing, you must implement a function and use the apply, ns, at design pattern (described in Search Customization Via Options and Extensions) and construct a search:grammar options node to point to the function(s) you implemented.

starter

A starter defines a unary prefix operator or a pair of grouping symbols. For example, the default grammar defines the minus sign ( - ) as a starter for negation and parentheses ( () ) as a grouping starter.

A grammar query option can contain 0 or more starter elements, but must contain at least one starter or joiner.

Do the following to define a unary starter operator in your grammar:

  1. Identify the XQuery parsing function using the apply, at, and ns attributes, as described in Search Customization Via Options and Extensions.
  2. Put the operator token in the XML <starter/> text node or the JSON label sub-object.
  3. Set strength to reflect the evaluation precedence this operator should have relative to other operators in the same grammar.
  4. Set element to the QName of the cts:query element returned by the parsing function. For example, the negation operator defined by the default grammar produces a cts:not-query element.
  5. Optionally, set options to a space separated list of search options to pass to the parsing function.

For example, the default grammar defines a unary - operator as follows:

XML:
<starter strength="40" apply="prefix"
  element="cts:not-query">-</starter>
JSON:
"starter": [
  {
    "strength": 40,
    "apply": "prefix",
    "element": "cts:not-query",
    "label": "-"
  }
]

Do the following to define a grouping symbol in your grammar:

  1. Identify the XQuery parsing function using the apply, at, and ns attributes, as described in Search Customization Via Options and Extensions.
  2. Put the grouping start token in the XML <starter/> text node or the JSON label sub-object.
  3. Set delimiter to the grouping end token.
  4. Set strength to reflect the evaluation precedence this operator should have relative to other operators in the same grammar.
  5. Set element to the QName of the cts:query element returned by the parsing function. For example, the negation operator defined by the default grammar produces a cts:not-query element.
  6. Optionally, set options to a space separated list of search options to pass to the parsing function.

For example, the default grammar defines ( ) as grouping tokens as follows:

XML:
<starter strength="30" apply="grouping"
  delimiter=")">(</starter>
JSON:
"starter": [
  {
    "strength": 30,
    "apply": "grouping",
    "delimiter": ")",
    "label": "("
  }
]

joiner

A joiner defines a binary operator that joins two string query expressions. Examples of joiners in the default grammar include AND, OR, LT, and colon ( : ).

A grammar query option can contain 0 or more joiners, but must contain at least one starter or joiner.

Do the following to define a joiner:

  1. Identify the XQuery parsing function using the apply, at, and ns attributes, as described in Search Customization Via Options and Extensions.
  2. Put the operator token or symbol in the XML <joiner/> text node or the JSON label sub-object.
  3. Set strength to reflect the evaluation precedence this operator should have relative to other operators in the same grammar.
  4. Set element to the QName of the cts:query element returned by the parsing function. For example, the AND operator defined by the default grammar produces a cts:and-query element.
  5. Optionally, set options to a space separated list of search options to pass to the parsing function.

For example, the default grammar defines the AND operator as follows:

XML:
<joiner strength="20" apply="infix" element="cts:and-query"
  tokenize="word">AND</joiner>
JSON:
"joiner": [
  {
    "strength": 20,
    "apply": "infix",
    "element": "cts:and-query",
    "tokenize": "word",
    "label": "AND"
  }
]

quotation

The quotation grammar element defines the symbol used to demarcate phrases. The default grammar uses double quotes ( " ):

<quotation>"</quotation>

A grammar can contain at most one quotation definition.

In XML, place the symbol in the text node of the <quotation/> element. In JSON, place the symbol in the string value associated with the quotation key. For example, to use percent ( % ) instead of double quotes for phrases, include the following in your grammar:

XML:
<quotation>%</quotation>
JSON:
"quotation": "%"
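
To make the effect of a custom quotation symbol concrete, the following Python sketch separates a query string into phrases and terms using a configurable symbol. It is a simplified model rather than the Search API's tokenizer: it assumes quotation symbols are balanced and ignores operators, and split_phrases is a hypothetical helper.

```python
def split_phrases(text, quotation='"'):
    """Return (kind, text) pairs, treating quoted runs as phrases."""
    parts = []
    # Odd-indexed chunks lie between a pair of quotation symbols.
    for i, chunk in enumerate(text.split(quotation)):
        if i % 2 == 1:
            parts.append(("phrase", chunk))
        else:
            parts.extend(("term", term) for term in chunk.split())
    return parts


print(split_phrases('dog "cat whisker"'))
# [('term', 'dog'), ('phrase', 'cat whisker')]
print(split_phrases("dog %cat whisker%", quotation="%"))
# [('term', 'dog'), ('phrase', 'cat whisker')]
```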

implicit

The implicit grammar element specifies how to handle adjacent terms that are not separated by an explicit joiner operator, such as the terms in the string query cat dog. A grammar query option can contain at most one implicit rule.

Do the following to define an implicit operation:

  1. Select a query type from the cts:query hierarchy defined in cts:query Hierarchy.
  2. If you are building XML query options, add a child element of the appropriate type to the <implicit/> element. You can build this using the cts:query constructors. For example, you can construct an empty cts:and-query by evaluating cts:and-query((), ()).
  3. If you are building JSON query options, set the value associated with the implicit key to the serialized representation of the cts:query element type selected in Step 1.

For example, the default grammar includes an implicit rule that specifies cts:and-query as the implicit operation, so cat dog is equivalent to cat AND dog:

XML:
<implicit>
  <cts:and-query
    xmlns:cts="http://marklogic.com/cts"/>
</implicit>
JSON:
"implicit":
  "<cts:and-query xmlns:cts='http://marklogic.com/cts'/>"