Search Developer's Guide — Chapter 17

Understanding and Using Wildcard Searches

This chapter describes wildcard searches in MarkLogic Server: what wildcards are, how to enable wildcard searches, and how wildcards interact with other search features.

Wildcards in MarkLogic Server

Wildcard searches enable MarkLogic Server to return results that match combinations of characters and wildcards. Wildcard searches are not simply exact string matches, but are based on character pattern matching between the characters specified in a query and words in documents that contain those character patterns. This section describes the wildcard characters and the rules for wildcard searches.

Wildcard Characters

MarkLogic Server supports two wildcard characters: * and ?.

  • * matches zero or more non-space characters.
  • ? matches exactly one non-space character.

For example, he* will match any word starting with he, such as he, her, help, hello, helicopter, and so on. On the other hand, he? will only match three-letter words starting with he, such as hem, hen, and so on.
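
As a minimal sketch of the two wildcard characters, the following query counts the documents matched by each pattern. The "wildcarded" option is set explicitly so that the example does not depend on the database index settings. Searching fn:doc() and counting the results are illustrative choices, not requirements.

```xquery
xquery version "1.0-ml";

(: he* : any word starting with "he" (he, her, help, hello, ...) :)
let $star := cts:word-query("he*", "wildcarded")
(: he? : only three-letter words starting with "he" (hem, hen, ...) :)
let $question := cts:word-query("he?", "wildcarded")
return (
  fn:count(cts:search(fn:doc(), $star)),
  fn:count(cts:search(fn:doc(), $question))
)
```

Because every word that matches he? also matches he*, the first count should never be smaller than the second.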

Rules for Wildcard Searches

The following are the basic rules for wildcard searches in MarkLogic Server:

  • There can be more than one wildcard in a single search term or phrase, and the two wildcard characters can be used in combination. For example, m*?? will match words starting with m with three or more characters.
  • Spaces are used as word breaks, and wildcard matching only works within a single word. For example, m*th* will match method but not meet there.
  • If the * wildcard is specified by itself in a value query (for example, cts:element-value-query, cts:element-value-match), it matches everything (spanning word breaks). For example, * will match the value meet me there.
  • If the * wildcard is combined with non-wildcard characters, it matches across word breaks in value lexicon queries (for example, cts:element-value-match), but not in value queries (for example, cts:element-value-query). For example, m* matches the value meet me there in a value lexicon search, but not in a value query, because in a value query m* matches only a single word. A value query for m* * does match the value, because m* matches the first word and * matches everything after it.
  • If "wildcarded" is explicitly specified in the cts:query expression, then the search is performed as a wildcard search.
  • If neither "wildcarded" nor "unwildcarded" is specified in the cts:query expression, the database configuration and query text determine wildcarding. If the database has any wildcard indexes enabled (three character searches, two character searches, one character searches, or trailing wildcard searches) and if the query text contains either of the wildcard characters ? or *, then the wildcard characters are treated as wildcards and the search is performed "wildcarded". If none of the wildcard indexes are enabled, the wildcard characters are treated as punctuation and the search is performed unwildcarded (unless "wildcarded" is specified in the cts:query expression).
  • If the query has the punctuation-sensitive option, then punctuation is treated as word characters for wildcard searches. For example, a punctuation-sensitive wildcard search for d*benz would match daimler-benz.
  • If the query has the whitespace-sensitive option, then whitespace is treated as word characters. This can be useful for matching spaces in wildcarded value queries. You can use the whitespace-sensitive option in wildcarded word queries, too, although it might not make much sense, as it will match more than you might expect.
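
The value query rules above can be sketched as follows. The element name title and its value meet me there are hypothetical, chosen to mirror the examples in the rules. The cts:element-value-match call additionally assumes that a string range index exists on that element.

```xquery
xquery version "1.0-ml";

(: "title" is a hypothetical element whose value is "meet me there". :)
let $title := xs:QName("title")
return (
  (: A lone * spans word breaks, so it matches the whole value. :)
  cts:search(fn:doc(), cts:element-value-query($title, "*", "wildcarded")),

  (: m* covers a single word only, so per the rules above this value
     query does not match the three-word value. :)
  cts:search(fn:doc(), cts:element-value-query($title, "m*", "wildcarded")),

  (: m* matches the first word and * matches everything after it. :)
  cts:search(fn:doc(), cts:element-value-query($title, "m* *", "wildcarded")),

  (: Value lexicon match: m* is compared against the entire value.
     Assumes a string range index on the title element. :)
  cts:element-value-match($title, "m*")
)
```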

Enabling Wildcard Searches

Wildcard searches use character indexes, lexicons, and trailing wildcard indexes to speed performance. To ensure that wildcard searches are fast, you should enable at least one wildcard index (three character searches, trailing wildcard searches, two character searches, and/or one character searches) and fast element character searches (if you want fast searches within specific elements) in the Admin Interface database configuration screen. Wildcard indexes are disabled by default. If you enable character indexes, you should plan on allocating additional disk space of approximately three times the size of the source content.

This section describes how to specify wildcards in queries, the recommended wildcard index settings, and the different wildcard indexes.

Specifying Wildcards in Queries

If any wildcard indexes are enabled for the database, you can further control the use of wildcards at the query level. You can use wildcards with any of the MarkLogic cts:query leaf-level functions, such as cts:word-query, cts:element-word-query, and cts:element-value-query. For details on the cts:query functions, see Composing cts:query Expressions. You can use the "wildcarded" and "unwildcarded" query options to turn wildcarding on or off explicitly in the cts:query constructor functions. See the MarkLogic XQuery and XSLT Function Reference for more details.

If you leave the wildcard option unspecified and there are any wildcard indexes enabled, MarkLogic Server will perform a wildcard query if * or ? is present in the query. For example, the following search function:

cts:search(fn:doc(), cts:word-query("he*")) 

will result in a wildcard search. Therefore, as long as any wildcard indexes are enabled in the database, you do not have to turn on wildcarding explicitly to perform wildcard searches.

When wildcard indexing is enabled in the database, the system will also deliver higher performance for fn:contains, fn:matches, fn:starts-with and fn:ends-with for most query expressions.

If character indexes, lexicons, and trailing wildcard indexes are all disabled in a database and wildcarding is explicitly enabled in the query (with the "wildcarded" option to the leaf-level cts:query constructor), the query will execute, but might require a lot of processing. Such queries will be fast if they are very selective and only need to do the wildcard searches over a relatively small amount of content, but can take a long time if they actually need to filter out results from a large amount of content.
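
For comparison, the three ways of controlling wildcarding at the query level can be written side by side. This is a sketch only; the pattern he* and the search over fn:doc() are illustrative.

```xquery
xquery version "1.0-ml";

(
  (: Explicitly wildcarded: runs as a wildcard search whatever the
     database index settings, but may be slow without wildcard indexes. :)
  cts:search(fn:doc(), cts:word-query("he*", "wildcarded")),

  (: Explicitly unwildcarded: * is treated as punctuation, not a wildcard. :)
  cts:search(fn:doc(), cts:word-query("he*", "unwildcarded")),

  (: Option omitted: wildcarded only if at least one wildcard index
     is enabled in the database. :)
  cts:search(fn:doc(), cts:word-query("he*"))
)
```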

Recommended Wildcard Index Settings

To enable any kind of wildcard query functionality with a good combination of performance and database size, MarkLogic recommends the following index settings:

  • word searches
  • three character searches
  • word positions
  • word lexicon in the codepoint collation
  • three character word positions

This combination will provide accurate and fast wildcard queries for a wide variety of wildcard searches, including leading and trailing wildcarded searches. If you add the trailing wildcard searches index, you will get slightly more efficient trailing wildcard searches, but with increased database size. The next section describes the various different wildcard indexes.
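
This chapter configures indexes through the Admin Interface. As an alternative, the following sketch applies the recommended settings with the Admin API. The database name Documents is an assumption; substitute your own. Changing index settings causes the database to reindex if reindexing is enabled.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: "Documents" is an assumed database name. :)
let $db := xdmp:database("Documents")
let $config := admin:database-set-word-searches($config, $db, fn:true())
let $config := admin:database-set-three-character-searches($config, $db, fn:true())
let $config := admin:database-set-word-positions($config, $db, fn:true())
let $config := admin:database-set-three-character-word-positions($config, $db, fn:true())
(: Adds a word lexicon in the codepoint collation. This call raises an
   error if the database already has the same lexicon. :)
let $config := admin:database-add-word-lexicon($config, $db,
  admin:database-word-lexicon("http://marklogic.com/collation/codepoint"))
return admin:save-configuration($config)
```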

Understanding the Different Wildcard Indexes

You configure the index settings at the database level, using the Admin Interface. For details on configuring database settings and on other text indexes, see the 'Databases' and 'Text Indexes' chapters in the Administrator's Guide.

The following table describes the indexing options that apply to wildcard searches.

Index | Description
word lexicon | Speeds up wildcard searches. Works in combination with any other available wildcard indexes to improve search index resolution and performance. When used in conjunction with the three character search index, improves wildcard index resolution and speeds up wildcard searches. If you have three character search and a word lexicon enabled for a database, then there is no need for either the one character or two character search indexes. For best performance, the word lexicon should be in the codepoint collation (http://marklogic.com/collation/codepoint). Additionally, enabling word searches in the database will improve the accuracy of wildcard index resolution (more accurate xdmp:estimate queries).
trailing wildcard searches | Speeds up wildcard searches where the search pattern contains the wildcard character at the end (for example, abc*). Turn this index on if you want to speed wildcard searches that match a trailing wildcard. The trailing wildcard search index will use a similar amount of space as the three character search index, but will generally be more efficient for trailing wildcard queries.
three character searches | Speeds up wildcard searches where the search pattern contains three or more consecutive non-wildcard characters (for example, *abc). Turn this index on if you want to speed wildcard searches that match three or more characters anywhere in the wildcard expression. When combined with a codepoint word lexicon, speeds the performance of any wildcard search (including searches with fewer than three consecutive non-wildcard characters). MarkLogic recommends combining the three character search index with a codepoint collation word lexicon.
fast element character searches | Speeds up wildcard searches within a specific element (element-based wildcard searches). Turn this index on if you want to improve performance of wildcard searches that query specific elements.
two character searches | Speeds up wildcard searches where the search pattern contains two or more consecutive non-wildcard characters. Turn this index on if you want to speed up wildcard searches that match two or more characters (for example, ab*). This index is not needed if you have three character searches and a word lexicon.
element, attribute, and field word lexicons | Speeds up wildcard searches for cts:element-value-query and cts:element-attribute-value-query expressions. Works in combination with any other available wildcard indexes to improve search index resolution and performance. When used in conjunction with the three character search index, improves wildcard index resolution and speeds up wildcard searches.
one character searches | Speeds up wildcard searches where the search pattern contains only a single non-wildcard character. Turn this index on if you want to speed up wildcard searches that match one or more characters (for example, a*). This index is not needed if you have three character searches and a word lexicon.

As with all indexing, choosing which indexes you need is a trade-off. More indexes provide improved query performance, but use more disk space and increase load and reindexing time. For most environments where wildcard searches are required, MarkLogic recommends enabling the three character searches index and a codepoint collation word lexicon, but disabling one and two character searches.

The three character searches index combined with the word lexicon provides the best performance for most queries, and the fast element character searches index is useful when you submit element queries. One and two character searches indexes are only used if you submit wildcard searches that try to match only one or two characters and you do not have the combination of a word lexicon and the three character searches index. Because one and two character searches generally return a large number of matches, they might not justify the disk space and load time trade-offs.
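
One way to judge whether the enabled indexes resolve a given wildcard pattern accurately is to compare the index-only estimate with the filtered result count. The following is a sketch of that check, not a prescribed procedure; the pattern he* is illustrative.

```xquery
xquery version "1.0-ml";

let $query := cts:word-query("he*", "wildcarded")
return (
  (: Number of fragments selected by the indexes alone. :)
  xdmp:estimate(cts:search(fn:doc(), $query)),
  (: Number of results that remain after filtering. :)
  fn:count(cts:search(fn:doc(), $query))
)
```

A large gap between the two numbers suggests that the indexes select many candidates that filtering later discards, which is the situation the index combinations above are meant to reduce.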

If you have the three character searches index enabled and two and one character indexes disabled, and if you have no word lexicon, it is still possible to issue a wildcard query that searches for a two or one character stem (for example, ab* or a*); these searches are allowed, but will not be fast. If you have a search user interface that allows users to enter such queries, you might want to check for these two or one character wildcard search patterns and issue an error, as these searches without the corresponding indexes can be slow and resource-intensive. Alternatively, add a codepoint collation word lexicon to your database.
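
The check suggested above might look like the following sketch. The function local:short-wildcard-words and the error name SHORT-WILDCARD are hypothetical names introduced for illustration. The function returns each word that contains a wildcard but has no run of three or more consecutive non-wildcard characters.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: returns the wildcarded words in $text whose
   longest run of non-wildcard characters is shorter than three. :)
declare function local:short-wildcard-words($text as xs:string) as xs:string*
{
  for $word in fn:tokenize(fn:normalize-space($text), " ")
  let $runs := fn:tokenize($word, "[*?]+")
  let $longest := fn:max((0, for $run in $runs return fn:string-length($run)))
  where fn:matches($word, "[*?]") and $longest lt 3
  return $word
};

let $text := "ab* helicopter"
let $too-short := local:short-wildcard-words($text)
return
  if (fn:exists($too-short))
  then fn:error(xs:QName("SHORT-WILDCARD"),
         fn:concat("Wildcard terms need three or more consecutive ",
                   "non-wildcard characters: ",
                   fn:string-join($too-short, ", ")))
  else cts:search(fn:doc(), cts:word-query($text))
```

For the sample text, ab* is flagged because its longest non-wildcard run has two characters, so the query raises the error instead of running the search.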

Interaction with Other Search Features

This section describes the interactions between wildcarding, stemming, and punctuation sensitivity in MarkLogic Server.

Wildcarding and Stemming

Wildcard searches can be used in combination with stemming (for details on stemming, see Understanding and Using Stemmed Searches); that is, queries can perform stemmed searches and wildcard searches at the same time. However, the system will not perform a stemmed search on words that are wildcarded. For example, assume a search phrase of running car*. The term running will be matched based on its stem. However, car* will be matched based on a wildcard search, and will match car, cars, carriage, carpenter and so on; stemmed word matches for the words matching the wildcard are not returned.
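
The running car* example can be written as a single phrase query. This sketch assumes that stemmed searches and at least one wildcard index are enabled in the database, so that neither option needs to be set explicitly.

```xquery
xquery version "1.0-ml";

(: "running" is matched by its stem; "car*" is matched as a wildcard
   pattern and is not stemmed. :)
cts:search(fn:doc(), cts:word-query("running car*"))
```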

Wildcarding and Punctuation Sensitivity

Stemming and punctuation sensitivity operate independently of each other. However, there is an interaction between wildcarding and punctuation sensitivity. This section describes that interaction: how punctuation sensitivity is specified, the rules that govern the interaction, and examples.

Implicitly and Explicitly Specifying Punctuation

MarkLogic Server allows you to explicitly specify whether a query is punctuation sensitive and whether it uses wildcards. You specify this in the options for the query, as in the following example:

cts:search(fn:doc(), cts:word-query("hello!", "punctuation-sensitive") ) 

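If you include a wildcard character in a punctuation sensitive search and wildcarding is off (for example, because no wildcard indexes are enabled in the database), the wildcard character is treated as punctuation. For example, under those conditions the following query matches hello*, but not hellothere: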

cts:search(fn:doc(), cts:word-query("hello*", "punctuation-sensitive") ) 

If the punctuation sensitivity option is left unspecified, the system performs a punctuation sensitive search if there is any non-wildcard punctuation in the query terms. For example, if punctuation is not specified, the following query:

cts:search(fn:doc(), cts:word-query("hello!") ) 

will result in a punctuation sensitive search, and the following query:

cts:search(fn:doc(), cts:word-query("hello") ) 

will result in a punctuation insensitive search.

If a search is punctuation sensitive (whether implicitly or explicitly), MarkLogic Server will match the punctuation as well as the search term. Note that punctuation is not considered to be part of a word. For example, mark! is considered to be the word mark followed by an exclamation point. If a search is punctuation insensitive, punctuation will match spaces.

Rules for Punctuation and Wildcarding Interaction

The characters ? and * are considered punctuation in documents loaded into the database. However, ? and * are also treated as wildcard characters in a query. This makes for interesting (and occasionally confusing) interaction between wildcarding and punctuation sensitivity.

The following are the rules for the interaction between punctuation and wildcarding. They will help you determine how the system behaves when there are interactions between the punctuation and wildcard characters.

  1. When wildcard indexes are disabled in the database, all queries default to "unwildcarded", and wildcard characters are treated as punctuation. If you specify "wildcarded" in the query, the query is a wildcard query and wildcard characters are treated as wildcards.
  2. Wildcarding trumps (has precedence over) punctuation sensitivity. That is, if the * and/or ? characters are present in a query, * and ? are treated as wildcards and not punctuation unless wildcarding is turned off. If wildcarding is turned off in the query ("unwildcarded"), they are treated as punctuation.
  3. If wildcarding and punctuation sensitivity are both explicitly off and punctuation characters (including * and ?) are in the query, they are treated as spaces.
  4. Wildcarding and punctuation sensitivity can be on at the same time. In this case, punctuation in a document is treated as characters, and wildcards in the query will match any character in the document, including punctuation characters. Therefore, the following query will match both hello* and hellothere:

    cts:search(fn:doc(),
               cts:word-query("hello*",
                              ("punctuation-sensitive", "wildcarded")))

Examples of Wildcard and Punctuation Interactions

This section contains examples of how queries behave under different combinations of wildcard and punctuation sensitivity settings.

Wildcarding and Punctuation Sensitivity Not Specified (Wildcard Indexes Enabled)

The following examples show queries that are run when at least one wildcard index is enabled and no options are explicitly set on the cts:word-query.

  • Example query: cts:word-query("hello world")

    Actual behavior: Wildcarding off, punctuation insensitive

    Will match: hello world, hello ?! world, hello? world! and so on

  • Example query: cts:word-query("hello?world")

    Actual behavior: Wildcarding on, punctuation insensitive

    Will match: helloaworld

    Will not match: hello world, hello!world

  • Example query: cts:word-query("hello*world")

    Actual behavior: Wildcarding on, punctuation insensitive

    Will match: helloabcworld

    Will not match: hello to world, hello-to-world

  • Example query: cts:word-query("hello * world")

    Actual behavior: Wildcarding on, punctuation insensitive

    Will match: hello to world, hello-to-world

    Will not match: helloaworld, hello world, hello ! world

    Adjacent spaces are collapsed for string comparisons in the server. In the query phrase hello * world, the two spaces (one on each side of the asterisk) are not adjacent to each other, so they are not collapsed. Therefore, hello world is not a match: it has only a single space between hello and world, but hello * world requires two. The phrase hello ! world is also not a match: because the search is punctuation insensitive, ! is treated as a space, and the resulting three consecutive spaces are collapsed to a single space before the string comparison.

  • Example query: cts:word-query("hello! world")

    Actual behavior: Wildcarding off, punctuation sensitive

    Will match: hello! world

    Will not match: hello world, hello; world

  • Example query: cts:word-query("hey! world?")

    Actual behavior: Wildcarding on, punctuation sensitive

    Will match: hey! world?, hey! world!, hey! worlds

    Will not match: hey. world

Wildcarding Explicitly Off, Punctuation Sensitivity Not Specified

The following examples show the matches for queries that specify "unwildcarded" and do not specify anything about punctuation-sensitivity.

  • Example query: cts:word-query("hello?world", "unwildcarded")

    Actual behavior: Wildcarding off, punctuation sensitive

    Will match: hello?world

    Will not match: hello world, hello;world

  • Example query: cts:word-query("hello*world", "unwildcarded")

    Actual behavior: Wildcarding off, punctuation sensitive

    Will match: hello*world

    Will not match: helloabcworld

Wildcarding Not Specified, Punctuation Sensitivity Explicitly On (Wildcard Indexes Enabled)

The following examples show queries that are run when at least one wildcard index is enabled and the "punctuation-sensitive" option is explicitly set on the cts:word-query.

  • Example query: cts:word-query("hello?world", "punctuation-sensitive")

    Actual behavior: Wildcarding on, punctuation sensitive

    Will match: hello?world, hello.world, hello*world

    Will not match: hello world, hello ! world

  • Example query: cts:word-query("hello * world", "punctuation-sensitive")

    Actual behavior: Wildcarding on, punctuation sensitive

    Will match: hello abc world, hello ! world

    Will not match: hello-!- world

  • Example query: cts:word-query("hello? world", "punctuation-sensitive")

    Actual behavior: Wildcarding on, punctuation sensitive

    Will match: hello! world, (hello) world

    (hello) world is a match because ? matches ) and ( is not considered part of the word hello.

    Will not match: ahello) world, hello to world
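
To try examples like these without loading documents, one option is to evaluate a query against in-memory nodes with cts:contains. The following is a sketch under two assumptions: that both options are set explicitly, so that the result does not depend on the database index settings, and that cts:contains follows the same matching rules as an indexed search, which you should confirm on your own server. The sample strings are taken from the examples above.

```xquery
xquery version "1.0-ml";

let $query := cts:word-query("hello?world",
                ("wildcarded", "punctuation-sensitive"))
for $text in ("hello?world", "hello.world", "hello world", "hello ! world")
return
  (: Reports, for each sample string, whether the query matches it. :)
  fn:concat($text, " => ", cts:contains(<p>{$text}</p>, $query))
```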
