Search Developer's Guide — Chapter 21

Understanding and Using Wildcard Searches

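This chapter describes wildcard searches in MarkLogic Server. The following sections are included:

  • Wildcards in MarkLogic Server
  • Enabling Wildcard Searches
  • Interaction with Other Search Features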

Wildcards in MarkLogic Server

Wildcard searches enable MarkLogic Server to return results that match combinations of characters and wildcards. Wildcard searches are not exact string matches; they match character patterns specified in a query against words in documents that contain those patterns. This section describes wildcards and includes the following topics:

  • Wildcard Characters
  • Rules for Wildcard Searches

Wildcard Characters

MarkLogic Server supports two wildcard characters: * and ?.

  • * matches zero or more non-space characters.
  • ? matches exactly one non-space character.

For example, he* will match any word starting with he, such as he, her, help, hello, helicopter, and so on. On the other hand, he? will only match three-letter words starting with he, such as hem, hen, and so on.
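
As a minimal sketch of the difference between the two characters, the following query checks an in-memory node with cts:contains. The sample element and its text are invented for illustration, and the "wildcarded" option is set explicitly so that the result does not depend on the index settings of any particular database.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample content; not taken from any real database. :)
let $node := <p>hello helicopter</p>
return (
  (: he* can match hello or helicopter :)
  cts:contains($node, cts:word-query("he*", "wildcarded")),
  (: he? needs a three-letter word starting with he; none is present here :)
  cts:contains($node, cts:word-query("he?", "wildcarded"))
)
```

Under the definitions above, the first test would be expected to succeed and the second to fail, because neither word in the sample has exactly three letters.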

Rules for Wildcard Searches

The following are the basic rules for wildcard searches in MarkLogic Server:

  • There can be more than one wildcard in a single search term or phrase, and the two wildcard characters can be used in combination. For example, m*?? will match words starting with m with three or more characters.
  • Spaces are used as word breaks, and wildcard matching only works within a single word. For example, m*th* will match method but not meet there.
  • If the * wildcard is specified by itself in a value query (for example, cts:element-value-query, cts:element-value-match), it matches everything (spanning word breaks). For example, * will match the value meet me there.
  • If the * wildcard is combined with non-wildcard characters, it can match a multi-word value in value lexicon queries (for example, cts:element-value-match), but not in value queries (for example, cts:element-value-query). For example, m* matches the value meet me there in a value lexicon search, but not in a value query, because in a value query m* matches only a single word. A value query for m* * does match the value, because m* matches the first word and * matches everything after it.
  • If "wildcarded" is explicitly specified in the cts:query expression, then the search is performed as a wildcard search.
  • If neither "wildcarded" nor "unwildcarded" is specified in the cts:query expression, the database configuration and query text determine wildcarding. If the database has any wildcard indexes enabled (three character searches, two character searches, one character searches, or trailing wildcard searches) and if the query text contains either of the wildcard characters ? or *, then the wildcard characters are treated as wildcards and the search is performed "wildcarded". If none of the wildcard indexes are enabled, the wildcard characters are treated as punctuation and the search is performed unwildcarded (unless "wildcarded" is specified in the cts:query expression).
  • If the query has the punctuation-sensitive option, then punctuation is treated as word characters for wildcard searches. For example, a punctuation-sensitive wildcard search for d*benz would match daimler-benz.
  • If the query has the whitespace-sensitive option, then whitespace is treated as word characters. This can be useful for matching spaces in wildcarded value queries. You can use the whitespace-sensitive option in wildcarded word queries, too, although it might not make much sense, as it will match more than you might expect.
  • You can only perform wildcard matches against JSON properties with text (string) values. Numbers, booleans, and nulls are indexed separately in JSON. For details, see Creating Indexes and Lexicons Over JSON Documents in the Application Developer's Guide.
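
The value query rules above can be tried out with a sketch like the following. The book and title elements are hypothetical, and "wildcarded" is passed explicitly so that the database configuration does not decide how * is interpreted.

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory document holding the value used in the rules above. :)
let $node := <book><title>meet me there</title></book>
for $pattern in ("*", "m*", "m* *")
return
  fn:concat(
    $pattern, " => ",
    cts:contains($node,
      cts:element-value-query(xs:QName("title"), $pattern, "wildcarded"))
  )
```

According to the rules, * and m* * should match the value, while m* on its own should not, because in a value query it covers only one word.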

Enabling Wildcard Searches

Wildcard searches use character indexes, lexicons, and trailing wildcard indexes to speed performance. To ensure that wildcard searches are fast, you should enable at least one wildcard index (three character searches, trailing wildcard searches, two character searches, and/or one character searches) and fast element character searches (if you want fast searches within specific elements) in the Admin Interface database configuration screen. Wildcard searches are disabled by default. If you enable character indexes, you should plan on allocating an additional amount of disk space approximately three times the size of the source content.

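This section describes the following topics:

  • Specifying Wildcards in Queries
  • Recommended Wildcard Index Settings
  • Understanding the Wildcard Indexes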

Specifying Wildcards in Queries

If any wildcard indexes are enabled for the database, you can further control the use of wildcards at the query level. You can use wildcards with any of the MarkLogic cts:query leaf-level functions, such as cts:word-query, cts:element-word-query, and cts:element-value-query. For details on the cts:query functions, see Composing cts:query Expressions. You can use the "wildcarded" and "unwildcarded" query options to turn wildcarding on or off explicitly in the cts:query constructor functions. See the MarkLogic XQuery and XSLT Function Reference for more details.

If you leave the wildcard option unspecified and there are any wildcard indexes enabled, MarkLogic Server will perform a wildcard query if * or ? is present in the query. For example, the following search function:

cts:search(fn:doc(), cts:word-query("he*")) 

will result in a wildcard search. Therefore, as long as any wildcard indexes are enabled in the database, you do not have to turn on wildcarding explicitly to perform wildcard searches.

When wildcard indexing is enabled in the database, the system will also deliver higher performance for fn:contains, fn:matches, fn:starts-with and fn:ends-with for most query expressions.

If character indexes, lexicons, and trailing wildcard indexes are all disabled in a database and wildcarding is explicitly enabled in the query (with the "wildcarded" option to the leaf-level cts:query constructor), the query will execute, but might require a lot of processing. Such queries will be fast if they are very selective and only need to do the wildcard searches over a relatively small amount of content, but can take a long time if they actually need to filter out results from a large amount of content.
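
For comparison, here is a short sketch of the two explicit options applied to the same query text. It assumes nothing about the database beyond the presence of some documents; as noted above, the first search may be expensive if no wildcard indexes are enabled.

```xquery
xquery version "1.0-ml";

(
  (: Force * to be treated as a wildcard, even if no wildcard index is
     enabled; this may be slow over large amounts of content. :)
  cts:search(fn:doc(), cts:word-query("he*", "wildcarded")),

  (: Force * to be treated as punctuation rather than as a wildcard. :)
  cts:search(fn:doc(), cts:word-query("he*", "unwildcarded"))
)
```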

Recommended Wildcard Index Settings

To enable any kind of wildcard query functionality with a good combination of performance and database size, MarkLogic recommends you enable the following index settings:

  • word searches
  • three character word searches
  • word positions
  • word lexicon in the codepoint collation
  • three character word positions

For details, see Understanding the Wildcard Indexes and Understanding the Text Index Settings in the Administrator's Guide.

This combination will provide accurate and fast wildcard queries for a wide variety of wildcard searches, including leading and trailing wildcarded searches. If you add the trailing wildcard searches index, you will get slightly more efficient trailing wildcard searches, but with increased database size.
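
One way to apply the recommended settings is through the Admin API. The following is a sketch rather than a definitive procedure: the database name "Documents" is an assumption, and the call that adds the word lexicon may raise an error if the same lexicon is already configured.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is an assumed database name; substitute your own. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
let $config := admin:database-set-word-searches($config, $db, fn:true())
let $config := admin:database-set-three-character-searches($config, $db, fn:true())
let $config := admin:database-set-word-positions($config, $db, fn:true())
let $config := admin:database-set-three-character-word-positions($config, $db, fn:true())
(: Word lexicon in the codepoint collation. :)
let $config := admin:database-add-word-lexicon($config, $db,
  admin:database-word-lexicon("http://marklogic.com/collation/codepoint"))
return admin:save-configuration($config)
```

Changing index settings on a database that already contains content typically triggers reindexing, so consider applying such changes before loading large amounts of data.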

If you only need wildcards against specific XML elements, XML attributes, JSON properties, or fields, you should consider using an element or field word lexicon instead of a general word lexicon. Doing so can improve the speed and accuracy of wildcard matching. Consider this option if you primarily perform wildcard searches with element, attribute, JSON property, or field queries, such as cts:element-word-query, or their equivalents.

Understanding the Wildcard Indexes

You configure the index settings at the database level, using the Admin Interface or Admin APIs (XQuery, Server-Side JavaScript, or REST). For details on configuring database settings and on other text indexes, see Database Settings and Text Indexing in the Administrator's Guide.

The following database settings can affect the performance and accuracy of wildcard searches. For details, see Understanding the Text Index Settings in the Administrator's Guide.

  • word lexicons
  • element, element attribute, and field word lexicons. (Use an element word lexicon for a JSON property).
  • three character searches, two character searches, or one character searches. You do not need one or two character searches if three character searches is enabled.
  • three character word positions
  • trailing wildcard searches, trailing wildcard word positions, fast element trailing wildcard searches
  • fast element character searches

The three character searches index combined with the word lexicon provides the best performance for most queries, and the fast element character searches index is useful when you submit element queries. One and two character searches indexes are only used if you submit wildcard searches that try to match only one or two characters and you do not have the combination of a word lexicon and the three character searches index. Because one and two character searches generally return a large number of matches, they might not justify the disk space and load time trade-offs.

If you have the three character searches index enabled and the two and one character searches indexes disabled, and if you have no word lexicon, it is still possible to issue a wildcard query that searches for a one- or two-character prefix (for example, ab* or a*); these searches are allowed, but will not be fast. If you have a search user interface that allows users to enter such queries, you might want to check for these one- or two-character wildcard search patterns and issue an error, as these searches without the corresponding indexes can be slow and resource-intensive. Alternatively, add a codepoint collation word lexicon to your database.
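
A check of this kind could look like the following sketch. The function name local:check-wildcard-term and the error code local:SHORT-WILDCARD are hypothetical, and the test is a simple heuristic: it counts the non-wildcard characters in each wildcarded word and does not verify that they are contiguous.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: rejects wildcarded words that contain fewer than
   three literal characters, such as ab* or a*. :)
declare function local:check-wildcard-term($term as xs:string) as xs:string
{
  let $short :=
    for $word in fn:tokenize(fn:normalize-space($term), " ")
    where fn:matches($word, "[*?]")
      and fn:string-length(fn:replace($word, "[*?]", "")) lt 3
    return $word
  return
    if (fn:exists($short))
    then fn:error(xs:QName("local:SHORT-WILDCARD"),
           fn:concat("Wildcard term too short: ", fn:string-join($short, ", ")))
    else $term
};

(: Passes the check and is returned unchanged. :)
local:check-wildcard-term("hel*")
```

Note that this heuristic also rejects a lone *, so it would need adjusting for applications that rely on * by itself in value queries.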

As with all indexing, choosing which indexes to use is a trade-off. Enabling more indexes provides improved query performance, but uses more disk space and increases load and reindexing time. For most environments where wildcard searches are required, MarkLogic recommends enabling the three character searches and a codepoint collation word lexicon, but disabling one and two character searches.

If you only need to perform wildcard searches on specific elements, attributes, JSON properties, or fields, you can save some space and potentially improve accuracy by using an element, attribute, or field word lexicon instead of a general word lexicon.
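
As an illustration, an element word lexicon might be added through the Admin API along these lines. The database name "Documents" and the element name title (in no namespace) are assumptions chosen for the example.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumed database and element names; substitute your own. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
let $lexicon := admin:database-element-word-lexicon(
  "", "title", "http://marklogic.com/collation/codepoint")
let $config := admin:database-add-element-word-lexicon($config, $db, $lexicon)
return admin:save-configuration($config)
```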

Also, if you just want to apply wildcard searches to selected content, fields enable you to leave the wildcard indexes disabled at the database level, while still enabling them at the field level. For details, see Understanding Field Configurations in the Administrator's Guide.

Interaction with Other Search Features

This section describes the interactions between wildcard, stemming, and other search features in MarkLogic Server. The following topics are included:

  • Wildcarding and Stemming
  • Wildcarding and Punctuation Sensitivity

Wildcarding and Stemming

Wildcard searches can be used in combination with stemming (for details on stemming, see Understanding and Using Stemmed Searches); that is, queries can perform stemmed searches and wildcard searches at the same time. However, the system will not perform a stemmed search on words that are wildcarded. For example, assume a search phrase of running car*. The term running will be matched based on its stem. However, car* will be matched based on a wildcard search, and will match car, cars, carriage, carpenter and so on; stemmed word matches for the words matching the wildcard are not returned.
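
The running car* example could be written as the following sketch. It assumes that stemmed searches and at least one wildcard index are enabled in the database; both options are spelled out here only to make the intent visible.

```xquery
xquery version "1.0-ml";

(: Assumes stemmed searches and a wildcard index are enabled.
   "running" is matched by stem; "car*" is matched by wildcard only. :)
cts:search(fn:doc(),
  cts:word-query("running car*", ("stemmed", "wildcarded")))
```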

Wildcarding and Punctuation Sensitivity

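Stemming and punctuation sensitivity operate independently of each other. However, there is an interaction between wildcarding and punctuation sensitivity. This section describes this interaction and includes the following parts:

  • Implicitly and Explicitly Specifying Punctuation
  • Rules for Punctuation and Wildcarding Interaction
  • Examples of Wildcard and Punctuation Interactions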

Implicitly and Explicitly Specifying Punctuation

MarkLogic Server allows you to explicitly specify whether a query is punctuation sensitive and whether it uses wildcards. You specify this in the options for the query, as in the following example:

cts:search(fn:doc(), cts:word-query("hello!", "punctuation-sensitive") ) 

If a punctuation-sensitive search is not performed as a wildcard search (for example, because no wildcard indexes are enabled in the database), a wildcard character in the query is treated as punctuation. In that case, the following query matches hello*, but not hellothere:

cts:search(fn:doc(), cts:word-query("hello*", "punctuation-sensitive") ) 

If the punctuation sensitivity option is left unspecified, the system performs a punctuation sensitive search if there is any non-wildcard punctuation in the query terms. For example, if punctuation is not specified, the following query:

cts:search(fn:doc(), cts:word-query("hello!") ) 

will result in a punctuation sensitive search, and the following query:

cts:search(fn:doc(), cts:word-query("hello") ) 

will result in a punctuation insensitive search.

If a search is punctuation sensitive (whether implicitly or explicitly), MarkLogic Server will match the punctuation as well as the search term. Note that punctuation is not considered to be part of a word. For example, mark! is considered to be a word mark next to an exclamation point. If a search is punctuation insensitive, punctuation will match spaces.

Rules for Punctuation and Wildcarding Interaction

The characters ? and * are considered punctuation in documents loaded into the database. However, ? and * are also treated as wildcard characters in a query. This makes for interesting (and occasionally confusing) interaction between wildcarding and punctuation sensitivity.

The following are the rules for the interaction between punctuation and wildcarding. They will help you determine how the system behaves when there are interactions between the punctuation and wildcard characters.

  1. When wildcard indexes are disabled in the database, all queries default to "unwildcarded", and wildcard characters are treated as punctuation. If you specify "wildcarded" in the query, the query is a wildcard query and wildcard characters are treated as wildcards.
  2. Wildcarding trumps (has precedence over) punctuation sensitivity. That is, if the * and/or ? characters are present in a query, * and ? are treated as wildcards and not punctuation unless wildcarding is turned off. If wildcarding is turned off in the query ("unwildcarded"), they are treated as punctuation.
  3. If wildcarding and punctuation sensitivity are both explicitly off and punctuation characters (including * and ?) are in the query, they are treated as spaces.
  4. Wildcarding and punctuation sensitivity can be on at the same time. In this case, punctuation in a document is treated as characters, and wildcards in the query will match any character in the document, including punctuation characters. Therefore, the following query will match both hello* and hellothere:

    cts:search(fn:doc(),
               cts:word-query("hello*",
                               ("punctuation-sensitive", "wildcarded") )
               )

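
For contrast with rule 4, the following sketch shows the combination described in rule 3, where both features are explicitly turned off. The query text is illustrative only.

```xquery
xquery version "1.0-ml";

(: Both features explicitly off: per rule 3, the * in the query text is
   treated as a space rather than as a wildcard or as punctuation. :)
cts:search(fn:doc(),
  cts:word-query("hello*world",
                 ("unwildcarded", "punctuation-insensitive")))
```

Under rule 3, this query would be expected to behave like a search for the phrase hello world.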
Examples of Wildcard and Punctuation Interactions

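This section contains examples of the output of queries in the following categories:

  • Wildcarding and Punctuation Sensitivity Not Specified (Wildcard Indexes Enabled)
  • Wildcarding Explicitly Off, Punctuation Sensitivity Not Specified
  • Wildcarding Not Specified, Punctuation Sensitivity Explicitly On (Wildcard Indexes Enabled)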

Wildcarding and Punctuation Sensitivity Not Specified (Wildcard Indexes Enabled)

The following examples show queries that are run when at least one wildcard index is enabled and no options are explicitly set on the cts:word-query.

  • Example query: cts:word-query("hello world")

    Actual behavior: Wildcarding off, punctuation insensitive

    Will match: hello world, hello ?! world, hello? world! and so on

  • Example query: cts:word-query("hello?world")

    Actual behavior: Wildcarding on, punctuation insensitive

    Will match: helloaworld

    Will not match: hello world, hello!world

  • Example query: cts:word-query("hello*world")

    Actual behavior: Wildcarding on, punctuation insensitive

    Will match: helloabcworld

    Will not match: hello to world, hello-to-world

  • Example query: cts:word-query("hello * world")

    Actual behavior: Wildcarding on, punctuation insensitive

    Will match: hello to world, hello-to-world

    Will not match: helloaworld, hello world, hello ! world

    Adjacent spaces are collapsed for string comparisons in the server. In the query phrase hello * world, the two spaces, one on each side of the asterisk, are not adjacent to each other, so they are not collapsed. Therefore, hello world is not a match: it contains a single space between hello and world, but hello * world requires two. The phrase hello ! world is also not a match because ! is treated as a space (punctuation insensitive), and the three consecutive spaces are then collapsed to a single space before the string comparison.

  • Example query: cts:word-query("hello! world")

    Actual behavior: Wildcarding off, punctuation sensitive

    Will match: hello! world

    Will not match: hello world, hello; world

  • Example query: cts:word-query("hey! world?")

    Actual behavior: Wildcarding on, punctuation sensitive

    Will match: hey! world?, hey! world!, hey! worlds

    Will not match: hey. world

Wildcarding Explicitly Off, Punctuation Sensitivity Not Specified

The following examples show the matches for queries that specify "unwildcarded" and do not specify anything about punctuation-sensitivity.

  • Example query: cts:word-query("hello?world", "unwildcarded")

    Actual behavior: Wildcarding off, punctuation sensitive

    Will match: hello?world

    Will not match: hello world, hello;world

  • Example query: cts:word-query("hello*world", "unwildcarded")

    Actual behavior: Wildcarding off, punctuation sensitive

    Will match: hello*world

    Will not match: helloabcworld

Wildcarding Not Specified, Punctuation Sensitivity Explicitly On (Wildcard Indexes Enabled)

The following examples show queries that are run when at least one wildcard index is enabled and the "punctuation-sensitive" option is explicitly set on the cts:word-query.

  • Example query: cts:word-query("hello?world", "punctuation-sensitive")

    Actual behavior: Wildcarding on, punctuation sensitive

    Will match: hello?world, hello.world, hello*world

    Will not match: hello world, hello ! world

  • Example query: cts:word-query("hello * world", "punctuation-sensitive")

    Actual behavior: Wildcarding on, punctuation sensitive

    Will match: hello abc world, hello ! world

    Will not match: hello-!- world

  • Example query: cts:word-query("hello? world", "punctuation-sensitive")

    Actual behavior: Wildcarding on, punctuation sensitive

    Will match: hello! world, (hello) world

    (hello) world is a match because ? matches ) and ( is not considered part of the word hello.

    Will not match: ahello) world, hello to world.
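
To experiment with combinations like those listed above, a small harness along the following lines may help. It is a sketch under stated assumptions: the candidate strings are wrapped in a hypothetical p element and tested in memory with cts:contains, and both options are passed explicitly so that the outcome does not depend on which wildcard indexes the database has enabled.

```xquery
xquery version "1.0-ml";

(: Query under test; options are explicit rather than inferred. :)
let $query := cts:word-query("hello? world",
                ("wildcarded", "punctuation-sensitive"))
(: Candidate strings taken from the last example above. :)
for $text in ("hello! world", "(hello) world",
              "ahello) world", "hello to world")
return
  fn:concat($text, " => ", cts:contains(<p>{$text}</p>, $query))
```

Swapping in a different query string or option list makes it possible to compare the results against the other examples in this section.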
