Wildcard searches enable MarkLogic Server to return results that match combinations of characters and wildcards. Wildcard searches are not simply exact string matches, but are based on character pattern matching between the characters specified in a query and words in documents that contain those character patterns. This section describes wildcards and includes the following topics:
he* will match any word starting with he, such as helicopter. On the other hand, he? will only match three-letter words starting with he, such as hen.
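The matching behavior of the two wildcard characters can be illustrated with a small sketch. This is not how MarkLogic Server implements wildcards (the server resolves them through indexes and lexicons); it only models the pattern semantics, assuming that * stands for zero or more non-space characters and ? for exactly one non-space character. The function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regular expression.

    Assumption for illustration: '*' is zero or more non-space characters
    and '?' is exactly one non-space character.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\S*")
        elif ch == "?":
            parts.append(r"\S")
        else:
            # Every other character must match literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def word_matches(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None


print(word_matches("he*", "helicopter"))  # True
print(word_matches("he?", "hen"))         # True
print(word_matches("he?", "helicopter"))  # False
```

Because neither wildcard matches a space in this model, a pattern never spans a word break, which mirrors the single-word behavior described in this section.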
m*?? will match words starting with m with three or more characters.
m*th* will match method but not a phrase that spans a word break.
If the * wildcard is specified by itself in a value query (for example, cts:element-value-query, cts:element-value-match), it matches everything (spanning word breaks). For example, * will match the value meet me there.
If the * wildcard is specified with a non-wildcard character, it will match in value lexicon queries (for example, cts:element-value-match), but will not match in value queries (for example, cts:element-value-query). For example, m* will match the value meet me there for a value lexicon search (for example, cts:element-value-match) but will not match the value for a value query search (for example, cts:element-value-query), because the value query only matches the one word. A value search for m* * will match the value (because m* matches the first word and * matches everything after it).
If "wildcarded" is explicitly specified in the cts:query expression, then the search is performed as a wildcard search.
If neither "wildcarded" nor "unwildcarded" is specified in the cts:query expression, the database configuration and query text determine wildcarding. If the database has any wildcard indexes enabled (three character searches, two character searches, one character searches, or trailing wildcard searches) and if the query text contains either of the wildcard characters ? or *, then the wildcard characters are treated as wildcards and the search is performed "wildcarded". If none of the wildcard indexes are enabled, the wildcard characters are treated as punctuation and the search is performed "unwildcarded" unless "wildcarded" is specified in the cts:query expression.
If the query has the punctuation-sensitive option, then punctuation is treated as word characters for wildcard searches.
If the query has the whitespace-sensitive option, then whitespace is treated as word characters. This can be useful for matching spaces in wildcarded value queries. You can use the whitespace-sensitive option in wildcarded word queries, too, although it might not make much sense, as it will match more than you might expect.
Wildcard searches use character indexes, lexicons, and trailing wildcard indexes to speed performance. To ensure that wildcard searches are fast, you should enable at least one wildcard index (three character searches, trailing wildcard searches, two character searches, and/or one character searches) and fast element character searches (if you want fast searches within specific elements) in the Admin Interface database configuration screen. Wildcard indexes are disabled by default. If you enable character indexes, you should plan on allocating additional disk space of approximately three times the size of the source content.
If any wildcard indexes are enabled for the database, you can further control the use of wildcards at the query level. You can use wildcards with any of the MarkLogic cts:query leaf-level functions, such as cts:word-query, cts:element-word-query, and cts:element-value-query. For details on the cts:query functions, see Composing cts:query Expressions. You can use the "wildcarded" or "unwildcarded" query option to turn wildcarding on or off explicitly in the cts:query constructor functions. See the MarkLogic XQuery and XSLT Function Reference for more details.
If you leave the wildcard option unspecified and there are any wildcard indexes enabled, MarkLogic Server will perform a wildcard query if * or ? is present in the query. For example, the following search function:
If character indexes, lexicons, and trailing wildcard indexes are all disabled in a database and wildcarding is explicitly enabled in the query (with the "wildcarded" option to the leaf-level cts:query constructor), the query will execute, but might require a lot of processing. Such queries will be fast if they are very selective and only need to do the wildcard searches over a relatively small amount of content, but can take a long time if they actually need to filter out results from a large amount of content.
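The way the query option, the database configuration, and the query text combine to decide whether a search is wildcarded can be summarized in a short sketch. This is a simplified model of the rules described in this section, not MarkLogic code; the function name and parameters are hypothetical.

```python
from typing import Optional

WILDCARD_CHARS = ("*", "?")


def is_wildcard_search(
    query_text: str,
    option: Optional[str] = None,
    any_wildcard_index_enabled: bool = False,
) -> bool:
    """Model whether a leaf query is treated as a wildcard search.

    option is "wildcarded", "unwildcarded", or None when neither is given.
    """
    if option == "wildcarded":
        # An explicit option always wins, even with no wildcard indexes.
        return True
    if option == "unwildcarded":
        return False
    if option is not None:
        raise ValueError(f"unexpected option: {option!r}")
    # Option left unspecified: wildcarding needs at least one wildcard
    # index and at least one wildcard character in the query text.
    has_wildcard_char = any(ch in query_text for ch in WILDCARD_CHARS)
    return any_wildcard_index_enabled and has_wildcard_char


print(is_wildcard_search("he*", any_wildcard_index_enabled=True))   # True
print(is_wildcard_search("he*", any_wildcard_index_enabled=False))  # False
print(is_wildcard_search("he*", option="wildcarded"))               # True
```

The third call corresponds to the case described above: the query runs as a wildcard search without index support, so it executes but might be expensive.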
three character searches
word lexicon in the codepoint collation
three character word positions
This combination will provide accurate and fast wildcard queries for a wide variety of wildcard searches, including leading and trailing wildcarded searches. If you add the trailing wildcard searches index, you will get slightly more efficient trailing wildcard searches, but with increased database size.
If you only need wildcards against specific XML elements, XML attributes, JSON properties, or fields, you should consider using an element or field word lexicon instead of a general word lexicon. Doing so can improve the speed and accuracy of wildcard matching. Consider this option if you're primarily performing wildcard searches using the following query types or their equivalent:
three character searches, two character searches, or one character searches. You do not need one or two character searches if three character searches is enabled.
three character word positions
trailing wildcard searches, trailing wildcard word positions, fast element trailing wildcard searches
fast element character searches
The three character searches index combined with the word lexicon provides the best performance for most queries, and the fast element character searches index is useful when you submit element queries. One and two character searches indexes are only used if you submit wildcard searches that try to match only one or two characters and you do not have the combination of a word lexicon and the three character searches index. Because one and two character searches generally return a large number of matches, they might not justify the disk space and load time trade-offs.
If you have the three character searches index enabled and two and one character indexes disabled, and if you have no word lexicon, it is still possible to issue a wildcard query that searches for a two or one character stem (for example, a*); these searches are allowed, but will not be fast. If you have a search user interface that allows users to enter such queries, you might want to check for these two or one character wildcard search patterns and issue an error, as these searches without the corresponding indexes can be slow and resource-intensive. Alternatively, add a codepoint collation word lexicon to your database.
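A user interface check of this kind might look like the following sketch. It assumes that a wildcarded term is "short" when its longest run of consecutive non-wildcard characters is under three, which is one reasonable reading of a two or one character stem; the function name and threshold parameter are hypothetical, and the right policy depends on which indexes your database has.

```python
import re


def short_wildcard_terms(query: str, min_literal_run: int = 3) -> list:
    """Return wildcarded terms whose longest literal run is too short.

    Assumption for illustration: terms are separated by whitespace, and a
    term is flagged when no run of non-wildcard characters reaches
    min_literal_run characters.
    """
    flagged = []
    for term in query.split():
        if "*" not in term and "?" not in term:
            continue  # not a wildcarded term
        # Split the term on wildcard characters to get its literal runs.
        runs = re.split(r"[*?]+", term)
        if max(len(run) for run in runs) < min_literal_run:
            flagged.append(term)
    return flagged


print(short_wildcard_terms("a*"))            # ['a*']
print(short_wildcard_terms("running car*"))  # []
```

An application could reject the query, or ask the user to type more characters, whenever the returned list is not empty.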
As with all indexing, choosing which indexes to use is a trade-off. Enabling more indexes provides improved query performance, but uses more disk space and increases load and reindexing time. For most environments where wildcard searches are required, MarkLogic recommends enabling three character searches and a codepoint collation word lexicon, but disabling one and two character searches.
If you only need to perform wildcard searches on specific elements, attributes, JSON properties, or fields, you can save some space and potentially improve accuracy by using an element, attribute, or field word lexicon instead of a general word lexicon.
Also, if you just want to apply wildcard searches to selected content, fields enable you to leave the wildcard indexes disabled at the database level, while still enabling them at the field level. For details, see Understanding Field Configurations in the Administrator's Guide.
Wildcard searches can be used in combination with stemming (for details on stemming, see Understanding and Using Stemmed Searches); that is, queries can perform stemmed searches and wildcard searches at the same time. However, the system will not perform a stemmed search on words that are wildcarded. For example, assume a search phrase of running car*. The term running will be matched based on its stem. However, car* will be matched based on a wildcard search, and will match carpenter and so on; stemmed word matches for the words matching the wildcard are not returned.
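The per-term split between stemming and wildcarding can be pictured with a minimal sketch. It assumes that stemmed searches are enabled and that the query is treated as wildcarded; the function name and the "stemmed" and "wildcard" labels are hypothetical and only describe which kind of matching applies to each term.

```python
def classify_terms(phrase: str) -> list:
    """Label each whitespace-separated term by how it would be matched."""
    result = []
    for term in phrase.split():
        # A term containing a wildcard character is matched by pattern
        # only; every other term is matched on its stem.
        mode = "wildcard" if ("*" in term or "?" in term) else "stemmed"
        result.append((term, mode))
    return result


print(classify_terms("running car*"))
# [('running', 'stemmed'), ('car*', 'wildcard')]
```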
Stemming and punctuation sensitivity perform independently of each other. However, there is an interaction between wildcarding and punctuation sensitivity. This section describes this interaction and includes the following parts:
If the punctuation sensitivity option is left unspecified, the system performs a punctuation sensitive search if there is any non-wildcard punctuation in the query terms. For example, if punctuation is not specified, the following query:
If a search is punctuation sensitive (whether implicitly or explicitly), MarkLogic Server will match the punctuation as well as the search term. Note that punctuation is not considered to be part of a word. For example, mark! is considered to be a word mark next to an exclamation point. If a search is punctuation insensitive, punctuation will match spaces.
? and * are considered punctuation in documents loaded into the database. However, ? and * are also treated as wildcard characters in a query. This makes for interesting (and occasionally confusing) interactions between wildcarding and punctuation sensitivity.
The following are the rules for the interaction between punctuation and wildcarding. They will help you determine how the system behaves when there are interactions between the punctuation and wildcard characters.
If no wildcard indexes are enabled in the database, queries default to "unwildcarded", and wildcard characters are treated as punctuation. If you specify "wildcarded" in the query, the query is a wildcard query and wildcard characters are treated as wildcards.
If * or ? characters are present in a query, * and ? are treated as wildcards and not punctuation unless wildcarding is turned off. If wildcarding is turned off in the query ("unwildcarded"), they are treated as punctuation.
If wildcarding is turned off and the search is punctuation insensitive, any wildcard characters (* or ?) in the query are treated as spaces.
The following examples show queries that are run when at least one wildcard index is enabled and no options are explicitly set on the cts:word-query.
cts:word-query("hello * world")
Adjacent spaces are collapsed for string comparisons in the server. In the query phrase hello * world, the two spaces on each side of the asterisk are not collapsed for comparison since they are not adjacent to each other. Therefore, hello world is not a match, since there is only a single space between hello and world, whereas hello * world requires two spaces because the spaces were not collapsed. The phrase hello ! world is also not a match because ! is treated as a space (punctuation insensitive), and then all three consecutive spaces are collapsed to a single space before the string comparison.
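The normalization applied to hello ! world in this explanation can be sketched as follows. This is a simplified illustration of the two steps described above (punctuation becomes a space, then adjacent spaces collapse), not the server's actual comparison code; the function name is hypothetical and ASCII punctuation is used as a stand-in for whatever the server considers punctuation.

```python
import re
import string


def normalize_punctuation_insensitive(text: str) -> str:
    """Replace punctuation with spaces, then collapse adjacent spaces."""
    # Step 1: treat each punctuation character as a space.
    spaced = "".join(" " if ch in string.punctuation else ch for ch in text)
    # Step 2: collapse runs of adjacent spaces into a single space.
    return re.sub(r" +", " ", spaced)


print(normalize_punctuation_insensitive("hello ! world"))  # hello world
```

After normalization the phrase has a single space between hello and world, which is why it behaves like hello world in the comparison described above.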
The following examples show queries that are run when at least one wildcard index is enabled and the "punctuation-sensitive" option is explicitly set on the cts:word-query.
cts:word-query("hello * world", "punctuation-sensitive")
cts:word-query("hello? world", "punctuation-sensitive")