This chapter describes wildcard searches in MarkLogic Server.
Wildcard searches enable MarkLogic Server to return results that match combinations of characters and wildcards. Wildcard searches are not simply exact string matches, but are based on character pattern matching between the characters specified in a query and words in documents that contain those character patterns.
MarkLogic Server supports two wildcard characters: * and ?. The * wildcard matches zero or more non-space characters, and the ? wildcard matches exactly one non-space character. For example, he* will match any word starting with he, such as he, her, help, hello, helicopter, and so on. On the other hand, he? will only match three-letter words starting with he, such as hem, hen, and so on.
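As a minimal sketch of the two wildcard characters, the following uses cts:contains against an in-memory element, so it does not depend on any database content. The sample sentence is hypothetical, and "wildcarded" is passed explicitly so the result does not depend on the database index settings.

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory sample; not taken from any real database. :)
let $doc := <p>The hen asked her friend for help with the helicopter.</p>
return (
  (: he* is meant to match words such as her, help, and helicopter :)
  cts:contains($doc, cts:word-query("he*", "wildcarded")),
  (: he? is meant to match only three-letter words such as hen and her :)
  cts:contains($doc, cts:word-query("he?", "wildcarded"))
)
```

Each cts:contains call returns a boolean, so this is a quick way to try a pattern before running it as a full cts:search.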
The following are the basic rules for wildcard searches in MarkLogic Server:
Wildcards can be combined within a single search term. For example, m*?? will match words starting with m with three or more characters.
Wildcard matching applies within a single word and does not span word breaks. For example, m*th* will match method but not meet there.
If the * wildcard is specified by itself in a value query (for example, cts:element-value-query, cts:element-value-match), it matches everything (spanning word breaks). For example, * will match the value meet me there.
If the * wildcard is specified with a non-wildcard character, it will match in value lexicon queries (for example, cts:element-value-match), but will not match in value queries (for example, cts:element-value-query). For example, m* will match the value meet me there for a value lexicon search (for example, cts:element-value-match) but will not match the value for a value query search (for example, cts:element-value-query), because the value query only matches the one word. A value search for m* * will match the value (because m* matches the first word and * matches everything after it).
If "wildcarded" is explicitly specified in the cts:query expression, then the search is performed as a wildcard search.
If neither "wildcarded" nor "unwildcarded" is specified in the cts:query expression, the database configuration and query text determine wildcarding. If the database has any wildcard indexes enabled (three character searches, two character searches, one character searches, or trailing wildcard searches) and if the query text contains either of the wildcard characters ? or *, then the wildcard characters are treated as wildcards and the search is performed "wildcarded". If none of the wildcard indexes are enabled, the wildcard characters are treated as punctuation and the search is performed "unwildcarded" (unless "wildcarded" is specified in the cts:query expression).
If the query uses the punctuation-sensitive option, then punctuation is treated as word characters for wildcard searches. For example, a punctuation-sensitive wildcard search for d*benz would match daimler-benz.
If the query uses the whitespace-sensitive option, then whitespace is treated as word characters. This can be useful for matching spaces in wildcarded value queries. You can use the whitespace-sensitive option in wildcarded word queries, too, although it might not make much sense, as it will match more than you might expect.
Wildcard searches use character indexes, lexicons, and trailing wildcard indexes to speed performance. To ensure that wildcard searches are fast, you should enable at least one wildcard index (three character searches, trailing wildcard searches, two character searches, and/or one character searches) and fast element character searches (if you want fast searches within specific elements) in the Admin Interface database configuration screen. Wildcard searches are disabled by default. If you enable character indexes, you should plan on allocating an additional amount of disk space approximately three times the size of the source content.
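To make the value-query rules listed earlier in this section concrete, the following sketch runs three wildcarded value queries against the value meet me there. The doc and title element names are hypothetical, and the element is built in memory so no indexes or stored documents are assumed.

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory element holding the value used in the rules. :)
let $doc := <doc><title>meet me there</title></doc>
return (
  (: m* covers only one word, so per the rules it should not match the value :)
  cts:contains($doc,
    cts:element-value-query(xs:QName("title"), "m*", "wildcarded")),
  (: m* * covers the first word and everything after it :)
  cts:contains($doc,
    cts:element-value-query(xs:QName("title"), "m* *", "wildcarded")),
  (: a lone * is meant to match the whole value :)
  cts:contains($doc,
    cts:element-value-query(xs:QName("title"), "*", "wildcarded"))
)
```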
If any wildcard indexes are enabled for the database, you can further control the use of wildcards at the query level. You can use wildcards with any of the MarkLogic cts:query leaf-level functions, such as cts:word-query, cts:element-word-query, and cts:element-value-query. For details on the cts:query functions, see Composing cts:query Expressions. You can use the "wildcarded" and "unwildcarded" query options to turn wildcarding on or off explicitly in the cts:query constructor functions. See the MarkLogic XQuery and XSLT Function Reference for more details.
If you leave the wildcard option unspecified and there are any wildcard indexes enabled, MarkLogic Server will perform a wildcard query if * or ? is present in the query. For example, the following search function:
cts:search(fn:doc(), cts:word-query("he*"))
will result in a wildcard search. Therefore, as long as any wildcard indexes are enabled in the database, you do not have to turn on wildcarding explicitly to perform wildcard searches.
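When you do want to control wildcarding explicitly, the option is passed to the leaf-level constructor. The following is a minimal sketch of both settings; it assumes only that the database contains some documents to search.

```xquery
xquery version "1.0-ml";

(: Force wildcard handling, whatever the database index settings are. :)
cts:search(fn:doc(), cts:word-query("he*", "wildcarded")),

(: Turn wildcarding off: the * is then handled as punctuation, not a wildcard. :)
cts:search(fn:doc(), cts:word-query("he*", "unwildcarded"))
```

Setting the option explicitly makes the query behave the same way if the index configuration changes later.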
When wildcard indexing is enabled in the database, the system will also deliver higher performance for fn:contains, fn:matches, fn:starts-with, and fn:ends-with for most query expressions.
If character indexes, lexicons, and trailing wildcard indexes are all disabled in a database and wildcarding is explicitly enabled in the query (with the "wildcarded" option to the leaf-level cts:query constructor), the query will execute, but might require a lot of processing. Such queries will be fast if they are very selective and only need to do the wildcard searches over a relatively small amount of content, but can take a long time if they actually need to filter out results from a large amount of content.
To enable any kind of wildcard query functionality with a good combination of performance and database size, MarkLogic recommends you enable the following index settings:
word searches
three character searches
word positions
word lexicon in the codepoint collation
three character word positions
For details, see Understanding the Wildcard Indexes and Understanding the Text Index Settings in the Administrator's Guide.
This combination will provide accurate and fast wildcard queries for a wide variety of wildcard searches, including leading and trailing wildcarded searches. If you add the trailing wildcard searches index, you will get slightly more efficient trailing wildcard searches, but with increased database size.
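The recommended settings can also be applied with the Admin API rather than the Admin Interface. The following is a sketch, not a complete provisioning script: the database name "Documents" is a placeholder, and the word lexicon call assumes that no codepoint word lexicon has been added to the database yet (adding a duplicate raises an error).

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name; substitute your own. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
let $config := admin:database-set-word-searches($config, $db, fn:true())
let $config := admin:database-set-three-character-searches($config, $db, fn:true())
let $config := admin:database-set-word-positions($config, $db, fn:true())
let $config := admin:database-set-three-character-word-positions($config, $db, fn:true())
(: Add a word lexicon in the codepoint collation. :)
let $config := admin:database-add-word-lexicon($config, $db,
  admin:database-word-lexicon("http://marklogic.com/collation/codepoint"))
return admin:save-configuration($config)
```

Changing index settings on a database that already holds content typically triggers reindexing, so plan for the additional load before saving the configuration.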
If you only need wildcards against specific XML elements, XML attributes, JSON properties, or fields, you should consider using an element or field word lexicon instead of a general word lexicon. Doing so can improve the speed and accuracy of wildcard matching. Consider this option if you're primarily performing wildcard searches using the following query types or their equivalent:
You configure the index settings at the database level, using the Admin Interface or Admin APIs (XQuery, Server-Side JavaScript, or REST). For details on configuring database settings and on other text indexes, see Database Settings and Text Indexing in the Administrator's Guide.
The following database settings can affect the performance and accuracy of wildcard searches. For details, see Understanding the Text Index Settings in the Administrator's Guide.
three character searches, two character searches, or one character searches. You do not need one or two character searches if three character searches is enabled.
three character word positions
trailing wildcard searches, trailing wildcard word positions, fast element trailing wildcard searches
fast element character searches
The three character searches index combined with the word lexicon provides the best performance for most queries, and the fast element character searches index is useful when you submit element queries. One and two character searches indexes are only used if you submit wildcard searches that try to match only one or two characters and you do not have the combination of a word lexicon and the three character searches index. Because one and two character searches generally return a large number of matches, they might not justify the disk space and load time trade-offs.
If you have the three character searches index enabled and two and one character indexes disabled, and if you have no word lexicon, it is still possible to issue a wildcard query that searches for a two or one character stem (for example, ab* or a*); these searches are allowed, but will not be fast. If you have a search user interface that allows users to enter such queries, you might want to check for these two or one character wildcard search patterns and issue an error, as these searches without the corresponding indexes can be slow and resource-intensive. Alternatively, add a codepoint collation word lexicon to your database.
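One way to perform that check is sketched below. The function name local:check-wildcard-term and the error name SHORT-WILDCARD are hypothetical, and the three-character threshold is an assumption based on the three character searches index. Adjust the rule to match the indexes you actually have enabled.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: rejects wildcard terms whose longest run of
   non-wildcard characters is shorter than three characters. :)
declare function local:check-wildcard-term($term as xs:string) as xs:string
{
  (: Split the term on runs of * and ? to get the literal pieces. :)
  let $runs := fn:tokenize($term, "[*?]+")
  let $longest := fn:max(for $r in $runs return fn:string-length($r))
  return
    if (fn:matches($term, "[*?]") and $longest lt 3)
    then fn:error(xs:QName("SHORT-WILDCARD"),
           fn:concat("Wildcard term needs at least three literal characters: ", $term))
    else $term
};

(: Validate the user's term before building the query. :)
cts:search(fn:doc(), cts:word-query(local:check-wildcard-term("hel*")))
```

With this rule, terms such as ab* and a* raise an error before any search runs, while terms without wildcards pass through unchanged.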
As with all indexing, choosing which indexes to use is a trade-off. Enabling more indexes provides improved query performance, but uses more disk space and increases load and reindexing time. For most environments where wildcard searches are required, MarkLogic recommends enabling the three character searches index and a codepoint collation word lexicon, but disabling one and two character searches.
If you only need to perform wildcard searches on specific elements, attributes, JSON properties, or fields, you can save some space and potentially improve accuracy by using an element, attribute, or field word lexicon instead of a general word lexicon.
Also, if you just want to apply wildcard searches to selected content, fields enable you to leave the wildcard indexes disabled at the database level, while still enabling them at the field level. For details, see Understanding Field Configurations in the Administrator's Guide.
This section describes the interactions between wildcard, stemming, and other search features in MarkLogic Server.
Wildcard searches can be used in combination with stemming (for details on stemming, see Understanding and Using Stemmed Searches); that is, queries can perform stemmed searches and wildcard searches at the same time. However, the system will not perform a stemmed search on words that are wildcarded. For example, assume a search phrase of running car*. The term running will be matched based on its stem. However, car* will be matched based on a wildcard search, and will match car, cars, carriage, carpenter, and so on; stemmed word matches for the words matching the wildcard are not returned.
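The running car* phrase described above could be issued as follows. This is a sketch that assumes stemmed searches are enabled in the database; both options are set explicitly here only to make the intent visible.

```xquery
xquery version "1.0-ml";

(: running is matched by its stem; car* is matched by wildcard only. :)
cts:search(fn:doc(),
  cts:word-query("running car*", ("stemmed", "wildcarded")))
```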
Stemming and punctuation sensitivity perform independently of each other. However, there is an interaction between wildcarding and punctuation sensitivity, which this section describes.
MarkLogic Server allows you to explicitly specify whether a query is punctuation sensitive and whether it uses wildcards. You specify this in the options for the query, as in the following example:
cts:search(fn:doc(), cts:word-query("hello!", "punctuation-sensitive") )
If you include a wildcard character in a punctuation sensitive search, it will treat the wildcard as punctuation. For example, the following query matches hello*, but not hellothere:
cts:search(fn:doc(), cts:word-query("hello*", "punctuation-sensitive") )
If the punctuation sensitivity option is left unspecified, the system performs a punctuation sensitive search if there is any non-wildcard punctuation in the query terms. For example, if punctuation is not specified, the following query:
cts:search(fn:doc(), cts:word-query("hello!") )
will result in a punctuation sensitive search, and the following query:
cts:search(fn:doc(), cts:word-query("hello") )
will result in a punctuation insensitive search.
If a search is punctuation sensitive (whether implicitly or explicitly), MarkLogic Server will match the punctuation as well as the search term. Note that punctuation is not considered to be part of a word. For example, mark! is considered to be a word mark next to an exclamation point. If a search is punctuation insensitive, punctuation will match spaces.
The characters ? and * are considered punctuation in documents loaded into the database. However, ? and * are also treated as wildcard characters in a query. This makes for interesting (and occasionally confusing) interaction between wildcarding and punctuation sensitivity.
The following are the rules for the interaction between punctuation and wildcarding. They will help you determine how the system behaves when there are interactions between the punctuation and wildcard characters.
If no wildcard indexes are enabled in the database, queries default to "unwildcarded", and wildcard characters are treated as punctuation. If you specify "wildcarded" in the query, the query is a wildcard query and wildcard characters are treated as wildcards.
Wildcarding takes precedence over punctuation sensitivity. If the * and/or ? characters are present in a query, * and ? are treated as wildcards and not punctuation unless wildcarding is turned off. If wildcarding is turned off in the query ("unwildcarded"), they are treated as punctuation.
If wildcarding is off and the search is punctuation insensitive, and punctuation characters (including * and ?) are in the query, they are treated as spaces.
Wildcarding and punctuation sensitivity can both be on at the same time. For example, the following query matches both hello* and hellothere:
cts:search(fn:doc(),
  cts:word-query("hello*",
    ("punctuation-sensitive", "wildcarded") )
)
This section contains examples of the output of queries.
The following examples show queries that are run when at least one wildcard index is enabled and no options are explicitly set on the cts:word-query.
cts:word-query("hello world")
Actual behavior: Wildcarding off, punctuation insensitive
Will match: hello world, hello ?! world, hello? world!, and so on
cts:word-query("hello?world")
cts:word-query("hello*world")
cts:word-query("hello * world")
Actual behavior: Wildcarding on, punctuation insensitive
Will match: hello to world, hello-to-world
Will not match: helloaworld, hello world, hello ! world
Adjacent spaces are collapsed for string comparisons in the server. In the query phrase hello * world, the two spaces on each side of the asterisk are not collapsed for comparison since they are not adjacent to each other. Therefore, hello world is not a match since there is only a single space between hello and world but hello * world requires two spaces because the spaces were not collapsed. The phrase hello ! world is also not a match because ! is treated as a space (punctuation insensitive), and then all three consecutive spaces are collapsed to a single space before the string comparison.
cts:word-query("hello! world")
cts:word-query("hey! world?")
Actual behavior: Wildcarding on, punctuation sensitive
The following examples show the matches for queries that specify "unwildcarded" and do not specify anything about punctuation-sensitivity.
The following examples show queries that are run when at least one wildcard index is enabled and the "punctuation-sensitive" option is explicitly set on the cts:word-query.
cts:word-query("hello?world", "punctuation-sensitive")
Actual behavior: Wildcarding on, punctuation sensitive
cts:word-query("hello * world", "punctuation-sensitive")
Actual behavior: Wildcarding on, punctuation sensitive
cts:word-query("hello? world", "punctuation-sensitive")
Actual behavior: Wildcarding on, punctuation sensitive
Will match: hello! world, (hello) world
(hello) world is a match because ? matches ) and ( is not considered part of the word hello.