This chapter describes how to use the stemmed search functionality in MarkLogic Server.
MarkLogic Server supports stemming in English and other languages. For a list of languages in which stemming is supported, see Supported Languages.
If stemmed searches are enabled in the database configuration, MarkLogic Server automatically searches for words that share a stem with the word specified in the query, not just the exact string specified in the query. A stemmed search for a word finds the exact term as well as terms that derive from the same meaning and part of speech as the search term. The stem of a word is not based on spelling. For example, card and cardiac have different stems even though the spelling of cardiac begins with card. On the other hand, running and ran have the same stem (run) even though their spellings are quite different. If you want to search for a word based on partial pattern matching (like the card and cardiac example above), use wildcard searches as described in Understanding and Using Wildcard Searches.
Stemming enables the search to return highly relevant matches that would otherwise be missed. For example, the terms run, running, and ran all have the same stem. Therefore, a stemmed search for the term run returns results that match elements containing the terms running, ran, and runs, as well as results for the specified term run.
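As a minimal sketch of the behavior described above, the following query searches for the word run. It assumes stemmed searches are enabled in the database; using fn:doc() as the search scope is an illustrative choice, not something this chapter prescribes.

```xquery
xquery version "1.0-ml";

(: Assumes stemmed searches are enabled in the database.
   No stemming option is given, so the word query is stemmed and
   documents containing run, runs, running, or ran can match. :)
cts:search(fn:doc(), cts:word-query("run"))
```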
The stemming supported in MarkLogic Server does not cross different parts of speech. For example, conserve (verb) and conservation (noun) are not considered to have the same stem because they have different parts of speech. Consequently, if you search for conserve with stemmed searches enabled, the results will include documents containing conserve and conserves, but not documents with conservation (unless conserve or conserves also appears).
Stemming is language-specific: each word is treated as belonging to the specified language. The language can be specified with an xml:lang attribute or by several other methods, and a term in one language will not match a stemmed search for the same term in another language. For details on how languages affect queries, see Querying Documents By Languages.
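The following sketch illustrates one way the language can come into play: the content is tagged with xml:lang, and the query language is set with the "lang=" option of cts:word-query. The element name, the sample sentence, and the choice of French are assumptions made for illustration, and the result depends on which languages are available in your installation, so no output is shown.

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory content marked as French with xml:lang. :)
let $para := <para xml:lang="fr">Les chats dorment.</para>
return (
  (: Query term evaluated as French. :)
  cts:contains($para, cts:word-query("chat", "lang=fr")),
  (: Same term evaluated as English. Per the text above, a term in one
     language is not expected to match a stemmed search in another. :)
  cts:contains($para, cts:word-query("chat", "lang=en"))
)
```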
To use stemming in your searches, stemming must be enabled in your database configuration. All new databases created in MarkLogic Server have stemming enabled by default. Stemmed searches require indexes, which are created at load, update, or reindex time. If you enable stemming in an existing database that did not previously have stemming enabled, you must either reload or reindex the database to ensure that you get stemmed results from searches. If you enable stemming, plan to allocate additional disk space of about twice the size of the source content.
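Besides the Admin Interface, the setting can also be changed from a script. The following is a sketch using the Admin API, not the only way to do it. The database name "Documents" is a placeholder, and "basic" is just one of the stemming levels. Existing content must still be reloaded or reindexed afterward, as noted above.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name; substitute your own. :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
(: Set the stemmed search level on the in-memory configuration. :)
let $config := admin:database-set-stemmed-searches($config, $db, "basic")
(: Persist the modified configuration. :)
return admin:save-configuration($config)
```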
There are three types of stemming available in MarkLogic Server: basic, advanced, and decompounding. The following table describes the stemming options available on the database configuration page of the Admin Interface.
If stemming is enabled for the database, you can further control the use of stemming at the query level. Stemming can be used with any of the MarkLogic cts:query constructor functions, such as cts:word-query, cts:element-word-query, and cts:element-value-query. Stemming options, "stemmed" or "unstemmed", can be specified in the options parameter to the cts:query constructor. For more details on these functions, see the MarkLogic XQuery and XSLT Function Reference (http://docs.marklogic.com).
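As an illustration of the query-level options, the sketch below runs the same word against an in-memory node twice, once with "stemmed" and once with "unstemmed". The node and its text are hypothetical. Based on the description in this chapter, the stemmed query is expected to match running and the unstemmed query is not, but verify this on your own system.

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory node used only for illustration. :)
let $node := <p>She was running late.</p>
return (
  (: Explicitly request a stemmed match. :)
  cts:contains($node, cts:word-query("run", "stemmed")),
  (: Explicitly request an exact (unstemmed) match. :)
  cts:contains($node, cts:word-query("run", "unstemmed"))
)
```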
Query terms that contain a wildcard will not be stemmed. If you leave the stemming option unspecified, the system will perform a stemmed search for any word that does not contain a wildcard. Therefore, as long as stemming is enabled in the database, you do not have to enable stemming explicitly in a query.
If stemming is turned off in the database, and stemming is explicitly specified in the query, the query will throw an error.
The stemmed search indexes and word search (unstemmed) indexes have overlapping functionality, and there is a good chance you can get the results you want with only the stemmed search indexes enabled (that is, leaving the word search indexes turned off).
Stemmed searches return relevance-ranked results for the words you search for as well as for words with the same stem as the words you search for. Therefore, you will get the same results as with a word search plus the results for items containing words with the same stem. In most search applications, this is the desirable behavior.
The only time you need to also have word search indexes enabled is when your application requires an exact word search to only return the exact match results (that is, to not return results based on stemming).
Additionally, the stemmed search indexes take up less disk space than the word search (unstemmed) indexes. You can therefore save some disk space and decrease load time when you use the default settings of stemmed search enabled and word search turned off in the database configuration. Every index has a cost in terms of disk space used and increased load times. You have to decide based on your application requirements if the cost of creating extra indexes is worthwhile for your application, and whether you can fulfill the same requirements without some of the indexes.
If you do need to perform word (unstemmed) searches when you only have stemmed search indexes enabled (that is, when word searches are turned off in the database configuration), you must do so by first doing a stemmed search and then filtering the results with an unstemmed cts:query, as described in Unstemmed Searches.
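The two-step approach can be sketched as follows. This is one possible shape under stated assumptions; the Unstemmed Searches section may present it differently. The search scope fn:doc() and the choice to return document URIs are illustrative.

```xquery
xquery version "1.0-ml";

(: Step 1: stemmed search, resolved from the stemmed search indexes.
   Step 2: keep only documents that contain the exact word, by
   filtering each result with an unstemmed query. :)
for $doc in cts:search(fn:doc(), cts:word-query("run"))
where cts:contains($doc, cts:word-query("run", "unstemmed"))
return xdmp:node-uri($doc)
```

Because the second step examines every document returned by the first, this filter can be costly when the stemmed search returns many results.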
Because stemming enables query matches for terms that do not have the same spelling, it can sometimes be difficult to find the words that actually caused the query to match. You can use cts:highlight to test and/or highlight the words that actually matched the query. For details on cts:highlight, see the MarkLogic XQuery and XSLT Function Reference and Highlighting Search Term Matches.
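For instance, a sketch along these lines wraps each matching word in a marker element so that you can see which terms the stemmed query matched. The input node and the choice of a b element as the wrapper are assumptions for illustration.

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory node; <b> is an arbitrary wrapper element. :)
let $node := <p>She ran yesterday and is running today.</p>
(: $cts:text is bound to each matched word inside the third argument. :)
return cts:highlight($node, cts:word-query("run"), <b>{$cts:text}</b>)
```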
You can also use cts:contains to test if a word matches the query. The cts:contains function returns true if there is a match and false if there is no match. For example, you can use the following function to test if a word has the same stem as another word.
xquery version "1.0-ml";

declare function local:same-stem(
  $word1 as xs:string,
  $word2 as xs:string
) as xs:boolean
{
  cts:contains(text{$word1}, $word2)
};

(: The following returns true because running has the same stem as run :)
local:same-stem("run", "running")
For information about how stemmed searches and wildcard searches interact, see Interaction with Other Search Features.