MarkLogic Server supports stemming in English and other languages. For a list of languages in which stemming is supported, see Supported Languages.
If stemmed searches are enabled in the database configuration, MarkLogic Server automatically searches for words that share a stem with the word specified in the query, not just for the exact string specified in the query. A stemmed search for a word finds the exact term as well as terms that derive from the same meaning and part of speech as the search term. The stem of a word is not based on spelling. For example, card and cardiac have different stems even though the spelling of cardiac begins with card. On the other hand, running and ran have the same stem (run) even though their spellings are quite different. If you want to search for a word based on partial pattern matching (like the card and cardiac example above), use wildcard searches as described in Understanding and Using Wildcard Searches.
Stemming enables the search to return highly relevant matches that would otherwise be missed. For example, the terms run, running, and ran all have the same stem. Therefore, a stemmed search for the term run returns results that match elements containing the terms running, ran, and runs, as well as results for the specified term run.
The stemming supported in MarkLogic Server does not cross different parts of speech. For example, conserve (verb) and conservation (noun) are not considered to have the same stem because they have different parts of speech. Consequently, if you search for conserve with stemmed searches enabled, the results will include documents containing conserve and conserves, but not documents with conservation (unless conserve or conserves also appears).
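The part-of-speech behavior described above can be checked against in-memory text nodes. The following is a minimal sketch, assuming English is the default language; the sample sentences are illustrative, and the comments state what the description above predicts rather than captured output.

```xquery
xquery version "1.0-ml";

(: Sample sentences are illustrative only. :)
let $query := cts:word-query("conserve", "stemmed")
return (
  (: verb form with the same stem: expected to match :)
  cts:contains(text { "She conserves water." }, $query),
  (: noun form: not expected to match, per the description above :)
  cts:contains(text { "Water conservation matters." }, $query)
)
```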
Stemming is language-specific: each word is treated as belonging to the specified language. The language can be specified with an xml:lang attribute or by several other methods, and a term in one language will not match a stemmed search for the same term in another language. For details on how languages affect queries, see Querying Documents By Languages.
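As a rough illustration of language-specific matching, the sketch below tags an in-memory element with xml:lang and runs the same word query under two languages using the "lang=" query option. The element name and content are hypothetical, and the sketch assumes French is available in your installation; based on the description above, only the query whose language agrees with the content would be expected to match.

```xquery
xquery version "1.0-ml";

(: Hypothetical element; xml:lang marks its text as French. :)
let $node := <para xml:lang="fr">chat</para>
return (
  (: query language agrees with the content language :)
  cts:contains($node, cts:word-query("chat", ("stemmed", "lang=fr"))),
  (: same spelling, but the query term is treated as English :)
  cts:contains($node, cts:word-query("chat", ("stemmed", "lang=en")))
)
```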
To use stemming in your searches, stemming must be enabled in your database configuration. All new databases created in MarkLogic Server have stemming enabled by default. Stemmed searches require indexes which are created at load, update, or reindex time. If you enable stemming in an existing database that did not previously have stemming enabled, you must either reload or reindex the database to ensure that you get stemmed results from searches. You should plan on allocating an additional amount of disk space about twice the size of the source content if you enable stemming.
There are three types of stemming available in MarkLogic Server: basic, advanced, and decompounding. The following table describes the stemming options available on the database configuration page of the Admin Interface.
|Stemming Option|Description|
|---|---|
|OFF|No words are indexed for stemming.|
|Basic|This is the default. Each word is indexed to a single stem.|
|Advanced|Each word is indexed to one or more stems. Some words can have two or more meanings, and can therefore have multiple stems.|
|Decompounding|All stems for each word are indexed, and smaller component words of large compound words are also indexed. Mostly used in languages such as German that use compound words.|
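The same setting can also be changed programmatically. The following is a minimal sketch using the Admin API, assuming a database named "Documents" (substitute your own) and a user with sufficient privileges; it is not the only way to apply the setting.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is an assumed database name. :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
(: Accepted values: "off", "basic", "advanced", "decompounding" :)
let $config := admin:database-set-stemmed-searches($config, $db, "advanced")
return admin:save-configuration($config)
```

As noted above, content loaded before the change must be reloaded or reindexed before stemmed searches return complete results.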
If stemming is enabled for the database, you can further control the use of stemming at the query level. Stemming can be used with any of the MarkLogic cts:query constructor functions, such as cts:word-query, cts:element-word-query, and cts:element-value-query. The stemming options, "stemmed" or "unstemmed", can be specified in the options parameter to the cts:query constructor. For more details on these functions, see the MarkLogic XQuery and XSLT Function Reference (http://docs.marklogic.com).
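To make the query-level options concrete, the sketch below passes "stemmed" and "unstemmed" to the same constructor and tests each query against an in-memory node. The element names and content are hypothetical; the stemmed query would be expected to match "runs", while the unstemmed query should match only the exact word "run".

```xquery
xquery version "1.0-ml";

(: Hypothetical sample document. :)
let $node := <doc><title>He runs daily</title></doc>
let $title := xs:QName("title")
return (
  (: matches words sharing a stem with "run" :)
  cts:contains($node, cts:element-word-query($title, "run", "stemmed")),
  (: matches only the exact word "run" :)
  cts:contains($node, cts:element-word-query($title, "run", "unstemmed"))
)
```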
Query terms that contain a wildcard will not be stemmed. If you leave the stemming option unspecified, the system will perform a stemmed search for any word that does not contain a wildcard. Therefore, as long as stemming is enabled in the database, you do not have to enable stemming explicitly in a query.
The stemmed search indexes and word search (unstemmed) indexes have overlapping functionality, and there is a good chance you can get the results you want with only the stemmed search indexes enabled (that is, leaving the word search indexes turned off).
Stemmed searches return relevance-ranked results for the words you search for as well as for words with the same stem as the words you search for. Therefore, you will get the same results as with a word search plus the results for items containing words with the same stem. In most search applications, this is the desirable behavior.
The only time you need to also have word search indexes enabled is when your application requires an exact word search to only return the exact match results (that is, to not return results based on stemming).
Additionally, the stemmed search indexes take up less disk space than the word search (unstemmed) indexes. You can therefore save some disk space and decrease load time when you use the default settings of stemmed search enabled and word search turned off in the database configuration. Every index has a cost in terms of disk space used and increased load times. You have to decide based on your application requirements if the cost of creating extra indexes is worthwhile for your application, and whether you can fulfill the same requirements without some of the indexes.
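Before deciding which indexes to keep, it can help to confirm what is currently enabled. The following is a small sketch using the Admin API, again assuming a database named "Documents"; it only reads the configuration and changes nothing.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is an assumed database name. :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
return (
  (: stemming level: "off", "basic", "advanced", or "decompounding" :)
  admin:database-get-stemmed-searches($config, $db),
  (: whether the word search (unstemmed) indexes are enabled :)
  admin:database-get-word-searches($config, $db)
)
```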
If you do need to perform word (unstemmed) searches when you only have stemmed search indexes enabled (that is, when word searches are turned off in the database configuration), you must do so by first doing a stemmed search and then filtering the results with an unstemmed cts:query, as described in Unstemmed Searches.
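The two-step approach can be sketched as follows: a stemmed cts:search selects candidate documents from the index, and cts:contains with an unstemmed query keeps only those containing the exact word. The search term is illustrative, and searching the whole database through fn:doc() is an assumption made for brevity.

```xquery
xquery version "1.0-ml";

(: Step 1: a stemmed search resolves candidates from the stemmed index. :)
for $doc in cts:search(fn:doc(), cts:word-query("run", "stemmed"))
(: Step 2: keep only documents that contain the exact word. :)
where cts:contains($doc, cts:word-query("run", "unstemmed"))
return xdmp:node-uri($doc)
```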
Because stemming enables query matches for terms that do not have the same spelling, it can sometimes be difficult to find the words that actually caused the query to match. You can use cts:highlight to test and/or highlight the words that actually matched the query. For details on cts:highlight, see the MarkLogic XQuery and XSLT Function Reference and Highlighting Search Term Matches.
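As an illustration, the sketch below wraps each matching word of an in-memory paragraph in a b element. The sample sentence and the choice of markup are assumptions; with a stemmed query for run, the words ran and runs would be expected to be wrapped.

```xquery
xquery version "1.0-ml";

(: $cts:text is bound to each span of text that matched the query. :)
cts:highlight(
  <p>She ran home while he runs errands.</p>,
  cts:word-query("run", "stemmed"),
  <b>{ $cts:text }</b>
)
```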
You can also use cts:contains to test if a word matches the query. The cts:contains function returns true if there is a match and false if there is no match. For example, you can use the following function to test if a word has the same stem as another word.
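A minimal sketch of such a function is shown here. The name local:same-stem and the sample words are illustrative choices: the function treats the first word as content and tests it against a stemmed word query built from the second.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: true when $word1 matches a stemmed query for $word2. :)
declare function local:same-stem(
  $word1 as xs:string,
  $word2 as xs:string
) as xs:boolean
{
  cts:contains(text { $word1 }, cts:word-query($word2, "stemmed"))
};

local:same-stem("ran", "run")
```

Given the run and ran example earlier in this section, this call would be expected to return true.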