Search Developer's Guide — Chapter 18

Understanding and Using Stemmed Searches

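This chapter describes how to use the stemmed search functionality in MarkLogic Server. It includes the following sections: Stemming in MarkLogic Server, Enabling Stemming, Stemmed Searches Versus Word Searches, Using cts:highlight or cts:contains to Find if a Word Matches a Query, and Interaction With Wildcard Searches.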

Stemming in MarkLogic Server

MarkLogic Server supports stemming in English and other languages. For a list of languages in which stemming is supported, see Supported Languages.

If stemmed searches are enabled in the database configuration, MarkLogic Server automatically searches for words that share a stem with the word specified in the query, not just for the exact string specified in the query. A stemmed search for a word finds that exact term as well as terms that derive from the same meaning and part of speech as the search term. The stem of a word is not based on spelling. For example, card and cardiac have different stems even though the spelling of cardiac begins with card. On the other hand, running and ran have the same stem (run) even though their spellings are quite different. If you want to search for a word based on partial pattern matching (like the card and cardiac example above), use wildcard searches as described in Understanding and Using Wildcard Searches.

Stemming enables the search to return highly relevant matches that would otherwise be missed. For example, the terms run, running, and ran all have the same stem. Therefore, a stemmed search for the term run returns results that match elements containing the terms running, ran, and runs, as well as results for the specified term run.

The stemming supported in MarkLogic Server does not cross different parts of speech. For example, conserve (verb) and conservation (noun) are not considered to have the same stem because they have different parts of speech. Consequently, if you search for conserve with stemmed searches enabled, the results will include documents containing conserve and conserves, but not documents with conservation (unless conserve or conserves also appears).

Stemming is language-specific; that is, each word is treated as being in the specified language. The language can be specified with an xml:lang attribute or by several other methods, and a term in one language will not match a stemmed search for the same term in another language. For details on how languages affect queries, see Querying Documents By Languages.
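
As a minimal sketch of one of those other methods, the query below passes a "lang=fr" option to cts:word-query so that the term is stemmed as French. The search term is chosen only for illustration, and the sketch assumes that stemmed searches are enabled and that French stemming is available in your installation.

```xquery
xquery version "1.0-ml";

(: Sketch only: treat the query term as French when stemming it.
   Assumes stemmed searches are enabled in the database and that
   French is available as a stemming language in this installation. :)
cts:search(
  fn:doc(),
  cts:word-query("chanter", ("stemmed", "lang=fr")))
```

Under these assumptions, the query would not match the same spelling in content marked as another language, because stems are compared within a single language.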

Enabling Stemming

To use stemming in your searches, stemming must be enabled in your database configuration. All new databases created in MarkLogic Server have stemming enabled by default. Stemmed searches require indexes which are created at load, update, or reindex time. If you enable stemming in an existing database that did not previously have stemming enabled, you must either reload or reindex the database to ensure that you get stemmed results from searches. You should plan on allocating an additional amount of disk space about twice the size of the source content if you enable stemming.

There are three types of stemming available in MarkLogic Server: basic, advanced, and decompounding. The following table describes the stemming options available on the database configuration page of the Admin Interface.

Stemming Option | Description
OFF | No words are indexed for stemming.
Basic | This is the default. Each word is indexed to a single stem.
Advanced | Each word is indexed to one or more stems. Some words can have two or more meanings, and can therefore have multiple stems. For example, the word further stems to further (as in he attended the party to further his career) and it stems to far (as in she was further along in her studies than he).
Decompounding | All stems for each word are indexed, and smaller component words of large compound words are also indexed. Mostly used in languages such as German that use compound words.
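
The same setting can also be changed programmatically. The following is a minimal sketch that uses the Admin API function admin:database-set-stemmed-searches. The database name "Documents" and the choice of "advanced" are assumptions for illustration, and the reindexing and disk space considerations described above still apply.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Sketch only: switch the stemmed searches setting of a database.
   "Documents" is an assumed database name; substitute your own.
   The value is expected to be one of "off", "basic", "advanced",
   or "decompounding". :)
let $config := admin:get-configuration()
let $db     := xdmp:database("Documents")
let $config := admin:database-set-stemmed-searches($config, $db, "advanced")
return admin:save-configuration($config)
```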

If stemming is enabled for the database, you can further control the use of stemming at the query level. Stemming can be used with any of the MarkLogic cts:query constructor functions, such as cts:word-query, cts:element-word-query, and cts:element-value-query. Stemming options, "stemmed" or "unstemmed", can be specified in the options parameter to the cts:query constructor. For more details on these functions, see the MarkLogic XQuery and XSLT Function Reference (http://docs.marklogic.com).
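
As an illustration, the sketch below runs the same word query once with the "stemmed" option and once with the "unstemmed" option and returns the two result counts. It assumes a database in which both stemmed searches and word searches are enabled; the search term is arbitrary.

```xquery
xquery version "1.0-ml";

(: Sketch only: compare a stemmed and an unstemmed query for one word.
   Assumes stemmed searches and word searches are both enabled. :)
let $stemmed   := cts:search(fn:doc(), cts:word-query("run", "stemmed"))
let $unstemmed := cts:search(fn:doc(), cts:word-query("run", "unstemmed"))
return (fn:count($stemmed), fn:count($unstemmed))
```

Under those assumptions, the first count should be at least as large as the second, because the stemmed query also matches documents that contain only running, ran, or runs.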

Query terms that contain a wildcard will not be stemmed. If you leave the stemming option unspecified, the system will perform a stemmed search for any word that does not contain a wildcard. Therefore, as long as stemming is enabled in the database, you do not have to enable stemming explicitly in a query.

If stemming is turned off in the database, and stemming is explicitly specified in the query, the query will throw an error.

Stemmed Searches Versus Word Searches

The stemmed search indexes and word search (unstemmed) indexes have overlapping functionality, and there is a good chance you can get the results you want with only the stemmed search indexes enabled (that is, leaving the word search indexes turned off).

Stemmed searches return relevance-ranked results for the words you search for as well as for words with the same stem as the words you search for. Therefore, you will get the same results as with a word search plus the results for items containing words with the same stem. In most search applications, this is the desirable behavior.

The only time you need to also have word search indexes enabled is when your application requires an exact word search to only return the exact match results (that is, to not return results based on stemming).

Additionally, the stemmed search indexes take up less disk space than the word search (unstemmed) indexes. You can therefore save some disk space and decrease load time when you use the default settings of stemmed search enabled and word search turned off in the database configuration. Every index has a cost in terms of disk space used and increased load times. You have to decide based on your application requirements if the cost of creating extra indexes is worthwhile for your application, and whether you can fulfill the same requirements without some of the indexes.

If you do need to perform word (unstemmed) searches when you only have stemmed search indexes enabled (that is, when word searches are turned off in the database configuration), you must do so by first doing a stemmed search and then filtering the results with an unstemmed cts:query, as described in Unstemmed Searches.
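
A minimal sketch of that two-step pattern follows. It is one possible formulation rather than the only one: the stemmed cts:search narrows the candidates using the stemmed search indexes, and cts:contains then keeps only the documents in which the exact word appears. The search term is arbitrary.

```xquery
xquery version "1.0-ml";

(: Sketch only: emulate an unstemmed search when only the stemmed
   search indexes are enabled.
   Step 1: the stemmed search selects candidate documents.
   Step 2: cts:contains filters them with an unstemmed query. :)
for $doc in cts:search(fn:doc(), cts:word-query("run", "stemmed"))
where cts:contains($doc, cts:word-query("run", "unstemmed"))
return xdmp:node-uri($doc)
```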

Using cts:highlight or cts:contains to Find if a Word Matches a Query

Because stemming enables query matches for terms that do not have the same spelling, it can sometimes be difficult to find the words that actually caused the query to match. You can use cts:highlight to test and/or highlight the words that actually matched the query. For details on cts:highlight, see the MarkLogic XQuery and XSLT Function Reference and Highlighting Search Term Matches.
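
For example, the following sketch wraps each matching word of an in-memory element in a b element. The sample sentence and the b markup are assumptions for illustration; $cts:text is the variable that cts:highlight binds to the matched text.

```xquery
xquery version "1.0-ml";

(: Sketch only: show which words matched a stemmed query.
   Assumes stemmed searches are enabled and the content is English. :)
let $node := <p>She ran yesterday and is running again today.</p>
return
  cts:highlight($node, cts:word-query("run", "stemmed"),
    <b>{$cts:text}</b>)
```

With English stemming in effect, the words ran and running would be expected to come back wrapped in b elements, which makes the stem-based matches visible.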

You can also use cts:contains to test if a word matches the query. The cts:contains function returns true if there is a match, false if there is no match. For example, you can use the following function to test if a word has the same stem as another word.

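xquery version "1.0-ml";

(: Returns true if a search for $word2 matches the text $word1.
   With stemming enabled, that is the case when the two words
   share a stem. :)
declare function local:same-stem(
  $word1 as xs:string, $word2 as xs:string)
  as xs:boolean
{
  cts:contains(text{$word1}, $word2)
};

(: The following returns true because
   running has the same stem as run :)
local:same-stem("run", "running")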

Interaction With Wildcard Searches

For information about how stemmed searches and wildcard searches interact, see Interaction with Other Search Features.
