Search Developer's Guide — Chapter 28

Language Support in MarkLogic Server

MarkLogic Server supports loading and querying content in multiple languages. This chapter describes how languages are handled in MarkLogic Server, and includes the following sections:

Overview of Language Support in MarkLogic Server
Tokenization and Stemming
Language Aspects of Loading and Updating Documents
Querying Documents By Languages
Supported Languages
Generic Language Support

Overview of Language Support in MarkLogic Server

In MarkLogic Server, the language of the content is specified when you load the content, and the language of the query is specified when you query the content. At load time, the content is tokenized, indexed, and stemmed (if enabled) based on the language specified during the load. MarkLogic Server also uses any languages specified at the element level in the XML markup for the content (see xml:lang Attribute), making it possible to load documents with multiple languages. Similarly, at query time, search terms are tokenized (and stemmed) based on the language specified in the cts:query expression. The result is that a query performed in one language might not yield the same results as the same query performed in another language, as both the indexes that store the information about the content and the queries against the content are language-aware.

Even if your content is entirely in a single language, MarkLogic Server is still multiple-language aware. If your content is all in a single language, and if that language is the default language for that database, and if the content does not have any language (xml:lang) attributes, and if your queries all specify (or default to) the language in which the content is loaded, then everything will behave as if there is a single language.

Because MarkLogic Server is multiple-language aware, it is important to understand the fundamental aspects of languages when loading and querying content in MarkLogic Server. The remainder of this chapter, particularly Language Aspects of Loading and Updating Documents and Querying Documents By Languages, describes these details.

Tokenization and Stemming

To understand the language implications of querying and loading documents, you must first understand tokenization and stemming, which are both language-specific. This section describes these topics, and has the following parts:

Language-Specific Tokenization
Stemmed Searches in Different Languages

Language-Specific Tokenization

When you search for a string (typically a word or a phrase) in MarkLogic Server, or when you load content (which is made up of text strings) into MarkLogic Server, the string is broken down into a set of parts, each of which is called a token. Each token is classified as a word, as punctuation, or as whitespace. The process of breaking down strings into tokens is called tokenization. Tokenization occurs during document loading as well as during query evaluation, and the two processes are independent of each other.

Tokenization is language-specific; that is, a given string is tokenized differently depending on the language in which it is tokenized. The language is determined based on the language specified at load or query time (or the database default language if no language is specified) and on any xml:lang attributes in the content (for details, see xml:lang Attribute).

Note the following about the way strings are tokenized in MarkLogic Server:

The cts:tokenize API will return how text is tokenized in the specified language.

Calling xdmp:describe on a cts:tokenize expression returns the tokens and the type of each token produced from the specified string. For example:

xdmp:describe(cts:tokenize("this is, obviously, a phrase", "en"), 100)
=> (cts:word("this"), cts:space(" "), cts:word("is"),
   cts:punctuation(","), cts:space(" "), cts:word("obviously"),
   cts:punctuation(","), cts:space(" "), cts:word("a"), 
   cts:space(" "), cts:word("phrase"))

Every query has a language associated with it; if the language is not explicitly specified in the cts:query expression, then it takes on the default language of the database.
MarkLogic Server comes configured such that when an element is in an Asian or Middle Eastern language, the Latin characters tokenize as English. This allows searches to find English words inside Asian or Middle Eastern language elements. For example, a search in English can find Latin characters in a Simplified Chinese element as in the following:
```
let $x := <el xml:lang="zh">Chinese-text-here hello</el>
return
$x[cts:contains(., 
         cts:word-query("hello", ("stemmed", "lang=en")))]

=> <el xml:lang="zh">Chinese-text-here hello</el>
```
A stemmed search for the Latin characters in a non-English language, however, will not find the non-English word stems (it will only find the non-English word itself, which stems to itself). Similarly, Asian or Middle Eastern characters will tokenize in a language appropriate to the character set, even when they occur in elements that are not in their language. The result is that searches in English sometimes match content that is labeled as an Asian or Middle Eastern language, and vice versa. For example, consider the following (zh is the language code for Simplified Chinese):
```
let $x := 
<root>
 <el xml:lang="en">hello</el>
 <el xml:lang="fr">hello</el>
 <el xml:lang="zh">hello</el>
</root>
return
$x//el[cts:contains(., 
         cts:word-query("hello", ("stemmed", "lang=en")))]

=> <el xml:lang="en">hello</el>
   <el xml:lang="zh">hello</el>
```
This search, even though in English, returns both the element in English and the one in Chinese. It returns the Chinese element because the word 'hello' is in Latin characters and therefore tokenizes as English, so it matches the English query. The French element is not returned because its content is stemmed in French, not in English.
If your application has specialized tokenization requirements, you can use custom tokenizer overrides to modify how characters are grouped into tokens. For details, see Custom Tokenization.
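The token classification described in the notes above can be sketched as follows. This is a minimal illustration, not a prescribed approach: it reuses the sample string from the earlier cts:tokenize example, and the element names in the result (word, punctuation, space, other) are made up for display purposes.

```xquery
xquery version "1.0-ml";

(: Classify each token that cts:tokenize produces for an English string.
   The wrapper element names are hypothetical and exist only for display. :)
for $token in cts:tokenize("this is, obviously, a phrase", "en")
return
  typeswitch ($token)
    case cts:word return <word>{fn:string($token)}</word>
    case cts:punctuation return <punctuation>{fn:string($token)}</punctuation>
    case cts:space return <space>{fn:string($token)}</space>
    default return <other>{fn:string($token)}</other>
```

Changing the second argument to another language code shows how the same string tokenizes in that language.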

Stemmed Searches in Different Languages

A stemmed search for a term matches all the terms that have the same stem as the search term (which includes the exact same terms in the language specified in the query). Words that are derived from the same meaning and part of speech have the same stem (for example, 'mouse' and 'mice'). Some words can have multiple stems (if the same word can be used as a different part of speech, or if there are two words with the same spelling), and if you use advanced stemming (which can find multiple stems for a word), then stemmed searches find all of the words having the same stem as any of the stems. The purpose of stemming is to increase the recall for a search. For details about how stemming works in MarkLogic Server, including the different types of stemming available, see Understanding and Using Stemmed Searches. This section describes how the language settings affect stemmed searches.

To get the stem of a search term, you must take the language into consideration. For example, the word 'chat' is a different word in French than it is in English (in French, it is a noun meaning 'cat'; in English, it is a verb meaning to converse informally). In French, 'chatting' is not a word, and therefore it does not stem to 'chat'. But in English, 'chatting' does stem to 'chat'. Therefore, stemming is language-specific, and stemmed searches in one language might find different results than stemmed searches in another.
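One way to explore this is with the cts:stem function, which takes a word and an optional language code and returns its stem or stems. The following is a minimal sketch. The stems element and its attributes are hypothetical wrappers for display, and the French results depend on whether your license includes French; without it, French is treated as a generic language and each word stems to itself.

```xquery
xquery version "1.0-ml";

(: Compare the stems of the same words in English and French.
   cts:stem returns a sequence, because a word can have more than one stem. :)
for $lang in ("en", "fr")
for $word in ("chat", "chatting")
return
  <stems word="{$word}" lang="{$lang}">{
    for $stem in cts:stem($word, $lang)
    return <stem>{$stem}</stem>
  }</stems>
```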

At query time, you can specify a language (or if you do not specify a language, the default language of the database is used). This language is used when performing a stemmed search. The language specification is in the options to the cts:query expression. For example, the following cts:query expression specifies a stemmed search in French for the word 'chat', and it only matches tokens that are stemmed in French.

cts:word-query("chat", ("stemmed", "lang=fr"))

For more details about how languages affect queries, see Querying Documents By Languages.

At load time, the specified language is used to determine in which language to stem the words in the document. For more details about the language aspects of loading documents, see Language Aspects of Loading and Updating Documents.

For details about the syntax of the various cts:query constructors, see the MarkLogic XQuery and XSLT Function Reference.

Language Aspects of Loading and Updating Documents

This section describes the impact of languages on loading and updating documents, and includes the following sections:

Tokenization and Stemming
xml:lang Attribute
Language-Related Notes About Loading and Updating Documents

Tokenization and Stemming

Tokenization and stemming occur when loading documents, just as they do when querying documents (for details, see Language-Specific Tokenization and Stemmed Searches in Different Languages). When loading documents, the stemmed search indexes are created based on the language. The tokenization and stemming at load time are completely independent of the tokenization and stemming at query time.

xml:lang Attribute

You can specify languages in XML documents at the element level by using the xml:lang attribute. MarkLogic Server uses the xml:lang attribute to determine the language with which to tokenize and stem the contents of that element. Note the following about the xml:lang attribute:

The xml:lang attribute (see http://www.w3.org/TR/2006/REC-xml-20060816/#sec-lang-tag) has some special properties such as not needing to declare the namespace bound to the xml prefix, and that it is inherited by all children of the element (unless they explicitly have a different xml:lang value).
You can explicitly add an xml:lang attribute to the root node of a document during loading by specifying the default-language option to xdmp:document-load; without the default-language option, the root node will remain as-is.
If no xml:lang attribute is present, then the document is processed in the default language of the database into which it is loaded.
For the purpose of indexing terms, the language specified by the xml:lang attribute only applies to stemmed search terms; the word searches (unstemmed) database configuration setting indexes terms irrespective of language. Tokenization of terms honors the xml:lang value for both stemmed searches and word searches index settings in the database configuration.
All of the text node children and text node descendants of an element with an xml:lang attribute are treated as the language specified in the xml:lang attribute, unless a child element has an xml:lang attribute with a different value. If so, any text node children and text node descendants are treated as the new language, and so on until no other xml:lang attributes are encountered.
The value of the xml:lang attribute must conform to the following lexical standard: http://www.ietf.org/rfc/rfc3066.txt. The following are some typical xml:lang attributes (specifying French, Simplified Chinese, and English, respectively):
```
xml:lang="fr"
xml:lang="zh"
xml:lang="en"
```
If an element has an xml:lang attribute with a value of the empty string (xml:lang=""), then any xml:lang value in effect (from some ancestor xml:lang value) is overridden for that element; its value takes on the database language default. Additionally, if a default-language option is specified during loading, any empty string xml:lang values are replaced with the language specified in the default-language option. For example, consider the following XML:
```
<rhone xml:lang="fr">
   <wine>vin rouge</wine>
   <wine xml:lang="">red wine</wine>
</rhone>
```
In this sample, the phrase 'vin rouge' is treated as French, and the phrase 'red wine' is treated in the default language for the database (English by default).
If this sample were loaded with a default-language option specifying Italian (for example, by specifying <default-language>it</default-language> in the xdmp:document-load options), then the resulting document would be as follows:
```
<rhone xml:lang="fr">
   <wine>vin rouge</wine>
   <wine xml:lang="it">red wine</wine>
</rhone>
```
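As a rough sketch of such a load, the following passes the default-language option to xdmp:document-load. The file system path and the document URI are hypothetical placeholders; substitute values that exist in your environment.

```xquery
xquery version "1.0-ml";

(: Load a file and set Italian as the default language for the load.
   The path "/space/docs/rhone.xml" and the URI "/wines/rhone.xml"
   are assumptions for illustration only. :)
xdmp:document-load("/space/docs/rhone.xml",
  <options xmlns="xdmp:document-load">
    <uri>/wines/rhone.xml</uri>
    <default-language>it</default-language>
  </options>)
```

Because the load is an update, inspect the stored xml:lang attributes in a separate request, for example with fn:doc("/wines/rhone.xml")//@xml:lang.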

Language-Related Notes About Loading and Updating Documents

When you load content into MarkLogic Server, it determines how to index the content based on several factors, including the language specified during the load operation, the default language of the database, and any languages encoded into the content with xml:lang attributes. Note the following about languages with respect to loading content, updating content, and changing language settings on a database:

Changing the default language starts a reindex operation if the reindexer enable setting on the database is set to true.
Documents with no xml:lang attribute are indexed upon load or update in the database default language.
Any content within an element having an xml:lang attribute is indexed in that language. Additionally, the xml:lang value is inherited by all of the descendants of that element, until another xml:lang value is encountered.
MarkLogic Server comes configured such that when an element is in an Asian or Middle Eastern language, the Latin characters tokenize as English. Therefore, a document with Latin characters in a non-English language will create stemmed index terms in English for those Latin characters. Similarly, Asian or Middle Eastern characters will tokenize in their respective languages, even in elements that are not in their language.
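Before changing language settings, it can help to check what the database is currently configured to do. The following sketch uses the Admin API to read the default language and the reindexer setting of the database the query runs against. It assumes the user running it has sufficient privileges to read the configuration.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Read the default language and the reindexer setting
   for the current database. :)
let $config := admin:get-configuration()
let $db := xdmp:database()
return (
  admin:database-get-language($config, $db),
  admin:database-get-reindexer-enable($config, $db)
)
```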

Querying Documents By Languages

Full-text search queries (queries that use cts:search or cts:contains) are language-aware; that is, they search for text, tokenize the search terms, and stem (if enabled) in a particular language. This section describes how queries are language-aware and describes their behavior. It includes the following topics:

Tokenization, Stemming, and the xml:lang Attribute
Language-Aware Searches
Unstemmed Searches
Unknown Languages

Tokenization, Stemming, and the xml:lang Attribute

Tokenization and stemming are both language-specific; that is, a string can be tokenized and stemmed differently in different languages. For searches, the language is specified by the cts:query constructors (or by the default language of the database if a language is not specified). For more details, see Tokenization and Stemming. For nodes constructed in XQuery, any xml:lang attributes are treated the same way as if the document were loaded into a database. For details, see xml:lang Attribute.
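The following sketch illustrates this with a constructed node. The element names a, b, and result are invented for the example. The first b element inherits French from its parent, and the second overrides it with English. Following the 'chat' example elsewhere in this chapter, you would expect the English stemmed query to match only the element labeled as English, although the outcome depends on your stemming and license configuration.

```xquery
xquery version "1.0-ml";

(: Report, for each child, the language in effect and whether
   an English stemmed query matches it. :)
let $x :=
  <a xml:lang="fr">
    <b>chat</b>
    <b xml:lang="en">chat</b>
  </a>
for $b in $x/b
(: The nearest ancestor-or-self xml:lang attribute is the one in effect. :)
let $lang := fn:string($b/ancestor-or-self::*[@xml:lang][1]/@xml:lang)
let $matches :=
  cts:contains($b, cts:word-query("chat", ("stemmed", "lang=en")))
return <result lang="{$lang}" matches-en="{$matches}">{fn:string($b)}</result>
```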

Language-Aware Searches

All searches in MarkLogic Server are language-aware. You can construct searches using cts:search or cts:contains, each of which takes a cts:query expression. Each leaf-level cts:query constructor in the cts:query expression specifies a language (or defaults to a language). For details on the cts:query constructors, see Composing cts:query Expressions.

All searches use the language setting in the cts:query constructor to determine how to tokenize the search terms. Stemmed searches also use the language setting to derive stems. Unstemmed searches use the specified language for tokenization but use the unstemmed (word searches) indexes, which are language-independent.
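Because the language is set on each leaf-level constructor, a single cts:query expression can mix languages. The following minimal sketch searches for documents that match the word 'chat' stemmed in either English or French. It assumes stemmed searches are enabled on the database, and the French leaf behaves as described only if your license includes French.

```xquery
xquery version "1.0-ml";

(: Each leaf query carries its own language option;
   the or-query matches documents satisfying either one. :)
cts:search(fn:doc(),
  cts:or-query((
    cts:word-query("chat", ("stemmed", "lang=en")),
    cts:word-query("chat", ("stemmed", "lang=fr"))
  )))
```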

Unstemmed Searches

An unstemmed search matches terms that are exactly like the search term; it does not take into consideration the stem of the word. Unstemmed searches match terms in a language-independent way, but tokenize the search terms according to the specified language. Therefore, when you specify a language in an unstemmed query, the language applies only to tokenization; the unstemmed query will match any text in any language that matches the query.

Note the following characteristics of unstemmed searches:

Unstemmed searches require word search indexes; otherwise they throw an exception (this is a change of behavior from MarkLogic Server 3.1; see the Release Notes for details). You can perform unstemmed searches without word search indexes using cts:contains, however. To do so, use a let to bind the results of a stemmed search to a variable, and then filter the results using cts:contains with an unstemmed query.
The following example demonstrates this. It binds the stemmed search results to a variable, then iterates over the results, filtering out all but the unstemmed results in the where clause (using cts:contains with a cts:query that specifies the unstemmed option).
```
let $search := cts:search(doc(), cts:word-query("my words", 
                                  ("stemmed", "lang=en")))
for $x in $search
where cts:contains($x, cts:word-query("my words", "unstemmed"))
return $x
```
While it is likely that everything returned by this search will have an English match to the cts:query, it does not necessarily guarantee that everything returned is in English. Because this search returns documents, it is possible for a document to contain words in another language that do not match the language-specific query, but do match the unstemmed query (if the document contains text in multiple languages, and if it has 'my words' in some other language than the one specified in the stemmed cts:query).
The word search indexes have no language information.
Unstemmed searches use the lang=<language> option to determine the language for tokenization.
Unstemmed searches search all content, regardless of language (and regardless of lang=<language> option). The language only affects how the search terms are tokenized. For example, the following unstemmed search returns true:
```
(: returns true :)
let $x := <el xml:lang="fr">chat</el>
return
cts:contains($x, cts:word-query("chat", ("unstemmed", "lang=en")))
```
whereas the following stemmed search returns false:
```
(: returns false :)
let $x := <el xml:lang="fr">chat</el>
return
cts:contains($x, cts:word-query("chat", ("stemmed", "lang=en")))
```

Unknown Languages

If the language specified in a search is not one of the languages in which language-specific stemming and tokenization are supported, or if it is a language for which you do not have a license key, then it is treated as a generic language. Typically, generic languages with Latin script are tokenized the same way as English, with token breaks at whitespace and punctuation, and with each word stemming to itself, but this is not always the case (especially for languages supported by MarkLogic Server--see Supported Languages--but for which you are not licensed). For details, see Generic Language Support.

Supported Languages

This section lists languages with advanced stemming and tokenization support in MarkLogic Server. All of the languages except English require a license key with support for the language. If your license key does not include support for a given language, the language is treated as a generic language (see Generic Language Support). The following are the supported languages:

English
French
Italian
German
Russian
Spanish
Arabic
Chinese (Simplified and Traditional)
Korean
Persian (Farsi)
Dutch
Japanese
Portuguese
Norwegian (Nynorsk and Bokmål)
Swedish

For a list of base collations and character sets used with each language, see Collations and Character Sets By Language.

Generic Language Support

You can load documents in any language into MarkLogic Server and query them, as long as you can convert the character encoding to UTF-8. If the language is not one of the languages with advanced support, or if the language is one for which you are not licensed, then the tokenization is performed in a generic way (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters), and each term stems to itself.

For example, if you load the following document:

<doc xml:lang="nn">
  <a>Some text in any language here.</a>
</doc>

then that document is loaded as the language nn, and a stemmed search in any other language would not match. Therefore, the following does not match the document:

(: does not match because it was stemmed as "nn" :)
cts:search(doc(), cts:word-query("language", ("stemmed", "lang=en")))

and the following search does match the document:

(: does match because the query specifies "nn" as the language :)
cts:search(doc(), cts:word-query("language", ("stemmed", "lang=nn")))

Languages for which MarkLogic does not provide advanced language support (that is, languages other than those listed in Supported Languages) are all treated as the same language for stemmed searches. Therefore, a stemmed search that matches a document in one language without advanced language support will also match a document in another language without advanced language support.

Generic language support allows you to query documents in any language, regardless of which languages you are licensed for or which languages have advanced support. Because generic language support only stems words to themselves, queries in these languages will not return variations of the search words in the results. If you need more than generic language support for a particular language, contact MarkLogic Technical Support.
