In MarkLogic Server, the language of the content is specified when you load the content and the language of the query is specified when you query the content. At load time, the content is tokenized, indexed, and stemmed (if enabled) based on the language specified during the load. Also, MarkLogic Server uses any languages specified at the element level in the XML markup for the content (see xml:lang Attribute), making it possible to load documents with multiple languages. Similarly, at query time, search terms are tokenized (and stemmed) based on the language specified in the cts:query expression. The result is that a query performed in one language might not yield the same results as the same query performed in another language, as both the indexes that store the information about the content and the queries against the content are language-aware.
Even if your content is entirely in a single language, MarkLogic Server is still multiple-language aware. If your content is all in a single language, and if that language is the default language for that database, and if the content does not have any language (xml:lang) attributes, and if your queries all specify (or default to) the language in which the content is loaded, then everything will behave as if there is a single language.
Because MarkLogic Server is multiple-language aware, it is important to understand the fundamental aspects of languages when loading and querying content in MarkLogic Server. The remainder of this chapter, particularly Language Aspects of Loading and Updating Documents and Querying Documents By Languages, describes these details.
To understand the language implications of querying and loading documents, you must first understand tokenization and stemming, which are both language-specific. This section describes these topics, and has the following parts:
When you search for a string (typically a word or a phrase) in MarkLogic Server, or when you load content (which is made up of text strings) into MarkLogic Server, the string is broken down into a set of parts, each of which is called a token. Each token is classified as a word, as punctuation, or as whitespace. The process of breaking down strings into tokens is called tokenization. Tokenization occurs during document loading as well as during query evaluation, and the two processes are independent of each other.
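To see how a string breaks into tokens, the following is a minimal sketch that uses the built-in cts:tokenize function and labels each token by its type. The sample string is an assumption chosen for illustration, and no output is shown because the exact tokens depend on the language configuration.

```xquery
xquery version "1.0-ml";

(: Illustrative sketch: classify each token of a sample string.
   The sample text is an assumption chosen for this example. :)
for $token in cts:tokenize("Hello, world!", "en")
return
  typeswitch ($token)
    case cts:word return fn:concat("word: ", $token)
    case cts:punctuation return fn:concat("punctuation: ", $token)
    case cts:space return fn:concat("whitespace: ", $token)
    default return fn:concat("other: ", $token)
```

Changing the second argument to a different language code is a quick way to compare how the same string tokenizes under another language.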
Tokenization is language-specific; that is, a given string is tokenized differently depending on the language in which it is tokenized. The language is determined based on the language specified at load or query time (or the database default language if no language is specified) and on any xml:lang attributes in the content (for details, see xml:lang Attribute).
If no language is specified in the cts:query expression, then it takes on the default language of the database.
let $x := <el xml:lang="zh">Chinese-text-here hello</el>
return
  $x/descendant-or-self::el[cts:contains(., cts:word-query("hello", ("stemmed", "lang=en")))]
=>
<el xml:lang="zh">Chinese-text-here hello</el>
A stemmed search for the Latin characters in a non-English language, however, will not find the non-English word stems (it will only find the non-English word itself, which stems to itself). Similarly, Asian or Middle Eastern characters will tokenize in a language appropriate to the character set, even when they occur in elements that are not in their language. The result is that searches in English sometimes match content that is labeled in an Asian or Middle Eastern character set, and vice-versa. For example, consider the following (zh is the language code for Simplified Chinese):
let $x :=
  <root>
    <el xml:lang="en">hello</el>
    <el xml:lang="fr">hello</el>
    <el xml:lang="zh">hello</el>
  </root>
return
  $x//el[cts:contains(., cts:word-query("hello", ("stemmed", "lang=en")))]
=>
<el xml:lang="en">hello</el>
<el xml:lang="zh">hello</el>
This search, even though in English, returns both the element in English and the one in Chinese. It returns the Chinese element because the word 'hello' is in Latin characters and therefore tokenizes as English, so the text in the Chinese element matches the English query (a Chinese-language query would likewise tokenize 'hello' in English).
A stemmed search for a term matches all the terms that have the same stem as the search term (which includes the exact same terms in the language specified in the query). Words that are derived from the same meaning and part of speech have the same stem (for example, 'mouse' and 'mice'). Some words can have multiple stems (if the same word can be used as a different part of speech, or if there are two words with the same spelling), and if you use advanced stemming (which can find multiple stems for a word), then stemmed searches find all of the words having the same stem as any of the stems. The purpose of stemming is to increase the recall for a search. For details about how stemming works in MarkLogic Server, including the different types of stemming available, see Understanding and Using Stemmed Searches. This section describes how the language settings affect stemmed searches.
To get the stem of a search term, you must take the language into consideration. For example, the word 'chat' is a different word in French than it is in English (in French, it is a noun meaning 'cat'; in English, it is a verb meaning to converse informally). In French, 'chatting' is not a word, and therefore it does not stem to 'chat'. But in English, 'chatting' does stem to 'chat'. Therefore, stemming is language-specific, and stemmed searches in one language might find different results than stemmed searches in another.
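One way to explore this is with the built-in cts:stem function, which returns the stems of a word for a given language. The following is a minimal sketch; the results depend on the stemming configuration and on which languages are licensed, so no expected output is shown.

```xquery
xquery version "1.0-ml";

(: Illustrative sketch: stem the same word under two language settings.
   Results vary with stemming configuration and licensed languages. :)
(
  cts:stem("chatting", "en"),
  cts:stem("chatting", "fr")
)
```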
At query time, you can specify a language (or if you do not specify a language, the default language of the database is used). This language is used when performing a stemmed search. The language specification is in the options to the cts:query expression. For example, the following cts:query expression specifies a stemmed search in French for the word 'chat', and it only matches tokens that are stemmed in French.
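A minimal sketch of such an expression, assuming the same option syntax used by the other examples in this chapter:

```xquery
xquery version "1.0-ml";

(: A stemmed word query for "chat" evaluated in French. :)
cts:word-query("chat", ("stemmed", "lang=fr"))
```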
For more details about how languages affect queries, see Querying Documents By Languages.
At load time, the specified language is used to determine in which language to stem the words in the document. For more details about the language aspects of loading documents, see Language Aspects of Loading and Updating Documents.
For details about the syntax of the various cts:query constructors, see the MarkLogic XQuery and XSLT Function Reference.
Tokenization and stemming occur when loading documents, just as they do when querying documents (for details, see Language-Specific Tokenization and Stemmed Searches in Different Languages). When loading documents, the stemmed search indexes are created based on the language. The tokenization and stemming at load time are completely independent of the tokenization and stemming at query time.
You can specify languages in XML documents at the element level by using the xml:lang attribute. MarkLogic Server uses the xml:lang attribute to determine the language with which to tokenize and stem the contents of that element. Note the following about the xml:lang attribute:
The xml:lang attribute (see http://www.w3.org/TR/2006/REC-xml-20060816/#sec-lang-tag) has some special properties, such as not needing to declare the namespace bound to the xml prefix, and that it is inherited by all children of the element (unless they explicitly have a different xml:lang value).
You can add an xml:lang attribute to the root node of a document during loading by specifying the default-language option to xdmp:document-load; without the default-language option, the root node will remain as-is.
If no xml:lang attribute is present, then the document is processed in the default language of the database into which it is loaded.
For indexing, the language specified by the xml:lang attribute only applies to stemmed search terms; the word searches (unstemmed) database configuration setting indexes terms irrespective of language. Tokenization of terms honors the xml:lang value for both the stemmed searches and word searches index settings in the database configuration.
All text node children and text node descendants of an element with an xml:lang attribute are treated as the language specified in the xml:lang attribute, unless a child element has an xml:lang attribute with a different value. If so, any text node children and text node descendants of that child are treated as the new language, and so on until no other xml:lang attributes are encountered.
The value of the xml:lang attribute must conform to the following lexical standard: http://www.ietf.org/rfc/rfc3066.txt. The following are some typical xml:lang attributes (specifying French, Simplified Chinese, and English, respectively):
xml:lang="fr"
xml:lang="zh"
xml:lang="en"
If an element has an xml:lang attribute with a value of the empty string (xml:lang=""), then any xml:lang value in effect (from some ancestor xml:lang value) is overridden for that element; its value takes on the database language default. Additionally, if a default-language option is specified during loading, any empty string xml:lang values are replaced with the language specified in the default-language option. For example, consider the following XML:
If this sample were loaded with a default-language option specifying Italian (specifying <default-language>it</default-language> in the xdmp:document-load options, for example), then the resulting document would be as follows:
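The empty-string rule can be sketched with a hypothetical sample. The element names, file path, and URI below are invented for illustration, and the result shown in the trailing comment is what the rule described above would be expected to produce, not verified output.

```xquery
xquery version "1.0-ml";

(: Hypothetical input file /tmp/wine.xml (names invented for illustration):

   <rhone xml:lang="fr">
     <wine>vin rouge</wine>
     <wine xml:lang="">red wine</wine>
   </rhone>
:)
xdmp:document-load("/tmp/wine.xml",
  <options xmlns="xdmp:document-load">
    <uri>/wine.xml</uri>
    <default-language>it</default-language>
  </options>)

(: Under the rule described above, the stored document would be expected
   to carry xml:lang="it" in place of the empty value:

   <rhone xml:lang="fr">
     <wine>vin rouge</wine>
     <wine xml:lang="it">red wine</wine>
   </rhone>
:)
```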
When you load content into MarkLogic Server, it determines how to index the content based on several factors, including the language specified during the load operation, the default language of the database, and any languages encoded into the content with xml:lang attributes. Note the following about languages with respect to loading content, updating content, and changing language settings on a database:
reindex enable is set to
Any content without an xml:lang attribute is indexed upon load or update in the database default language.
Any content with an xml:lang attribute is indexed in that language. Additionally, the xml:lang value is inherited by all of the descendants of that element, until another xml:lang value is encountered.
Full-text search queries (queries that use cts:search or cts:contains) are language-aware; that is, they search for text, tokenize the search terms, and stem (if enabled) in a particular language. This section describes how queries are language-aware and describes their behavior. It includes the following topics:
Tokenization and stemming are both language-specific; that is, a string can be tokenized and stemmed differently in different languages. For searches, the language is specified by the cts:query constructors (or by the default language of the database if a language is not specified). For more details, see Tokenization and Stemming. For nodes constructed in XQuery, any xml:lang attributes are treated the same way as if the document were loaded into a database. For details, see xml:lang Attribute.
All searches in MarkLogic Server are language-aware. You can construct searches using cts:search or cts:contains, each of which takes a cts:query expression. Each leaf-level cts:query constructor in the cts:query expression specifies a language (or defaults to a language). For details on the cts:query constructors, see Composing cts:query Expressions.
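As a minimal sketch of a composed query in which each leaf-level constructor carries its own language, consider the following. The search terms are assumptions chosen for illustration.

```xquery
xquery version "1.0-ml";

(: Illustrative sketch: each leaf-level constructor specifies its own language.
   The search terms are assumptions chosen for this example. :)
cts:search(fn:doc(),
  cts:or-query((
    cts:word-query("chat", ("stemmed", "lang=fr")),
    cts:word-query("cat", ("stemmed", "lang=en"))
  )))
```

A leaf that omits the lang option would default to the database language, as described above.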
All searches use the language setting in the cts:query constructor to determine how to tokenize the search terms. Stemmed searches also use the language setting to derive stems. Unstemmed searches use the specified language for tokenization but use the unstemmed (word searches) indexes, which are language-independent.
An unstemmed search matches terms that are exactly like the search term; it does not take into consideration the stem of the word. Unstemmed searches match terms in a language-independent way, but tokenize the search according to the specified language. Therefore, when you specify a language in an unstemmed query, the language applies only to tokenization; the unstemmed query will match any text in any language that matches the query.
Unstemmed searches with cts:search require the word search indexes; otherwise they throw an exception (this is a change of behavior from MarkLogic Server 3.1; see the Release Notes for details). You can perform unstemmed searches without word search indexes using cts:contains, however. To perform unstemmed searches without the word search indexes enabled, use a let to bind the results of a stemmed search to a variable, and then filter the results using cts:contains with an unstemmed query.
The following example demonstrates this. It binds the stemmed search to a variable, then iterates over the results of the search in a FLWOR loop, filtering out all but the unstemmed results in the where clause (using cts:contains with a cts:query that specifies the unstemmed option):
let $search := cts:search(doc(), cts:word-query("my words", ("stemmed", "lang=en")))
for $x in $search
where cts:contains($x, cts:word-query("my words", "unstemmed"))
return $x
While it is likely that everything returned by this search will have an English match to the cts:query, this does not guarantee that everything returned is in English. Because this search returns documents, it is possible for a document to contain words in another language that do not match the language-specific query, but do match the unstemmed query (if the document contains text in multiple languages, and if it has 'my words' in some other language than the one specified in the stemmed cts:query).
The word search indexes have no language information.
Unstemmed searches use the lang=<language> option to determine the language for tokenization.
Unstemmed searches match text in any language (regardless of the lang=<language> option). The language only affects how the search terms are tokenized. For example, the following unstemmed search returns true:
(: returns true :)
let $x := <el xml:lang="fr">chat</el>
return cts:contains($x, cts:word-query("chat", ("unstemmed", "lang=en")))
If the language specified in a search is not one of the languages in which language-specific stemming and tokenization are supported, or if it is a language for which you do not have a license key, then it is treated as a generic language. Typically, generic languages with Latin script are tokenized the same way as English, with token breaks at whitespace and punctuation, and with each word stemming to itself, but this is not always the case (especially for languages supported by MarkLogic Server--see Supported Languages--but for which you are not licensed). For details, see Generic Language Support.
This section lists languages with advanced stemming and tokenization support in MarkLogic Server. All of the languages except English require a license key with support for the language. If your license key does not include support for a given language, the language is treated as a generic language (see Generic Language Support). The following are the supported languages:
For a list of base collations and character sets used with each language, see Collations and Character Sets By Language.
You can load documents in any language into MarkLogic Server and query them, as long as you can convert the character encoding to UTF-8. If the language is not one of the languages with advanced support, or if the language is one for which you are not licensed, then the tokenization is performed in a generic way (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters), and each term stems to itself.
(: does not match because it was stemmed as "nn" :)
cts:search(doc(), cts:word-query("language", ("stemmed", "lang=en")))

(: does match because the query specifies "nn" as the language :)
cts:search(doc(), cts:word-query("language", ("stemmed", "lang=nn")))
For languages for which MarkLogic does not provide advanced language support (that is, languages other than those described in Supported Languages), stemmed searches treat all such languages as the same language. Therefore, a stemmed search that matches a document in one language without advanced language support will also match a document in another language without advanced language support.
Generic language support allows you to query documents in any language, regardless of which languages you are licensed for or which languages have advanced support. Because the generic language support only stems words to themselves, queries in these languages will not include variations of words based on their meanings in the results. If you desire further support than the generic language support for some particular language, contact MarkLogic Technical Support.