MarkLogic Server 11.0 Product Documentation
Search Developer's Guide
— Chapter 28

Language Support in MarkLogic Server

MarkLogic Server supports loading and querying content in multiple languages. This chapter describes how languages are handled in MarkLogic Server, and includes the following sections:

Overview of Language Support in MarkLogic Server

In MarkLogic Server, the language of the content is specified when you load the content and the language of the query is specified when you query the content. At load-time, the content is tokenized, indexed, and stemmed (if enabled) based on the language specified during the load. Also, MarkLogic Server uses any languages specified at the element level in the XML markup for the content (see xml:lang Attribute), making it possible to load documents with multiple languages. In a JSON document, the language or lang properties are used for the same purpose.

Similarly, at query time, search terms are tokenized (and stemmed) based on the language specified in the cts:query expression. The result is that a query performed in one language might not yield the same results as the same query performed in another language, as both the indexes that store the information about the content and the queries against the content are language-aware.

Even if your content is entirely in a single language, MarkLogic Server is still multiple-language aware. For MarkLogic to behave as if there is only a single language, all the following must be true:

  • Your content is all in a single language.
  • That language is the default language for that database.
  • The XML content doesn't include any xml:lang attributes.
  • The JSON content doesn't include any language or lang properties.
  • Your queries all explicitly specify (or default to) the language in which the content is loaded.

Because MarkLogic Server is multiple-language aware, it is important to understand the fundamental aspects of languages when loading and querying content in MarkLogic Server. The remainder of this chapter describes these details, particularly the following topics:

Tokenization and Stemming

To understand the language implications of querying and loading documents, you must first understand tokenization and stemming, which are both language-specific. This section describes these topics, and has the following parts:

Language-Specific Tokenization

When you search for a string (typically a word or a phrase) in MarkLogic Server, or when you load content (which is made up of text strings) into MarkLogic Server, the string is split into parts, each of which is called a token. Each token is classified as a word, punctuation, or whitespace. The process of breaking down strings into tokens is called tokenization.

Tokenization occurs during document loading as well as during query evaluation. The two processes are independent of each other. The tokenization of documents during loading affects indexing. The tokenization of query text affects how search terms are resolved. Though the processes are independent, they use the same tokenizer (for a given language).

Tokenization is language-specific; that is, a given string is tokenized differently depending on the language in which it is tokenized. The language is determined based on the language specified at load or query time (or the database default language if no language is specified) and on any xml:lang attributes in the XML content (for details, see xml:lang Attribute). For JSON content, the language or lang properties determine language-specific tokenization (for details, see Language Support in JSON).

Note the following about the way strings are tokenized in MarkLogic Server:

  • You can use the cts:tokenize XQuery function or the cts.tokenize Server-Side JavaScript function to see how text is tokenized for a given language.
  • If you wrap a call to xdmp:describe around a call to cts:tokenize in XQuery, you can review both the tokens and their classification. The same applies if you wrap xdmp.describe around a call to cts.tokenize in JavaScript. For example:
    xdmp:describe(cts:tokenize("this is, obviously, a phrase", "en"), 100)
    => (cts:word("this"), cts:space(" "), cts:word("is"),
       cts:punctuation(","), cts:space(" "), cts:word("obviously"),
       cts:punctuation(","), cts:space(" "), cts:word("a"), 
       cts:space(" "), cts:word("phrase"))
  • Every query has a language associated with it; if the language is not explicitly specified in the cts:query expression, then it takes on the default language of the database.
  • MarkLogic Server comes configured such that when an element is in an Asian or Middle Eastern language, the Latin characters tokenize as English. This allows searches to find English words inside Asian or Middle Eastern language elements. For example, a search in English can find Latin characters in a Simplified Chinese element as in the following:
    let $x := <el xml:lang="zh">Chinese-text-here hello</el>
    return
    $x[cts:contains(., 
             cts:word-query("hello", ("stemmed", "lang=en")))]
    
    => <el xml:lang="zh">Chinese-text-here hello</el>

    A stemmed search for the Latin characters in a non-English language, however, will not find the non-English word stems (it will only find the non-English word itself, which stems to itself). Similarly, Asian or Middle Eastern characters will tokenize in a language appropriate to the character set, even when they occur in elements that are not in their language. The result is that searches in English sometimes match content that is labeled in an Asian or Middle Eastern character set, and vice-versa. For example, consider the following (zh is the language code for Simplified Chinese):

    let $x := 
    <root>
     <el xml:lang="en">hello</el>
     <el xml:lang="fr">hello</el>
     <el xml:lang="zh">hello</el>
    </root>
    return
    $x//el[cts:contains(., 
             cts:word-query("hello", ("stemmed", "lang=en")))]
    
    => <el xml:lang="en">hello</el>
       <el xml:lang="zh">hello</el>

    This search, even though in English, returns both the element in English and the one in Chinese. It returns the Chinese element because the word hello is in Latin characters and therefore tokenizes as English, so it matches the English query. The French element is not returned because its content is stemmed as French, not English.

  • If your application has specialized tokenization requirements, you can use custom tokenizer overrides or a custom lexer to modify how characters are grouped into tokens. For details, see Custom Tokenization.
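
As a rough illustration of the language-specific tokenization described above, the following sketch tokenizes one string under two language codes so that the classifications can be compared side by side. The sample string is an assumption chosen for illustration. The result depends on which languages your license covers (an unlicensed language falls back to generic tokenization), so no output is shown.

```xquery
xquery version "1.0-ml";

(: Tokenize the same text as English and as French and describe each
   token sequence. The sample string is illustrative only. :)
let $text := "l'heure d'été, bien sûr"
for $lang in ("en", "fr")
return
  <tokens lang="{$lang}">{
    xdmp:describe(cts:tokenize($text, $lang), 100)
  }</tokens>
```

Wrapping each result in an element labeled with the language code is only a convenience for reading the output in Query Console.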

Stemmed Searches in Different Languages

A stemmed search for a term matches all the terms that have the same stem as the search term (which includes the exact same terms in the language specified in the query). The purpose of stemming is to increase the recall for a search. For details about how stemming works in MarkLogic Server, including the different types of stemming available, see Understanding and Using Stemmed Searches. This section describes how the language settings affect stemmed searches.

Words derived from the same meaning and part of speech have the same stem (for example, mouse and mice). A word can have multiple stems if the word can be used as multiple parts of speech (for example, play can be both a noun and a verb in English), or if there are two words with the same spelling. Advanced stemming finds multiple stems for a word. If you enable advanced stemming, stemmed searches find all of the words that share a stem with any of the search term's stems.

Stemming is a language-specific operation. For example, the word chat is a different word in French than it is in English. In French, chat is a noun meaning cat, while in English, it is a verb. In French, chatting is not a word, and therefore it does not stem to chat. But in English, chatting does stem to chat. Therefore, stemmed searches in one language might find different results than stemmed searches in another.
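
One way to see this language dependence is to ask for the stems of the same word under two language codes. The following is a minimal sketch using cts:stem with a text argument and a language argument. The stems returned depend on the stemming level in effect and on the languages you are licensed for, so no output is shown.

```xquery
xquery version "1.0-ml";

(: Return the stem(s) of the same word under two language codes. :)
for $lang in ("en", "fr")
return
  <stems word="chatting" lang="{$lang}">{
    cts:stem("chatting", $lang)
  }</stems>
```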

When you construct a query, you can specify a language to use for stemmed search. For example, the following cts:query expression specifies a stemmed search in French for the word chat, and it only matches tokens that are stemmed in French.

cts:word-query("chat", ("stemmed", "lang=fr"))

For more details about how languages affect queries, see Querying Documents By Languages.

At load time, the specified language is used to determine in which language to stem the words in the document. For more details about the language aspects of loading documents, see Language Aspects of Loading and Updating Documents.

For details about the syntax of the various cts:query constructors, see the MarkLogic XQuery and XSLT Function Reference.

Language Aspects of Loading and Updating Documents

This section describes the impact of languages on loading and updating documents, and includes the following sections:

Tokenization and Stemming

Tokenization and stemming occur when loading documents, just as they do when querying documents (for details, see Language-Specific Tokenization and Stemmed Searches in Different Languages). When loading documents, the stemmed search indexes are created based on the language. Tokenization and stemming at load time are completely independent of tokenization and stemming at query time.

xml:lang Attribute

You can specify languages in XML documents at the element level by using the xml:lang attribute. MarkLogic Server uses the xml:lang attribute to determine the language with which to tokenize and stem the contents of that element. Note the following about the xml:lang attribute:

  • The xml:lang attribute (see https://www.w3.org/TR/xml/#sec-lang-tag) has some special properties such as not needing to declare the namespace bound to the xml prefix, and that it is inherited by all children of the element (unless they explicitly have a different xml:lang value).
  • You can explicitly add an xml:lang attribute to the root node of an XML document during loading by specifying the default-language option to xdmp:document-load; without the default-language option, the root node will remain as-is.
  • If no xml:lang attribute is present, then the document is processed in the default language of the database into which it is loaded.
  • For the purpose of indexing terms, the language specified by the xml:lang attribute only applies to stemmed search terms; the word searches (unstemmed) database configuration setting indexes terms irrespective of language. Tokenization of terms honors the xml:lang value for both stemmed searches and word searches index settings in the database configuration.
  • All of the text node children and text node descendants of an element with an xml:lang attribute are treated as the language specified in the xml:lang attribute, unless a child element has an xml:lang attribute with a different value. If so, any text node children and text node descendants are treated as the new language, and so on until no other xml:lang attributes are encountered.
  • The value of the xml:lang attribute must conform to the following lexical standard: http://www.ietf.org/rfc/rfc3066.txt. The following are some typical xml:lang attributes (specifying French, Simplified Chinese, and English, respectively):
    xml:lang="fr"
    xml:lang="zh"
    xml:lang="en"
  • If an element has an xml:lang attribute with a value of the empty string (xml:lang=""), then any xml:lang value in effect (from some ancestor xml:lang value) is overridden for that element; its value takes on the database language default. Additionally, if a default-language option is specified during loading, any empty string xml:lang values are replaced with the language specified in the default-language option. For example, consider the following XML:
    <rhone xml:lang="fr">
       <wine>vin rouge</wine>
       <wine xml:lang="">red wine</wine>
    </rhone>

    In this sample, the phrase vin rouge is treated as French, and the phrase red wine is treated in the default language for the database (English by default).

    If this sample was loaded with a default-language option specifying Italian (specifying <default-language>it</default-language> for the xdmp:document-load option, for example), then the resulting document would be as follows:

    <rhone xml:lang="fr">
       <wine>vin rouge</wine>
       <wine xml:lang="it">red wine</wine>
    </rhone>
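
A load of this kind might look like the following sketch, which passes the default-language option to xdmp:document-load. The file system path and the document URI are hypothetical placeholders.

```xquery
xquery version "1.0-ml";

(: Load a file with Italian as the default language, so that empty
   xml:lang values are replaced with "it" as described above.
   The file path and document URI are hypothetical. :)
xdmp:document-load("/tmp/rhone.xml",
  <options xmlns="xdmp:document-load">
    <uri>/wines/rhone.xml</uri>
    <default-language>it</default-language>
  </options>)
```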

Language-Related Notes About Loading and Updating Documents

When you load content into MarkLogic Server, it determines how to index the content based on several factors, including the language specified during the load operation, the default language of the database, and any languages encoded into the XML content with xml:lang attributes, or into the JSON content with language or lang properties. Note the following about languages with respect to loading content, updating content, and changing language settings on a database:

  • Changing the default language starts a reindex operation if reindex enable is set to true.
  • XML documents with no xml:lang attribute are indexed upon load or update in the database default language.
  • JSON documents with no language or lang properties are indexed upon load or update in the database default language.
  • Any XML content within an element having an xml:lang attribute is indexed in that language. Additionally, the xml:lang value is inherited by all of the descendants of that element, until another xml:lang value is encountered.
  • Any JSON content within a scope that contains a language or lang property is indexed in that language. Additionally, the language or lang property is inherited by all of the descendants of that scope, until another language or lang property is encountered.
  • MarkLogic Server comes configured such that when an element is in an Asian or Middle Eastern language, the Latin characters tokenize as English. Therefore, a document with Latin characters in a non-English language will create stemmed index terms in English for those Latin characters. Similarly, Asian or Middle Eastern characters will tokenize in their respective languages, even in elements that are not in their language.

Protecting JSON Files That Should not be Stemmed

The special zxx language code, which means no natural language present, allows users to protect their own configuration files (or other documents, elements, or properties that contain no human-readable content) from customized and plugin tokenizers, as well as from stemmers. In the absence of this language code, text is always processed using the default database language. Content marked zxx also processes faster, because it is never stemmed and uses only a simple tokenizer.
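
For example, a JSON configuration document could be marked this way by giving it a lang property with the value zxx. The following is a sketch only: the document URI and every property other than lang are hypothetical.

```xquery
xquery version "1.0-ml";

(: Insert a JSON settings document marked as containing no natural
   language. The URI and the logLevel/cacheDir properties are
   hypothetical; only the "lang": "zxx" property carries meaning here. :)
xdmp:document-insert("/config/app-settings.json",
  object-node {
    "lang": "zxx",
    "logLevel": "debug",
    "cacheDir": "/var/cache/app"
  })
```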

Querying Documents By Languages

Full-text search queries (queries that use cts:search or cts:contains) are language-aware; that is, they search for text, tokenize the search terms, and stem (if enabled) in a particular language. This section describes how queries are language-aware and describes their behavior. It includes the following topics:

Tokenization, Stemming, and the xml:lang Attribute

Tokenization and stemming are both language-specific; that is, a string can be tokenized and stemmed differently in different languages. By default, a query uses the default language of the database. You can also specify a language when constructing a query. For more details, see Tokenization and Stemming.

For XML nodes constructed in XQuery, any xml:lang attributes are treated the same way as if the document were loaded into a database. For details, see xml:lang Attribute.

Constructed JSON nodes that contain the language or lang properties are indexed in that language. If neither of these properties is present, then they use the default language configured for the database.
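
The following sketch shows one way to try this with an in-memory JSON node. It constructs an object whose lang property is fr and tests it with a stemmed French word query. The title property and its value are illustrative assumptions, and no result is shown because it depends on your language licensing and configuration.

```xquery
xquery version "1.0-ml";

(: An in-memory JSON node labeled as French via its "lang" property.
   The property name "title" and its value are illustrative. :)
let $node := object-node { "lang": "fr", "title": "chat" }
return
  cts:contains($node, cts:word-query("chat", ("stemmed", "lang=fr")))
```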

Language-Aware Searches

All searches in MarkLogic Server are language-aware. You can specify a language when constructing a query. For example, most cts:query constructors accept a language option. If the language is not explicitly specified, MarkLogic uses the default language configured for the database. For details on the cts:query constructors, see Composing cts:query Expressions.

The language governing a query determines how to tokenize the search terms, regardless of whether stemmed search is enabled. If stemmed search is enabled, the language is also used to derive stems. Unstemmed searches use the unstemmed (word searches) indexes, which are language independent.

Unstemmed Searches

An unstemmed search matches terms that are exactly like the search term; it does not take into consideration the stem of the word. Unstemmed searches match terms in a language independent way, but tokenize the search according to the specified language. Therefore, when you specify a language in an unstemmed query, the language applies only to tokenization; the unstemmed query will match any text in any language that matches the query.

Note the following characteristics of unstemmed searches:

  • Unstemmed searches require word search indexes, otherwise they throw an exception. However, you can perform unstemmed searches without word search indexes using cts:contains. To perform unstemmed searches without the word search indexes enabled, use a let to bind the results of a stemmed search to a variable, and then filter the results using cts:contains with an unstemmed query.

    The following example binds the stemmed search results to a variable, then iterates over the results, filtering out all but the unstemmed results in the where clause (using cts:contains with a cts:query that specifies the unstemmed option).

    let $search := cts:search(doc(), cts:word-query("my words", 
                                      ("stemmed", "lang=en")))
    for $x in $search
    where cts:contains($x, cts:word-query("my words", "unstemmed"))
    return $x

    While it is likely that everything returned by this search will have an English match to the cts:query, it is not guaranteed that everything returned is in English. It is possible for a document to contain words in another language that do not match the language-specific query, but do match the unstemmed query (if the document contains text in multiple languages, and if it has my words in some other language than the one specified in the stemmed cts:query).

  • The word search indexes are language-agnostic.
  • Unstemmed searches use the lang=<language> query constructor option to determine the language for tokenization.
  • Unstemmed searches search all content, regardless of language (and regardless of lang=<language> option). The language only affects how the search terms are tokenized. For example, the following unstemmed search returns true:
    (: returns true :)
    let $x := <el xml:lang="fr">chat</el>
    return
    cts:contains($x, cts:word-query("chat", ("unstemmed", "lang=en")))

    whereas the following stemmed search returns false:

    (: returns false :)
    let $x := <el xml:lang="fr">chat</el>
    return
    cts:contains($x, cts:word-query("chat", ("stemmed", "lang=en")))

Unknown Languages

If the language specified in a search is not one of the languages in which language-specific stemming and tokenization are supported, or if it is a language for which you do not have a license key, then it is treated as a generic language. Typically, generic languages with Latin script are tokenized the same way as English, with token breaks at whitespace and punctuation, and with each word stemming to itself, but this is not always the case (especially for languages supported by MarkLogic Server--see Supported Languages--but for which you are not licensed). For details, see Generic Language Support.

You can implement a custom lexer (for tokenization) and stemmer if the default behavior for an unsupported language does not meet the needs of your application. For details, see User-Defined Lexer Plugins and Using a User-Defined Stemmer Plugin.

Supported Languages

This section lists languages with advanced stemming and tokenization support in MarkLogic Server. All of the languages except English require a license key with support for the language. If your license key does not include support for a given language, the language is treated as a generic language (see Generic Language Support). The following are the supported languages:

  • English
  • French
  • Italian
  • German
  • Russian
  • Spanish
  • Arabic
  • Chinese (Simplified and Traditional)
  • Korean
  • Persian (Farsi)
  • Dutch
  • Japanese
  • Portuguese
  • Norwegian (Nynorsk and Bokmål)
  • Swedish

For a list of base collations and character sets used with each language, see Collations and Character Sets By Language.

Generic Language Support

You can load documents in any language into MarkLogic Server and query them, as long as you can convert the character encoding to UTF-8. If the language is not one of the languages with advanced support, or if the language is one for which you are not licensed, then the tokenization is performed in a generic way (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters), and each term stems to itself.

For example, if you load the following document:

<doc xml:lang="cz">
  <a>Some text in any language here.</a>
</doc>

then that document is loaded as the language cz, and a stemmed search in any other language would not match. Therefore, the following does not match the document:

(: does not match because it was stemmed as "cz" :)
cts:search(doc(), cts:word-query("language", ("stemmed", "lang=en")))

The following search does match the document because it uses the same language:

(: does match because the query specifies "cz" as the language :)
cts:search(doc(), cts:word-query("language", ("stemmed", "lang=cz")))

Generic language support enables you to query documents in any language, regardless of which languages you are licensed for or which languages have advanced support. Because the generic language support only stems words to themselves, queries in these languages will not include variations of words based on their meanings in the results.

If you need more than the generic language support for some unsupported language, you can create a custom lexer plugin, a custom stemmer plugin, or both to enable language-specific handling. For details, see Stemming and Tokenization Customization.

Stemming and Tokenization Customization

This section summarizes the features available to you for customizing the stemming and tokenization processes. You can use these features separately or together.

Tokenization Customization

With no customizations, each language has a default lexer and default tokenization dictionary associated with it. The default lexer is one of the built-in lexers described in Built-in Lexer Plugin Reference and varies by language.

You can use the following tools to customize tokenization. You can use these features singly or in combination.

  • Define tokenizer overrides. Overrides can affect whether a codepoint is classified as a word, punctuation or whitespace character. Overrides are applied independent of the configured lexer. For details, see Custom Tokenizer Overrides.
  • Custom tokenization dictionary. You can install a custom dictionary to influence how text is tokenized. You configure custom dictionaries per language. For more details, see Custom Dictionaries for Tokenizing and Stemming.
  • Custom lexer. You can use one of the built-in lexer plugins that come with MarkLogic, or create a user-defined lexer plugin using the marklogic::LexerUDF native C++ interfaces. You associate a custom lexer with a specific language. For details, see User-Defined Lexer Plugins and Configuring Tokenization and Stemming Plugins.

You can use tokenization customizations in conjunction with stemming customizations. For details, see Stemming Customization.

Tokenization is a trusted operation. You should be selective about which users can register user-defined lexer plugins and customize language configurations.

Stemming Customization

With no customizations, each language has a base stemmer and stemming dictionary associated with it. The default stemmer is one of the built-in stemmer plugins that come with MarkLogic, and varies by language. For details, see Built-in Stemmer Plugin Reference.

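You can use the following tools to customize stemming. You can use these customizations singly or in combination.

  • Custom stemming dictionary. You can install a custom dictionary to influence how words are stemmed. You configure custom dictionaries per language. For more details, see Custom Dictionaries for Tokenizing and Stemming.
  • Custom stemmer. You can use one of the built-in stemmer plugins that come with MarkLogic, or create a user-defined stemmer plugin. You associate a custom stemmer with a specific language. For details, see Using a User-Defined Stemmer Plugin and Configuring Tokenization and Stemming Plugins.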

You can use stemming customizations in conjunction with tokenization customizations. For details, see Tokenization Customization.

Stemming is a trusted operation. You should be selective about which users can register user-defined stemming plugins and customize language configurations.

Configuring Tokenization and Stemming Plugins

One way you can affect the results of tokenization and stemming is to configure a custom lexer or stemmer plugin for a language. Your customization can use either a built-in or user-defined plugin.

This section provides an overview of how to configure a custom lexer or stemmer for a language using the Custom Language Management library module.

For more information on creating user-defined lexer and stemmer plugins, see User-Defined Lexer Plugins and Using a User-Defined Stemmer Plugin.

Function Summary for Custom Language Management

Lexer and stemmer plugin configuration is done through the custom language management library module. The module includes the following functions. For more details, see the XQuery/XSLT Function Reference or the MarkLogic Server-Side JavaScript Function Reference.

  • clang:language-config-read (XQuery), clang.languageConfigRead (JavaScript): Read the current custom language configuration. You should always begin your configuration changes by calling this function.
  • clang:language-config-write (XQuery), clang.languageConfigWrite (JavaScript): Commit custom language configuration changes. Your changes will not take effect unless you call this function. Note: Calling this function restarts MarkLogic.
  • clang:language-config-delete (XQuery), clang.languageConfigDelete (JavaScript): Remove all custom language configuration from your MarkLogic installation. Note: Calling this function restarts MarkLogic.
  • clang:update-user-language (XQuery), clang.updateUserLanguage (JavaScript): Modify a language config element to add or replace configuration for a specific language. Your change will not take effect until you call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript).
  • clang:delete-user-language (XQuery), clang.deleteUserLanguage (JavaScript): Modify a language config element to remove configuration for a specific language. Your change will not take effect until you call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript).
  • clang:user-language (XQuery), clang.userLanguage (JavaScript): Construct a custom language-to-plugin binding that can be used to update the custom language configuration. This is the unit of change for clang:update-user-language and clang.updateUserLanguage.
  • clang:user-language-plugin (XQuery), clang.userLanguagePlugin (JavaScript): Construct a custom lexer/stemmer plugin reference that can be used to update the configuration for a language.
  • clang:lexer (XQuery), clang.lexer (JavaScript): Construct a reference to a lexer capability in a native plugin. Use the output of this function as input to clang:user-language-plugin or clang.userLanguagePlugin.
  • clang:stemmer (XQuery), clang.stemmer (JavaScript): Construct a reference to a stemmer capability in a native plugin. Use the output of this function as input to clang:user-language-plugin or clang.userLanguagePlugin.

Customization Using a Built-In Lexer or Stemmer

This section describes how to construct a custom lexer or stemmer configuration item based on one of the built-in lexers or stemmers, rather than on a user-defined plugin.

Setting the library argument of clang:user-language-plugin or clang.userLanguagePlugin to an empty string tells MarkLogic you are referencing a built-in plugin. For example, the following call constructs a stemmer configuration item based on the built-in Snowball stemmer. Notice that the first parameter (library) is an empty string.

XQuery: clang:user-language-plugin("",(),clang:stemmer("snowball"))

JavaScript: clang.userLanguagePlugin('', null, clang.stemmer('snowball'))

The first argument of the stemmer constructor should be one of the built-in stemmer names from Built-in Stemmer Plugin Reference. You can configure a custom lexer at the same time by including a clang:lexer or clang.lexer configuration item as the second parameter of clang:user-language-plugin or clang.userLanguagePlugin.

If you associate a custom lexer dictionary with a language, you must reinstall it if you change the lexer plugin for the language. Similarly, if you associate a custom stemming dictionary with a language, you must reinstall it if you change the stemmer plugin for the language.

The following example creates a configuration item for German. The default lexer for German is ICU. The default stemmer for German is Bitext. The new configuration specifies Snowball as the custom stemmer and leaves the default lexer unchanged. In addition, the Snowball stemmer is configured to use the german2 stemming algorithm.

XQuery:
xquery version "1.0-ml";
import module namespace clang =
  "http://marklogic.com/xdmp/custom-language" 
  at "/MarkLogic/custom-language.xqy";

let $stemmer :=
  clang:stemmer("snowball",(),("code=german2"))
let $plugin := clang:user-language-plugin("",(),$stemmer)
let $german := clang:user-language("de",$plugin)

return 
  clang:update-user-language(
    clang:language-config-read(), $german)

Server-Side JavaScript:
'use strict';
const clang = require('/MarkLogic/custom-language');

const stemmer = clang.stemmer(
  'snowball', null, Sequence.from(['code=german2']));
const plugin = clang.userLanguagePlugin('', null, stemmer);
const germanConfig = clang.userLanguage('de', plugin);

clang.updateUserLanguage(
    clang.languageConfigRead(), germanConfig);

Note that this example doesn't actually change the language configuration because it does not call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript).

If you run the example in Query Console, you should see output similar to the following:

<lang:user-languages xml:lang="zxx" xmlns:lang="http://marklogic.com/xdmp/language">
  <lang:user-language>
    <lang:name>de</lang:name>
    <lang:plugin>
      <lang:library/>
      <lang:stemmer>
        <lang:variant>snowball</lang:variant>
        <lang:arg>code=german2</lang:arg>
      </lang:stemmer>
    </lang:plugin>
  </lang:user-language>
</lang:user-languages>
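
To make such a change take effect, the updated configuration has to be committed. The following sketch extends the XQuery example by passing the updated configuration element to clang:language-config-write. It assumes that function accepts the element returned by clang:update-user-language. Because committing restarts MarkLogic, treat this as something to run deliberately rather than as part of routine experimentation.

```xquery
xquery version "1.0-ml";
import module namespace clang =
  "http://marklogic.com/xdmp/custom-language"
  at "/MarkLogic/custom-language.xqy";

(: Build the German binding exactly as in the example above. :)
let $stemmer := clang:stemmer("snowball", (), ("code=german2"))
let $plugin := clang:user-language-plugin("", (), $stemmer)
let $german := clang:user-language("de", $plugin)

(: Start from the current configuration and add or replace the entry. :)
let $config :=
  clang:update-user-language(clang:language-config-read(), $german)

(: Commit the change. Note: this restarts MarkLogic. :)
return clang:language-config-write($config)
```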

Customization Using a User-Defined Lexer or Stemmer

This section describes how to construct a custom lexer or stemmer configuration item based on a user-defined plugin, rather than on one of the built-in plugins. A user-defined lexer or stemmer must be installed as a native plugin before you can use it.

When you construct a lexer (or stemmer) configuration item for a user-defined plugin, you must identify the native plugin and the capability from the plugin library that exposes the LexerUDF or StemmerUDF implementation.

For a lexer, set the variant argument of clang:lexer or clang.lexer to a LexerUDF capability registered by the plugin. For a stemmer, set the variant argument of clang:stemmer or clang.stemmer to a StemmerUDF capability registered by the plugin. For both, set the library argument to plugin_path/plugin_id.

For example, if you install a plugin with the path native and plugin id sampleplugin, and the lexer UDF capability registered by the plugin is named sample_lexer, then you'd construct a lexer configuration item for it as follows:

XQuery: clang:lexer("sample_lexer", (), (), "native/sampleplugin")

JavaScript: clang.lexer('sample_lexer', null, null, 'native/sampleplugin')

If you configure both a stemmer and lexer from the same native plugin, you can set the plugin library reference (native/sampleplugin) in clang:user-language-plugin or clang.userLanguagePlugin instead. For example:

XQuery: clang:user-language-plugin(
          "native/sampleplugin", 
          clang:lexer("sample_lexer"),
          clang:stemmer("sample_stemmer"))

JavaScript: clang.userLanguagePlugin(
              'native/sampleplugin', 
              clang.lexer('sample_lexer'),
              clang.stemmer('sample_stemmer'));

When a library is specified in both the lexer/stemmer constructor and the language plugin constructor, the library in the lexer/stemmer takes precedence.
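
The precedence rule can be modeled in a few lines of plain JavaScript. This is only an illustrative sketch: effectiveLibrary is a hypothetical helper, not part of the clang API, and it assumes that an empty string or null means "no library specified".

```javascript
// Hypothetical model of the precedence rule: a library set on the
// lexer or stemmer item wins over the one set on the language plugin.
function effectiveLibrary(pluginLibrary, itemLibrary) {
  // Assumption: null, undefined, and '' all mean "not specified".
  return itemLibrary ? itemLibrary : (pluginLibrary || null);
}

// Library set only on the language plugin: the item inherits it.
console.log(effectiveLibrary('native/sampleplugin', null));
// Library set in both places: the item-level library is used.
console.log(effectiveLibrary('native/sampleplugin', 'plugin1/lexers'));
```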

If you associate a custom lexer dictionary with a language, you must reinstall it if you change the lexer plugin for the language. Similarly, if you associate a custom stemming dictionary with a language, you must reinstall it if you change the stemmer plugin for the language.

The following example creates a configuration item for German. The default lexer for German is ICU. The default stemmer for German is Bitext. The new configuration specifies a user-defined lexer named sample_lexer as the custom lexer and leaves the default stemmer unchanged. Assume the plugin configuration described above.

Language Example
XQuery
xquery version "1.0-ml";
import module namespace clang =
  "http://marklogic.com/xdmp/custom-language" 
  at "/MarkLogic/custom-language.xqy";

let $lexer :=
  clang:lexer("sample_lexer",(),(), "native/sampleplugin")
let $plugin := clang:user-language-plugin("", $lexer, ())
let $german := clang:user-language("de", $plugin)

return 
  clang:update-user-language(
    clang:language-config-read(), $german)
Server-Side JavaScript
'use strict';
const clang = require('/MarkLogic/custom-language');

const lexer = clang.lexer(
  'sample_lexer', null, null, 'native/sampleplugin');
const plugin = clang.userLanguagePlugin('', lexer, null);
const germanConfig = clang.userLanguage('de', plugin);

clang.updateUserLanguage(
    clang.languageConfigRead(), germanConfig);

Note that this example doesn't actually change the language configuration because it does not call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript).

If you run the example in Query Console, you should see output similar to the following:

<lang:user-languages xml:lang="zxx"
    xmlns:lang="http://marklogic.com/xdmp/language">
  <lang:user-language>
    <lang:name>de</lang:name>
    <lang:plugin>
      <lang:library/>
      <lang:lexer>
        <lang:library>native/sampleplugin</lang:library>
        <lang:variant>sample_lexer</lang:variant>
      </lang:lexer>
    </lang:plugin>
  </lang:user-language>
</lang:user-languages>

Example: Adding Configuration for a Language

Use the clang:user-language-plugin XQuery function or the clang.userLanguagePlugin Server-Side JavaScript function to define a binding between a language and custom tokenization and stemming plugins. For more details, see Customization Using a Built-In Lexer or Stemmer and Customization Using a User-Defined Lexer or Stemmer.

To put the configuration change into effect, use the following pattern. A complete example follows.

Language Example
XQuery
clang:language-config-write(
  clang:update-user-language(
    clang:language-config-read(), $changed-lang)
  )
JavaScript
clang.languageConfigWrite(
  clang.updateUserLanguage(
    clang.languageConfigRead(), changedLang));

This operation is an overwrite: Any previous configuration for the language will be replaced. Thus, if you are going to configure both a lexer and a stemmer for a language, do it in a single call to clang:update-user-language or clang.updateUserLanguage.
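
The overwrite behavior can be pictured with a small in-memory model. The sketch below is plain JavaScript, not MarkLogic code: applyLanguageUpdate is a hypothetical stand-in for the update step, and the Map of language codes to bindings is an assumed simplification of the real configuration.

```javascript
// Hypothetical in-memory model of a language configuration:
// a Map from language code to its plugin bindings.
function applyLanguageUpdate(config, lang, bindings) {
  const next = new Map(config); // leave the input untouched
  next.set(lang, bindings);     // overwrite, never merge
  return next;
}

let config = new Map();
config = applyLanguageUpdate(config, 'de', { lexer: 'sample_lexer' });
config = applyLanguageUpdate(config, 'de', { stemmer: 'sample_stemmer' });

// The second update replaced the first: the lexer binding is gone.
console.log(config.get('de'));

// Configuring both in a single update keeps both bindings.
config = applyLanguageUpdate(config, 'de',
  { lexer: 'sample_lexer', stemmer: 'sample_stemmer' });
console.log(config.get('de'));
```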

Calling clang:language-config-write or clang.languageConfigWrite causes MarkLogic to restart.

The following example configures a custom stemmer and lexer for the Catalan language using a user-defined plugin. Assume the plugin registers a lexer named sample_lexer and a stemmer named sample_stemmer.

Language Example
XQuery
xquery version "1.0-ml";
import module namespace clang="http://marklogic.com/xdmp/custom-language"
  at "/MarkLogic/custom-language.xqy";

(: Construct custom lexer and stemmer binding for Catalan :)
let $catalan :=
  clang:user-language("ca",
    clang:user-language-plugin("native/sampleplugin",
      clang:lexer("sample_lexer"),
      clang:stemmer("sample_stemmer")
    )
  )
(: Get the existing config so we can update it :)
let $existing :=  clang:language-config-read()
return
  (: Update the current config and commit the changes :)
  (: NOTE: Causes a restart :)
  clang:language-config-write(
    clang:update-user-language($existing, $catalan)
  )
JavaScript
'use strict';
const clang = require('/MarkLogic/custom-language.xqy');

// Construct a custom lexer and stemmer binding for Catalan
const catalan = clang.userLanguage(
  'ca', 
  clang.userLanguagePlugin(
    'native/sampleplugin', 
    clang.lexer('sample_lexer'), 
    clang.stemmer('sample_stemmer'))
  );
// Get the existing config so we can update it
const existing = clang.languageConfigRead();

// Update the current config and commit the changes.
// NOTE: Causes a restart
clang.languageConfigWrite(
  clang.updateUserLanguage(existing, catalan));

You can configure just a lexer or just a stemmer for a language by including just that reference when calling clang:user-language-plugin or clang.userLanguagePlugin. For example, the following code only configures a custom stemmer.

Language Example
XQuery
xquery version "1.0-ml";
import module namespace clang="http://marklogic.com/xdmp/custom-language"
  at "/MarkLogic/custom-language.xqy";

let $catalan :=
  clang:user-language("ca",
    clang:user-language-plugin("native/sampleplugin",
      (), clang:stemmer("sample_stemmer")
    )
  )
let $existing :=  clang:language-config-read()
return
  clang:language-config-write(
    clang:update-user-language($existing, $catalan)
  )
JavaScript
'use strict';
const clang = require('/MarkLogic/custom-language.xqy');

const catalan = clang.userLanguage(
  'ca', 
  clang.userLanguagePlugin(
    'native/sampleplugin', null,
    clang.stemmer('sample_stemmer'))
  );
const existing = clang.languageConfigRead();

clang.languageConfigWrite(
  clang.updateUserLanguage(existing, catalan));

To configure a lexer and stemmer from different plugin libraries for the same language, specify the plugin path to the lexer and stemmer reference constructors. For example, the following code configures a lexer and a stemmer from two different plugins:

Language Example
XQuery
xquery version "1.0-ml";
import module namespace clang="http://marklogic.com/xdmp/custom-language"
  at "/MarkLogic/custom-language.xqy";

let $catalan :=
  clang:user-language("ca",
    clang:user-language-plugin("",
      clang:lexer("my_lexer",(),(), "plugin1/lexers"),
      clang:stemmer("my_stemmer", (), (), "plugin2/stemmers")
    )
  )
let $existing :=  clang:language-config-read()
return
  clang:language-config-write(
    clang:update-user-language($existing, $catalan)
  )
JavaScript
'use strict';
const clang = require('/MarkLogic/custom-language.xqy');

const catalan = clang.userLanguage(
  'ca', 
  clang.userLanguagePlugin(
    '', 
    clang.lexer('my_lexer', null, null, 'plugin1/lexers'), 
    clang.stemmer('my_stemmer', null, null, 'plugin2/stemmers'))
  );
const existing = clang.languageConfigRead();

clang.languageConfigWrite(
  clang.updateUserLanguage(existing, catalan));

Example: Removing Configuration for a Language

Use the clang:delete-user-language XQuery function or the clang.deleteUserLanguage JavaScript function to remove the custom configuration for a specific language. You must call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript) for your change to take effect, and doing so will cause MarkLogic to restart.

The following example removes the configuration for the Catalan language (language code ca).

Language Example
XQuery
xquery version "1.0-ml";
import module namespace clang="http://marklogic.com/xdmp/custom-language"
  at "/MarkLogic/custom-language.xqy";

let $language := "ca"
let $existing :=  clang:language-config-read()
return
  clang:language-config-write(
    clang:delete-user-language($existing, $language)
  )
JavaScript
'use strict';
const clang = require('/MarkLogic/custom-language.xqy');

const language = 'ca';
const existing = clang.languageConfigRead();

clang.languageConfigWrite(
  clang.deleteUserLanguage(existing, language));

Example: Resetting Configuration for All Languages

To remove custom stemmer and lexer bindings for all languages, use the clang:language-config-delete XQuery function or the clang.languageConfigDelete Server-Side JavaScript function.

Calling these functions restarts MarkLogic.

The following example code removes all language customizations and restarts the server.

Language Example
XQuery
xquery version "1.0-ml";
import module namespace clang="http://marklogic.com/xdmp/custom-language"
  at "/MarkLogic/custom-language.xqy";

clang:language-config-delete()
JavaScript
'use strict';
const clang = require('/MarkLogic/custom-language.xqy');

clang.languageConfigDelete();

Understanding Stemming Delegation

You can use delegation to control whether stemming consults the default stemmer in addition to the custom plugin for a language. Delegation can be controlled at two levels:

  • User-defined stemming plugins and some built-in plugins include a delegation control on their interface. For example, the built-in Bitext plugin accepts a delegation option, and the StemmerUDF interface for user-defined plugins has a delegate method.
  • The clang:user-language-plugin XQuery function and the clang.userLanguagePlugin JavaScript function accept a boolean delegate parameter. When set to true (the default), the stemming process asks the plugin whether or not to delegate.

If delegation is enabled at the language plugin configuration level, then the stemming process will consult the custom plugin about whether or not to delegate. For example, it will call the delegate method on StemmerUDF. If delegation is disabled at the language plugin configuration level, then the stemming process will not consult the custom plugin and will never delegate to the default stemmer.

Delegation has the following effect on the stemming results. The first column indicates whether or not the delegate parameter of clang:user-language-plugin or clang.userLanguagePlugin is set to true. The second column indicates whether or not the custom plugin agrees to delegate; for example, whether StemmerUDF::delegate returns true.

Lang Config Delegate | Plugin Says to Delegate | Result
true | true | stems = (stems from plugin + stems from default); if no stems are found, word is self-stemming
true | false | stems = stems from plugin; if no stems are found, word is self-stemming
false | N/A | stems = stems from plugin; if no stems are found, word is self-stemming

The following table contains examples of the stemming result with various delegation and stem count combinations. The with Delegation column signifies the plugin was consulted and agreed to delegation. The without Delegation column signifies either the plugin was not consulted or the plugin did not agree to delegation.

Input | Custom Plugin Stems | Default Stems | Final Result with Delegation | Final Result without Delegation
moogled | moogle | (none) | moogle | moogle
pabbling | peeble | pabble | peeble, pabble | peeble
furben | (none) | furby | furby | furben (self-stem)
zorks | (none) | (none) | zorks (self-stem) | zorks (self-stem)
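
The two tables above can be reproduced with a short plain-JavaScript model. This is a sketch for reasoning about results, not MarkLogic's implementation: finalStems and its parameters are hypothetical names, and collapsing duplicate stems is an assumption.

```javascript
// Hypothetical model of the delegation tables.
// configDelegate: the delegate parameter on the language plugin.
// pluginDelegates: what the custom plugin answers when consulted.
function finalStems(word, pluginStems, defaultStems,
                    configDelegate, pluginDelegates) {
  // The plugin's answer only matters when the configuration allows it.
  const delegate = configDelegate && pluginDelegates;
  const stems = delegate
    ? [...pluginStems, ...defaultStems]
    : [...pluginStems];
  // Assumption: duplicate stems are collapsed.
  const unique = [...new Set(stems)];
  // No stems at all: the word stems to itself.
  return unique.length > 0 ? unique : [word];
}

const rows = [
  ['moogled',  ['moogle'], []],
  ['pabbling', ['peeble'], ['pabble']],
  ['furben',   [],         ['furby']],
  ['zorks',    [],         []],
];
for (const [word, custom, dflt] of rows) {
  // Print the result with delegation, then without.
  console.log(word,
    finalStems(word, custom, dflt, true, true),
    finalStems(word, custom, dflt, true, false));
}
```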

A custom plugin determines its own delegation policy. For example, a plugin might choose among policies such as always (delegate regardless of the number of stems found), never (never delegate), or on empty (delegate only if the plugin found no stems).
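
As a rough illustration of those three policies, the following plain-JavaScript sketch shows the decision a plugin might make. The function shouldDelegate is hypothetical and does not reflect the StemmerUDF interface; the policy strings borrow the spelling of the Bitext delegation option values.

```javascript
// Hypothetical decision function for a plugin's delegation policy.
// pluginStemCount is the number of stems the plugin itself found.
function shouldDelegate(policy, pluginStemCount) {
  switch (policy) {
    case 'always':   return true;                  // regardless of stems found
    case 'never':    return false;                 // plugin stems only
    case 'on-empty': return pluginStemCount === 0; // only when nothing found
    default:
      throw new Error('unknown delegation policy: ' + policy);
  }
}

console.log(shouldDelegate('on-empty', 0)); // plugin found nothing
console.log(shouldDelegate('on-empty', 2)); // plugin found two stems
```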

Custom Dictionary Security Considerations

When you configure a language to use a custom user-defined stemmer or lexer, and also associate a custom dictionary with the language, then you can create special security privileges to enable finer control over who can administer the dictionary.

A custom dictionary is associated with both a language and a specific stemmer or lexer plugin. The lexer or stemmer is implicit in the configuration of the language. Usually, any user with the custom-dictionary-admin role or equivalent privileges can add, update, or delete a custom dictionary for any language-stemmer or language-lexer configuration.

You can create privileges of the following form to make it possible to control dictionary management on a per stemmer/lexer basis.

http://marklogic.com/xdmp/privileges/custom-dictionary-admin/library
http://marklogic.com/xdmp/privileges/xdmp-write-cluster-config-file/library

Where library is of the form plugin_path/plugin_id and identifies a user-defined lexer or stemmer plugin.

For example, if you install a user-defined lexer plugin with the plugin path native and the plugin id sampleplugin, then you would create privileges of the following form:

http://marklogic.com/xdmp/privileges/custom-dictionary-admin/native/sampleplugin
http://marklogic.com/xdmp/privileges/xdmp-write-cluster-config-file/native/sampleplugin

MarkLogic will not create these privileges for you, but it will check for and enforce them if the privileges exist.
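
If you manage several plugins, it may help to derive the privilege URIs from the plugin path and id rather than typing them by hand. The following plain-JavaScript sketch only builds the strings; dictionaryPrivileges is a hypothetical helper, and creating the privileges themselves is left to your usual security tooling.

```javascript
const PRIVILEGE_BASE = 'http://marklogic.com/xdmp/privileges/';

// Hypothetical helper: build the two per-library privilege URIs
// for a plugin identified by plugin_path/plugin_id.
function dictionaryPrivileges(pluginPath, pluginId) {
  const library = pluginPath + '/' + pluginId;
  return [
    PRIVILEGE_BASE + 'custom-dictionary-admin/' + library,
    PRIVILEGE_BASE + 'xdmp-write-cluster-config-file/' + library,
  ];
}

console.log(dictionaryPrivileges('native', 'sampleplugin').join('\n'));
```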

Built-in Lexer Plugin Reference

The table below lists the built-in lexers (tokenizers), which languages each one is configured for by default, and what configuration options (if any) are available for customization. Use the configuration lexer name and configuration options when calling the clang:lexer XQuery function or the clang.lexer JavaScript function; for details, see Customization Using a Built-In Lexer or Stemmer.

Lexer Name Description
simple lexer The default lexer for English, Norwegian, and languages without advanced support. You cannot specify this lexer as a custom plugin, and it has no configuration options.
icu Default tokenizer for most licensed languages. Users might want to switch to this tokenizer for English to pick up better apostrophe and contraction handling, or for languages without advanced support. This lexer accepts no extra arguments.
kytea Default tokenizer for Chinese. With the appropriate language model, this tokenizer could be used for other languages. You can customize the behavior of this lexer using the following arguments:
model_filename

Required. The name of a model file in the MARKLOGIC_DIR/Lang directory of all hosts in the cluster. The model is used for tokenization. Only UTF-8 models are supported. Create a model file using the KyTea tools on a corpus, possibly augmented with dictionaries.

KyTea offers Japanese models and alternative models for Chinese at http://www.phontron.com/kytea/model.html.

atilika Default tokenizer for Japanese. You can customize the behavior of this lexer using the following arguments:
search-mode
normal-mode
Specify the handling of compound words: search-mode breaks up compound words, while normal-mode does not. Default: search-mode.

Built-in Stemmer Plugin Reference

The tables below list the built-in stemmers. Use the stemmer name and options when constructing a stemmer configuration item using the clang:stemmer XQuery function or the clang.stemmer JavaScript function.

MarkLogic uses the following built-in stemmers by default:

Stemmer Name Description
simple stemmer The default stemmer for languages without advanced stemming support.
bitext Default stemmer for all languages with advanced stemming support except Chinese and Japanese. Chinese is not stemmed, and Japanese uses the Atilika stemmer.
snowball Default stemmer for Danish, Finnish, Hungarian, Romanian, Tamil, and Turkish.
atilika Default stemmer for Japanese.

See the following topics for configuration options.

Bitext Stemmer Options

The Bitext stemmer supports the following options that you can specify in the args parameter of clang:stemmer or clang.stemmer.

Bitext Option Description
code=value A language code to be passed to Bitext. This is a 3-letter code, such as DEU for German.
dict=value Which dictionary to use. You can specify multiple dictionaries by specifying this argument multiple times. The dictionary must be in MARKLOGIC_DIR/Lang on all hosts in the cluster. The dictionary must be in Bitext's format.
decompounding
no-decompounding
Enable/disable decompounding. If the language does not support decompounding, this is a no-op. Default: no-decompounding.
delegation=value Whether to delegate to the base stemmer, if there is one. Allowed values: always (always delegate, meaning Bitext stems are always added to the base stemmer), on-empty (delegate to the base stemmer only if the Bitext dictionary had no entry for the word), or never (no delegation). Default: on-empty.
algorithm=value

Which stemming algorithm to use. If not specified, MarkLogic uses the default algorithm for the language.

Choose from the following values: arabic, danish, dutch, english, finnish, french, german, german2, hungarian, italian, porter (Porter algorithm for English), portuguese, romanian, russian, spanish, swedish, turkish, tamil, persian, korean, english2, french2, german3, italian2, spanish2, swedish2. The values english2, french2, german3, italian2, spanish2, and swedish2 specify a lemmatizing algorithm for that language, for use with Bitext dictionaries.

pre-stemmer=value Which pre-stemming algorithm to use. Pre-stemmers perform normalization on the input to make better use of the Bitext dictionaries. Choose one of the following values: normalize_latin (map fullwidth characters to regular Latin characters; map ligatures to their components), arabic_transliteration (transliterate Arabic characters to ASCII; required for Arabic because it uses transliterated dictionaries).
use-algorithm
no-use-algorithm
Enable/disable the stemming algorithm backing the Bitext dictionary. Default: use-algorithm. Does not apply to the pre-stemmer.
use-dictionary
no-use-dictionary
Enable/disable the Bitext dictionary. Default: use-dictionary (look up entries in the dictionary). If the dictionary is disabled, the stemmer will perform pre-stemming and (if the algorithm is enabled) stemming.
lowercase
no-lowercase
Enable/disable lowercasing of the input string. Many of the standard algorithms use uppercase letters as markers and will not work properly if there are uppercase letters in the input. Default: no-lowercase.
nfkd
no-nfkd
Enable/disable NFKD normalization of the input string. Some stemming algorithms do not work correctly when the input has been NFKD normalized. Default: no-nfkd.
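
Each option in the table above is passed as a separate string: name=value for options that take a value, or the bare flag name otherwise. The plain-JavaScript sketch below assembles such a list for a few of the options. The helper bitextArgs and the dictionary file names are hypothetical; the resulting array is the kind of value you would supply as the args parameter.

```javascript
// Hypothetical helper: turn an options object into Bitext arg strings.
function bitextArgs(opts) {
  const args = [];
  if (opts.code) args.push('code=' + opts.code);
  // dict may be given several times, once per dictionary.
  for (const d of opts.dict || []) args.push('dict=' + d);
  if (opts.decompounding !== undefined) {
    args.push(opts.decompounding ? 'decompounding' : 'no-decompounding');
  }
  if (opts.delegation) {
    if (!['always', 'on-empty', 'never'].includes(opts.delegation)) {
      throw new Error('invalid delegation value: ' + opts.delegation);
    }
    args.push('delegation=' + opts.delegation);
  }
  return args;
}

// German, two dictionaries (file names are made up), decompounding on.
console.log(bitextArgs({
  code: 'DEU',
  dict: ['a.dic', 'b.dic'],
  decompounding: true,
  delegation: 'always',
}));
```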
Snowball Stemmer Options

The Snowball stemmer supports the following options that you can specify in the args parameter of clang:stemmer or clang.stemmer.

Snowball Option Description
code=value

Which stemming algorithm to use. Optional. If unspecified, use the default algorithm for the language.

Choose from the following values: arabic, danish, dutch, english, finnish, french, german, german2, hungarian, italian, porter (Porter algorithm for English), portuguese, romanian, russian, spanish, swedish, turkish, tamil, persian, korean, english2, french2, german3, italian2, spanish2, swedish2. The values english2, french2, german3, italian2, spanish2, and swedish2 specify a lemmatizing algorithm for that language, for use with Bitext dictionaries.

Atilika Stemmer Options

The Atilika stemmer supports the following options that you can specify in the arguments parameter of clang:stemmer or clang.stemmer.

Atilika Option Description
add-reading
no-add-reading
Specify whether or not to add the Katakana reading as an alternative stem. Default: no-add-reading. (MarkLogic's default configuration for Atilika uses add-reading, but if you're configuring Atilika as a custom plugin, the default is no-add-reading.)
delegation
Whether or not to delegate to the base stemmer. Allowed values: never, always, on-empty. Default: on-empty.

Language Support in JSON

Overview

Beginning with version 10.0-1, MarkLogic Server allows natural language content in JSON to be tagged with a language other than the default database language. When a language or lang tag is present in a JSON object, all textual content in that object is processed under the language identified by the ISO code in that tag. JSON language processing is very similar to XML language processing (described earlier in this chapter), with the following differences:

A JSON document is allowed to have multiple language or lang tags in its content. A JSON node containing the key language will be processed according to that language. All descendant nodes will be processed according to that language. Language tags may be placed at any level in the JSON and are applied in a simple hierarchical way.

{
    "language": "en-US",
    "description": "This is US English text",
    "components": [
        "Still US English",
        {
            "language": "nl",
            "data": "Dutch stuff"
        },
        {
            "language": "es",
            "data": "Spanish stuff"
        },
        {
            "data": "More US English"
        }
    ]
}

In the above example, content indexed with a particular language will have the key for that language added to the re-indexer keys stored with the document, as is now the case with XML content.
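
To make the hierarchical rule concrete, the following plain-JavaScript sketch walks a parsed JSON value and reports which language applies to each string. It is a model only, not MarkLogic code: effectiveLanguages is a hypothetical helper, the 'en' default stands in for the database default language, preferring language over lang when both are present is an assumption, and any effects of property order during parsing are ignored.

```javascript
// Hypothetical model: report the language that applies to each
// string value in a parsed JSON document.
function effectiveLanguages(node, inherited, path = '$', out = []) {
  if (typeof node === 'string') {
    out.push([path, inherited]);
  } else if (Array.isArray(node)) {
    node.forEach((child, i) =>
      effectiveLanguages(child, inherited, `${path}[${i}]`, out));
  } else if (node !== null && typeof node === 'object') {
    // A language or lang key sets the language for this object
    // and for everything below it.
    const tag = node.language ?? node.lang;
    const lang = typeof tag === 'string' ? tag : inherited;
    for (const [key, child] of Object.entries(node)) {
      // The tag itself is not reported by this model.
      if (key === 'language' || key === 'lang') continue;
      effectiveLanguages(child, lang, `${path}.${key}`, out);
    }
  }
  return out;
}

const doc = {
  language: 'en-US',
  description: 'This is US English text',
  components: [
    'Still US English',
    { language: 'nl', data: 'Dutch stuff' },
    { language: 'es', data: 'Spanish stuff' },
    { data: 'More US English' },
  ],
};
for (const [path, lang] of effectiveLanguages(doc, 'en')) {
  console.log(path, lang);
}
```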

When JSON is being parsed (for example from a file), making the language tag apply to preceding siblings would be expensive and require us to parse the whole object before doing any node construction on it. Serialization will put the language child first.

API Changes

The fn:lang function and the underlying datamodel functions that support it now handle JSON nodes as well as XML nodes.

fn:lang($testlang as xs:string?, [$node as node()]) as xs:boolean

The function fn:lang already exists with the above signature. In previous versions, it always returned false for a JSON node. Starting with version 10.0-1 of MarkLogic, it returns true if the JSON node or one of its ancestors has a lang or language key that matches $testlang per the rules defined for the xml:lang attribute on XML nodes.

For example, the following will return true.

fn:lang("en", object-node{ "language" : "en-US", "item" : "example" } )
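
The matching rule that fn:lang applies to xml:lang values is a case-insensitive comparison that also accepts a hyphen-separated subtag after the test language. The plain-JavaScript sketch below models just that comparison; langMatches is a hypothetical helper and does not walk ancestors the way fn:lang does.

```javascript
// Hypothetical model of the fn:lang comparison for a single value.
// nodeLang is the language found on the node or an ancestor.
function langMatches(testlang, nodeLang) {
  if (!nodeLang) return false; // no language in scope
  const t = testlang.toLowerCase();
  const n = nodeLang.toLowerCase();
  // Exact match, or the test language followed by "-" and a subtag.
  return n === t || n.startsWith(t + '-');
}

console.log(langMatches('en', 'en-US')); // "en" covers "en-US"
console.log(langMatches('en-US', 'en')); // the reverse does not hold
```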

JSON Serialization

Serialization of JSON objects puts the language tag first (with the limitations noted above).

For example:

xdmp:to-json( object-node{ "item" : "example", "language" : "en-US" } )

will return

{"language":"en-US", "item":"example"}
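
If you post-process JSON outside MarkLogic and want the same key order, something like the following plain-JavaScript sketch can reorder a single object. The helper languageFirst is hypothetical, handles only the top level of one object, and relies on JavaScript preserving insertion order for non-numeric string keys.

```javascript
// Hypothetical helper: copy an object so that any language or lang
// key comes before the other keys.
function languageFirst(obj) {
  const out = {};
  for (const key of ['language', 'lang']) {
    if (key in obj) out[key] = obj[key];
  }
  for (const [key, value] of Object.entries(obj)) {
    if (!(key in out)) out[key] = value;
  }
  return out;
}

console.log(JSON.stringify(
  languageFirst({ item: 'example', language: 'en-US' })));
```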

Upgrade Considerations

You might already have JSON documents that use language or lang properties for some other purpose. In that case, normal language processing attempts to look up the given language and treats all unknown tags as equivalent. The content of the language property itself is still indexed normally; the issue is that the affected content is indexed as "unknown language" instead of "default database language". That is a potential incompatibility and a potential risk. The risk is reduced by the fact that some JSON formats already use language or lang for precisely this purpose, so this is unlikely to be an issue in practice. In addition, MarkLogic only attempts to apply a language or lang property if the node is a text node.
