MarkLogic Server supports loading and querying content in multiple languages. This chapter describes how languages are handled in MarkLogic Server, and includes the following sections:
In MarkLogic Server, the language of the content is specified when you load the content and the language of the query is specified when you query the content. At load-time, the content is tokenized, indexed, and stemmed (if enabled) based on the language specified during the load. Also, MarkLogic Server uses any languages specified at the element level in the XML markup for the content (see xml:lang Attribute), making it possible to load documents with multiple languages.
Similarly, at query time, search terms are tokenized (and stemmed) based on the language specified in the cts:query expression. The result is that a query performed in one language might not yield the same results as the same query performed in another language, as both the indexes that store the information about the content and the queries against the content are language-aware.
Even if your content is entirely in a single language, MarkLogic Server is still multiple-language aware. For MarkLogic to behave as if there is only a single language, all the following must be true:
xml:lang attributes.

Because MarkLogic Server is multiple-language aware, it is important to understand the fundamental aspects of languages when loading and querying content in MarkLogic Server. The remainder of this chapter describes these details, particularly the following topics:
To understand the language implications of querying and loading documents, you must first understand tokenization and stemming, which are both language-specific. This section describes these topics, and has the following parts:
When you search for a string (typically a word or a phrase) in MarkLogic Server, or when you load content (which is made up of text strings) into MarkLogic Server, the string is split into parts, each of which is called a token. Each token is classified as a word, punctuation, or whitespace. The process of breaking down strings into tokens is called tokenization.
Tokenization occurs during document loading as well as during query evaluation. The two processes are independent of each other. The tokenization of documents during loading affects indexing. The tokenization of query text affects how search terms are resolved. Though the processes are independent, they use the same tokenizer (for a given language).
Tokenization is language-specific; that is, a given string is tokenized differently depending on the language in which it is tokenized. The language is determined based on the language specified at load or query time (or the database default language if no language is specified) and on any xml:lang attributes in the content (for details, see xml:lang Attribute).
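The word, punctuation, and whitespace classification can be illustrated with a small standalone sketch. This is not MarkLogic's lexer: `toyTokenize` is a hypothetical helper that assumes runs of letters or digits are words, runs of whitespace are spaces, and every other character is punctuation, with no language awareness at all.

```javascript
// Toy tokenizer, NOT MarkLogic's lexer. It only illustrates the
// word / punctuation / space classification described above.
function toyTokenize(text) {
  const tokens = [];
  // Group 1: run of letters or digits (word).
  // Group 2: run of whitespace (space).
  // Group 3: any other single character (punctuation).
  const pattern = /([\p{L}\p{N}]+)|(\s+)|(.)/gsu;
  for (const m of text.matchAll(pattern)) {
    if (m[1] !== undefined) tokens.push({ type: 'word', text: m[1] });
    else if (m[2] !== undefined) tokens.push({ type: 'space', text: m[2] });
    else tokens.push({ type: 'punctuation', text: m[3] });
  }
  return tokens;
}

const sampleTokens = toyTokenize('this is, obviously, a phrase');
console.log(sampleTokens.map((t) => t.type).join(' '));
```

A real lexer differs by language (for example, in how it treats apostrophes or scripts without spaces), which is why the language used at load time and at query time matters.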
Note the following about the way strings are tokenized in MarkLogic Server:
    xdmp:describe(cts:tokenize("this is, obviously, a phrase", "en"), 100)
    =>
    (cts:word("this"), cts:space(" "), cts:word("is"),
     cts:punctuation(","), cts:space(" "), cts:word("obviously"),
     cts:punctuation(","), cts:space(" "), cts:word("a"),
     cts:space(" "), cts:word("phrase"))
If no language is specified in a cts:query expression, then it takes on the default language of the database.

    let $x := <el xml:lang="zh">Chinese-text-here hello</el>
    return $x[cts:contains(., cts:word-query("hello", ("stemmed", "lang=en")))]
    =>
    <el xml:lang="zh">Chinese-text-here hello</el>
A stemmed search for the Latin characters in a non-English language, however, will not find the non-English word stems (it will only find the non-English word itself, which stems to itself). Similarly, Asian or Middle Eastern characters will tokenize in a language appropriate to the character set, even when they occur in elements that are not in their language. The result is that searches in English sometimes match content that is labeled in an Asian or Middle Eastern character set, and vice-versa. For example, consider the following (zh is the language code for Simplified Chinese):
    let $x :=
      <root>
        <el xml:lang="en">hello</el>
        <el xml:lang="fr">hello</el>
        <el xml:lang="zh">hello</el>
      </root>
    return $x//el[cts:contains(., cts:word-query("hello", ("stemmed", "lang=en")))]
    =>
    <el xml:lang="en">hello</el>
    <el xml:lang="zh">hello</el>
This search, even though in English, returns both the element in English and the one in Chinese. It returns the Chinese element because the word hello is in Latin characters and therefore tokenizes as English even inside the Chinese element, so it matches the English query.
A stemmed search for a term matches all the terms that have the same stem as the search term (which includes the exact same terms in the language specified in the query). The purpose of stemming is to increase the recall for a search. For details about how stemming works in MarkLogic Server, including the different types of stemming available, see Understanding and Using Stemmed Searches. This section describes how the language settings affect stemmed searches.
Words derived from the same meaning and part of speech have the same stem (for example, mouse and mice). A word can have multiple stems if the word can be used as multiple parts of speech (for example, play can be both a noun and a verb in English), or if there are two words with the same spelling. If you enable advanced stemming, then stemmed searches find all of the words having the same stem as any of the stems. Advanced stemming finds multiple stems for a word.
Stemming is a language-specific operation. For example, the word chat is a different word in French than it is in English. In French, chat is a noun meaning cat, while in English, it is a verb. In French, chatting is not a word, and therefore it does not stem to chat. But in English, chatting does stem to chat. Therefore, stemmed searches in one language might find different results than stemmed searches in another.
When you construct a query, you can specify a language to use for stemmed search. For example, the following cts:query expression specifies a stemmed search in French for the word chat, and it only matches tokens that are stemmed in French.
cts:word-query("chat", ("stemmed", "lang=fr"))
For more details about how languages affect queries, see Querying Documents By Languages.
At load time, the specified language is used to determine in which language to stem the words in the document. For more details about the language aspects of loading documents, see Language Aspects of Loading and Updating Documents.
For details about the syntax of the various cts:query constructors, see the MarkLogic XQuery and XSLT Function Reference.
This section describes the impact of languages on loading and updating documents, and includes the following sections:
Tokenization and stemming occur when loading documents, just as they do when querying documents (for details, see Language-Specific Tokenization and Stemmed Searches in Different Languages). When loading documents, the stemmed search indexes are created based on the language. The tokenization and stemming at load time is completely independent from the tokenization and stemming at query time.
You can specify languages in XML documents at the element level by using the xml:lang attribute. MarkLogic Server uses the xml:lang attribute to determine the language with which to tokenize and stem the contents of that element. Note the following about the xml:lang attribute:
- The xml:lang attribute (see https://www.w3.org/TR/xml/#sec-lang-tag) has some special properties, such as not needing to declare the namespace bound to the xml prefix, and that it is inherited by all children of the element (unless they explicitly have a different xml:lang value).
- You can add an xml:lang attribute to the root node of an XML document during loading by specifying the default-language option to xdmp:document-load; without the default-language option, the root node will remain as-is.
- If no xml:lang attribute is present, then the document is processed in the default language of the database into which it is loaded.
- The xml:lang attribute only applies to stemmed search terms; the word searches (unstemmed) database configuration setting indexes terms irrespective of language. Tokenization of terms honors the xml:lang value for both the stemmed searches and word searches index settings in the database configuration.
- All of the text node children and text node descendants of an element with an xml:lang attribute are treated as the language specified in the xml:lang attribute, unless a child element has an xml:lang attribute with a different value. If so, any text node children and text node descendants are treated as the new language, and so on until no other xml:lang attributes are encountered.
- The value of the xml:lang attribute must conform to the following lexical standard: http://www.ietf.org/rfc/rfc3066.txt. The following are some typical xml:lang attributes (specifying French, Simplified Chinese, and English, respectively):

      xml:lang="fr"
      xml:lang="zh"
      xml:lang="en"
- If an element has an xml:lang attribute with a value of the empty string (xml:lang=""), then any xml:lang value in effect (from some ancestor xml:lang value) is overridden for that element; its value takes on the database language default. Additionally, if a default-language option is specified during loading, any empty string xml:lang values are replaced with the language specified in the default-language option. For example, consider the following XML:

      <rhone xml:lang="fr">
        <wine>vin rouge</wine>
        <wine xml:lang="">red wine</wine>
      </rhone>
In this sample, the phrase vin rouge is treated as French, and the phrase red wine is treated in the default language for the database (English by default).
If this sample were loaded with a default-language option specifying Italian (specifying <default-language>it</default-language> for the xdmp:document-load option, for example), then the resulting document would be as follows:
    <rhone xml:lang="fr">
      <wine>vin rouge</wine>
      <wine xml:lang="it">red wine</wine>
    </rhone>
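The inheritance and empty-string rules for xml:lang can be traced with a short standalone sketch. It models elements as plain objects rather than real XML nodes; `effectiveLanguages` and `applyDefaultLanguage` are hypothetical helper names, and the database default is assumed to be English as in the sample above.

```javascript
// Elements are modeled as { name, lang?, children: [string | element] }.
// This is an illustration of the rules in the text, not MarkLogic code.

// Return the language in effect for every text node under `element`.
function effectiveLanguages(element, databaseDefault, inherited) {
  let lang = inherited === undefined ? databaseDefault : inherited;
  if (element.lang !== undefined) {
    // xml:lang="" cancels the inherited value and falls back to the database default.
    lang = element.lang === '' ? databaseDefault : element.lang;
  }
  const out = [];
  for (const child of element.children) {
    if (typeof child === 'string') out.push({ text: child, lang });
    else out.push(...effectiveLanguages(child, databaseDefault, lang));
  }
  return out;
}

// Mimic loading with a default-language option: empty xml:lang values
// are replaced by the given language. Returns a modified copy.
function applyDefaultLanguage(element, defaultLanguage) {
  const copy = { ...element };
  if (copy.lang === '') copy.lang = defaultLanguage;
  copy.children = element.children.map((c) =>
    typeof c === 'string' ? c : applyDefaultLanguage(c, defaultLanguage));
  return copy;
}

const rhone = {
  name: 'rhone',
  lang: 'fr',
  children: [
    { name: 'wine', children: ['vin rouge'] },
    { name: 'wine', lang: '', children: ['red wine'] },
  ],
};

console.log(effectiveLanguages(rhone, 'en'));
console.log(effectiveLanguages(applyDefaultLanguage(rhone, 'it'), 'en'));
```

In the first call, vin rouge resolves to fr and red wine to the database default. In the second, the empty xml:lang has been rewritten to it before resolution, matching the loaded document shown above.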
When you load content into MarkLogic Server, it determines how to index the content based on several factors, including the language specified during the load operation, the default language of the database, and any languages encoded into the content with xml:lang attributes. Note the following about languages with respect to loading content, updating content, and changing language settings on a database:
- reindex enable is set to true.
- Documents with no xml:lang attribute are indexed upon load or update in the database default language.
- Content in an element with an xml:lang attribute is indexed in that language. Additionally, the xml:lang value is inherited by all of the descendants of that element, until another xml:lang value is encountered.

Full-text search queries (queries that use cts:search or cts:contains) are language-aware; that is, they search for text, tokenize the search terms, and stem (if enabled) in a particular language. This section describes how queries are language-aware and describes their behavior. It includes the following topics:
Tokenization and stemming are both language-specific; that is, a string can be tokenized and stemmed differently in different languages. By default, a query uses the default language of the database. You can also specify a language when constructing a query. For more details, see Tokenization and Stemming.
For XML nodes constructed in XQuery, any xml:lang attributes are treated the same way as if the document were loaded into a database. For details, see xml:lang Attribute.
Constructed JSON nodes use the default language configured for the database.
All searches in MarkLogic Server are language-aware. You can specify a language when constructing a query. For example, most cts:query constructors accept a language option. If the language is not explicitly specified, MarkLogic uses the default language configured for the database. For details on the cts:query constructors, see Composing cts:query Expressions.
The language governing a query determines how to tokenize the search terms, regardless of whether stemmed search is enabled. If stemmed search is enabled, the language is also used to derive stems. Unstemmed searches use the unstemmed (word searches) indexes, which are language independent.
An unstemmed search matches terms that are exactly like the search term; it does not take into consideration the stem of the word. Unstemmed searches match terms in a language independent way, but tokenize the search according to the specified language. Therefore, when you specify a language in an unstemmed query, the language applies only to tokenization; the unstemmed query will match any text in any language that matches the query.
Note the following characteristics of unstemmed searches:
- Unstemmed searches require word search indexes; otherwise they throw an exception. However, you can perform unstemmed searches without word search indexes using cts:contains. To perform unstemmed searches without the word search indexes enabled, use a let to bind the results of a stemmed search to a variable, and then filter the results using cts:contains with an unstemmed query.

  The following example binds the stemmed search results to a variable, then iterates over the results, filtering out all but the unstemmed results in the where clause (using cts:contains with a cts:query that specifies the unstemmed option).
    let $search := cts:search(doc(),
      cts:word-query("my words", ("stemmed", "lang=en")))
    for $x in $search
    where cts:contains($x, cts:word-query("my words", "unstemmed"))
    return $x
While it is likely that everything returned by this search will have an English match to the cts:query, it is not guaranteed that everything returned is in English. It is possible for a document to contain words in another language that do not match the language-specific query, but do match the unstemmed query (if the document contains text in multiple languages, and if it has my words in some other language than the one specified in the stemmed cts:query).
- The word search indexes are language-agnostic.
- Unstemmed searches use the lang=<language> query constructor option to determine the language for tokenization.
- An unstemmed search matches terms in any language (regardless of the lang=<language> option). The language only affects how the search terms are tokenized. For example, the following unstemmed search returns true:

      (: returns true :)
      let $x := <el xml:lang="fr">chat</el>
      return cts:contains($x, cts:word-query("chat", ("unstemmed", "lang=en")))
whereas the following stemmed search returns false:
    (: returns false :)
    let $x := <el xml:lang="fr">chat</el>
    return cts:contains($x, cts:word-query("chat", ("stemmed", "lang=en")))
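One way to reason about the difference between the two results is a deliberately simplified model in which stemmed terms are keyed by language and unstemmed terms are not. The sketch below is an assumption made for illustration, not a description of MarkLogic's actual index structures: `indexElement` and `matches` are hypothetical names, every word stems to itself, and script-based exceptions (such as Latin text inside a Chinese element) are ignored.

```javascript
// Toy model only. Stemmed terms carry the element language; word terms do not.
function indexElement(lang, text) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  return {
    // Each word stems to itself in this sketch.
    stemmed: new Set(words.map((w) => `${lang}:${w}`)),
    words: new Set(words),
  };
}

// A stemmed query must agree on language; an unstemmed query only needs the word.
function matches(index, word, { stemmed, lang }) {
  const w = word.toLowerCase();
  return stemmed ? index.stemmed.has(`${lang}:${w}`) : index.words.has(w);
}

const frenchChat = indexElement('fr', 'chat');
const unstemmedHit = matches(frenchChat, 'chat', { stemmed: false, lang: 'en' });
const stemmedHit = matches(frenchChat, 'chat', { stemmed: true, lang: 'en' });
console.log(unstemmedHit, stemmedHit);
```

Under this model the unstemmed English query finds the French element while the stemmed English query does not, which mirrors the two cts:contains results above.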
If the language specified in a search is not one of the languages in which language-specific stemming and tokenization are supported, or if it is a language for which you do not have a license key, then it is treated as a generic language. Typically, generic languages with Latin script are tokenized the same way as English, with token breaks at whitespace and punctuation, and with each word stemming to itself, but this is not always the case (especially for languages supported by MarkLogic Server--see Supported Languages--but for which you are not licensed). For details, see Generic Language Support.
You can implement a custom lexer (for tokenization) and stemmer if the default behavior for an unsupported language does not meet the needs of your application. For details, see User-Defined Lexer Plugins and Using a User-Defined Stemmer Plugin.
This section lists languages with advanced stemming and tokenization support in MarkLogic Server. All of the languages except English require a license key with support for the language. If your license key does not include support for a given language, the language is treated as a generic language (see Generic Language Support). The following are the supported languages:
For a list of base collations and character sets used with each language, see Collations and Character Sets By Language.
You can load documents in any language into MarkLogic Server and query them, as long as you can convert the character encoding to UTF-8. If the language is not one of the languages with advanced support, or if the language is one for which you are not licensed, then the tokenization is performed in a generic way (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters), and each term stems to itself.
For example, if you load the following document:
    <doc xml:lang="cz">
      <a>Some text in any language here.</a>
    </doc>
then that document is loaded as the language cz, and a stemmed search in any other language would not match. Therefore, the following does not match the document:
    (: does not match because it was stemmed as "cz" :)
    cts:search(doc(), cts:word-query("language", ("stemmed", "lang=en")))
The following search does match the document because it uses the same language:
    (: does match because the query specifies "cz" as the language :)
    cts:search(doc(), cts:word-query("language", ("stemmed", "lang=cz")))
Languages for which MarkLogic does not provide advanced language support (that is, languages not listed in Supported Languages) are all treated as the same language for stemmed searches. Therefore, a stemmed search that matches a document in one language without advanced language support will also match a document in another language without advanced language support.
Generic language support enables you to query documents in any language, regardless of which languages you are licensed for or which languages have advanced support. Because the generic language support only stems words to themselves, queries in these languages will not include variations of words based on their meanings in the results.
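The statement that all languages without advanced support behave as one language for stemmed searches can be pictured with a small sketch. The function names and the `advanced` set are hypothetical; the set stands in for whichever languages have advanced support and are licensed in a given installation.

```javascript
// Toy model: languages without advanced support (or without a license)
// collapse into one shared "generic" bucket for stemmed matching.
function stemmedLanguageKey(lang, advancedLanguages) {
  return advancedLanguages.has(lang) ? lang : 'generic';
}

function stemmedLanguagesMatch(docLang, queryLang, advancedLanguages) {
  return stemmedLanguageKey(docLang, advancedLanguages) ===
         stemmedLanguageKey(queryLang, advancedLanguages);
}

// Assumed installation where only English and French have advanced support.
const advanced = new Set(['en', 'fr']);
const sameGeneric = stemmedLanguagesMatch('cz', 'xx', advanced);
const crossAdvanced = stemmedLanguagesMatch('cz', 'en', advanced);
console.log(sameGeneric, crossAdvanced);
```

Here a stemmed query in the made-up code xx lines up with a document labeled cz because both fall into the generic bucket, while a stemmed English query does not.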
If you need more than the generic language support for some unsupported language, you can create a custom lexer and/or stemmer plugin to enable language-specific handling. For details, see Stemming and Tokenization Customization.
This section summarizes the features available to you for customizing the stemming and tokenization processes. You can use these features separately or together.
With no customizations, each language has a default lexer and default tokenization dictionary associated with it. The default lexer is one of the built-in lexers described in Built-in Lexer Plugin Reference and varies by language.
You can use the following tools to customize tokenization. You can use these features singly or in combination.
marklogic::LexerUDF native C++ interfaces. You associate a custom lexer with a specific language. For details, see User-Defined Lexer Plugins and Configuring Tokenization and Stemming Plugins.

You can use tokenization customizations in conjunction with stemming customizations. For details, see Stemming Customization.
Tokenization is a trusted operation. You should be selective about which users can register user-defined lexer plugins and customize language configurations.
With no customizations, each language has a base stemmer and stemming dictionary associated with it. The default stemmer is one of the built-in stemmer plugins that come with MarkLogic, and varies by language. For details, see Built-in Stemmer Plugin Reference.
You can use the following tools to customize stemming. You can use these customizations singly or in combination.
marklogic::StemmerUDF native C++ interfaces. You associate a custom stemmer with a specific language. For details, see Using a User-Defined Stemmer Plugin and Configuring Tokenization and Stemming Plugins.

You can use stemming customizations in conjunction with tokenization customizations. For details, see Tokenization Customization.
Stemming is a trusted operation. You should be selective about which users can register user-defined stemming plugins and customize language configurations.
One way you can affect the results of tokenization and stemming is to configure a custom lexer or stemmer plugin for a language. Your customization can use either a built-in or user-defined plugin.
This section provides an overview of how to configure a custom lexer or stemmer for a language using the Custom Language Management library module. The following topics are covered:
For more information on creating user-defined lexer and stemmer plugins, see the following topics:
Lexer and stemmer plugin configuration is done through the custom language management library module. The module includes the following functions. For more details, see the XQuery/XSLT Function Reference or the MarkLogic Server-Side JavaScript Function Reference.
Function | Description |
---|---|
clang:language-config-read (XQuery), clang.languageConfigRead (JavaScript) | Read the current custom language configuration. You should always begin your configuration changes by calling this function. |
clang:language-config-write (XQuery), clang.languageConfigWrite (JavaScript) | Commit custom language configuration changes. Your changes will not take effect unless you call this function. Note: Calling this function restarts MarkLogic. |
clang:language-config-delete (XQuery), clang.languageConfigDelete (JavaScript) | Remove all custom language configuration from your MarkLogic installation. Note: Calling this function restarts MarkLogic. |
clang:update-user-language (XQuery), clang.updateUserLanguage (JavaScript) | Modify a language config element to add/replace configuration for a specific language. Your change will not take effect until you call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript). |
clang:delete-user-language (XQuery), clang.deleteUserLanguage (JavaScript) | Modify a language config element to remove configuration for a specific language. Your change will not take effect until you call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript). |
clang:user-language (XQuery), clang.userLanguage (JavaScript) | Construct a custom language-to-plugin binding that can be used to update the custom language configuration. This is the unit of change for clang:update-user-language and clang.updateUserLanguage. |
clang:user-language-plugin (XQuery), clang.userLanguagePlugin (JavaScript) | Construct a custom lexer/stemmer plugin reference that can be used to update the configuration for a language. |
clang:lexer (XQuery), clang.lexer (JavaScript) | Construct a reference to a lexer capability in a native plugin. Use the output of this function as input to clang:user-language-plugin or clang.userLanguagePlugin. |
clang:stemmer (XQuery), clang.stemmer (JavaScript) | Construct a reference to a stemmer capability in a native plugin. Use the output of this function as input to clang:user-language-plugin or clang.userLanguagePlugin. |
This section describes how to construct a custom lexer or stemmer configuration item based on one of the built-in lexers or stemmers, rather than on a user-defined plugin.
Setting the library argument of clang:user-language-plugin or clang.userLanguagePlugin to an empty string tells MarkLogic you are referencing a built-in plugin. For example, the following call constructs a stemmer configuration item based on the built-in Snowball stemmer. Notice that the first parameter (library) is an empty string.
XQuery:

    clang:user-language-plugin("", (), clang:stemmer("snowball"))

JavaScript:

    clang.userLanguagePlugin('', null, clang.stemmer('snowball'))
The first argument of the stemmer constructor should be one of the built-in stemmer names from Built-in Stemmer Plugin Reference. You can configure a custom lexer at the same time by including a clang:lexer or clang.lexer configuration item as the 2nd parameter.
If you associate a custom lexer dictionary with a language, you must reinstall it if you change the lexer plugin for the language. Similarly, if you associate a custom stemming dictionary with a language, you must reinstall it if you change the stemmer plugin for the language.
The following example creates a configuration item for German. The default lexer for German is ICU. The default stemmer for German is Bitext. The new configuration specifies Snowball as the custom stemmer and leaves the default lexer unchanged. In addition, the Snowball stemmer is configured to use the german2 stemming algorithm.
Note that this example doesn't actually change the language configuration because it does not call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript).
If you run the example in Query Console, you should see output similar to the following:
    <lang:user-languages xml:lang="zxx"
        xmlns:lang="http://marklogic.com/xdmp/language">
      <lang:user-language>
        <lang:name>de</lang:name>
        <lang:plugin>
          <lang:library/>
          <lang:stemmer>
            <lang:variant>snowball</lang:variant>
            <lang:arg>code=german2</lang:arg>
          </lang:stemmer>
        </lang:plugin>
      </lang:user-language>
    </lang:user-languages>
This section describes how to construct a custom lexer or stemmer configuration item based on a user-defined plugin, rather than on one of the built-in plugins. A user-defined lexer or stemmer must be installed as a native plugin before you can use it.
When you construct a lexer (or stemmer) configuration item for a user-defined plugin, you must identify the native plugin and the capability from the plugin library that exposes the LexerUDF or StemmerUDF implementation.
For a lexer, set the variant argument of clang:lexer or clang.lexer to a LexerUDF capability registered by the plugin. For a stemmer, set the variant argument of clang:stemmer or clang.stemmer to a StemmerUDF capability registered by the plugin. For both, set the library argument to plugin_path/plugin_id.
For example, if you install a plugin with the path native and plugin id sampleplugin, and the lexer UDF capability registered by the plugin is named sample_lexer, then you'd construct a lexer configuration item for it as follows:
XQuery:

    clang:lexer("sample_lexer", (), (), "native/sampleplugin")

JavaScript:

    clang.lexer('sample_lexer', null, null, 'native/sampleplugin')
If you configure both a stemmer and lexer from the same native plugin, you can set the plugin library reference (native/sampleplugin) in clang:user-language-plugin or clang.userLanguagePlugin instead. For example:
XQuery:

    clang:user-language-plugin(
      "native/sampleplugin",
      clang:lexer("sample_lexer"),
      clang:stemmer("sample_stemmer"))

JavaScript:

    clang.userLanguagePlugin(
      'native/sampleplugin',
      clang.lexer('sample_lexer'),
      clang.stemmer('sample_stemmer'));
When a library is specified in both the lexer/stemmer constructor and the language plugin constructor, the library in the lexer/stemmer takes precedence.
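The precedence rule can be written down as a tiny standalone function. This is only a sketch of the rule as stated here, not the library's implementation: `resolveLibrary` is a hypothetical name, and it assumes that an empty or missing library at both levels means a built-in plugin.

```javascript
// Sketch of the stated precedence: a library given on the lexer/stemmer
// reference wins over the one given on the language plugin.
function resolveLibrary(pluginLibrary, componentLibrary) {
  const chosen = componentLibrary ? componentLibrary : pluginLibrary;
  // Assumption: no library at either level means a built-in plugin.
  if (chosen === undefined || chosen === null || chosen === '') {
    return { builtIn: true, library: null };
  }
  return { builtIn: false, library: chosen };
}

console.log(resolveLibrary('native/sampleplugin', null));
console.log(resolveLibrary('native/sampleplugin', 'native/otherplugin'));
console.log(resolveLibrary('', null));
```

The plugin name native/otherplugin is made up for the example; native/sampleplugin is the one used in this section.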
If you associate a custom lexer dictionary with a language, you must reinstall it if you change the lexer plugin for the language. Similarly, if you associate a custom stemming dictionary with a language, you must reinstall it if you change the stemmer plugin for the language.
The following example creates a configuration item for German. The default lexer for German is ICU. The default stemmer for German is Bitext. The new configuration specifies a user-defined lexer named sample_lexer as the custom lexer and leaves the default stemmer unchanged. Assume the plugin configuration described above.
Note that this example doesn't actually change the language configuration because it does not call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript).
If you run the example in Query Console, you should see output similar to the following:
    <lang:user-languages xml:lang="zxx"
        xmlns:lang="http://marklogic.com/xdmp/language">
      <lang:user-language>
        <lang:name>de</lang:name>
        <lang:plugin>
          <lang:library/>
          <lang:lexer>
            <lang:library>native/sampleplugin</lang:library>
            <lang:variant>sample_lexer</lang:variant>
          </lang:lexer>
        </lang:plugin>
      </lang:user-language>
    </lang:user-languages>
Use the clang:user-language-plugin XQuery function or the clang.userLanguagePlugin Server-Side JavaScript function to define a binding between a language and custom tokenization and stemming plugins. For more details, see Customization Using a Built-In Lexer or Stemmer and Customization Using a User-Defined Lexer or Stemmer.
To put the configuration change into effect, use the following pattern. A complete example follows.
This operation is an overwrite: Any previous configuration for the language will be replaced. Thus, if you are going to configure both a lexer and a stemmer for a language, do it in a single call to clang:update-user-language or clang.updateUserLanguage.
Calling clang:language-config-write or clang.languageConfigWrite causes MarkLogic to restart.
The following example configures a custom stemmer and lexer for the Catalan language using a user-defined plugin. Assume the plugin registers a lexer named sample_lexer and a stemmer named sample_stemmer.
You can configure just a lexer or just a stemmer for a language by including just that reference when calling clang:user-language-plugin or clang.userLanguagePlugin. For example, the following code only configures a custom stemmer.
To configure a lexer and stemmer from different plugin libraries for the same language, specify the plugin path to the lexer and stemmer reference constructors. For example, the following code configures a lexer and a stemmer from two different plugins:
Use the clang:delete-user-language XQuery function or the clang.deleteUserLanguage JavaScript function to remove the custom configuration for a specific language. You must call clang:language-config-write (XQuery) or clang.languageConfigWrite (JavaScript) for your change to take effect, and doing so will cause MarkLogic to restart.
The following example removes the configuration for the Catalan language (language code ca).
To remove custom stemmer and lexer bindings for all languages, use the clang:language-config-delete XQuery function or the clang.languageConfigDelete Server-Side JavaScript function.
Calling these functions restarts MarkLogic.
The following example code removes all language customizations and restarts the server.
You can use delegation to control whether stemming consults the default stemmer in addition to the custom plugin for a language. Delegation can be controlled at two levels:
- delegation option, and the StemmerUDF interface for user-defined plugins has a delegate method.
- The language plugin configuration: clang:user-language-plugin and clang.userLanguagePlugin accept a delegate parameter. When set to true (the default), the stemming process asks the plugin whether or not to delegate.

If delegation is enabled at the language plugin configuration level, then the stemming process will consult the custom plugin about whether or not to delegate. For example, it will call the delegate method on StemmerUDF. If delegation is disabled at the language plugin configuration level, then the stemming process will not consult the custom plugin and will never delegate to the default stemmer.
Delegation has the following effect on the stemming results. The first column indicates whether or not the delegate parameter of clang:user-language-plugin or clang.userLanguagePlugin is set to true. The second column indicates whether or not the custom plugin agrees to delegate; for example, whether StemmerUDF::delegate returns true.
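Lang Config Delegate | Plugin Says to Delegate | Result |
---|---|---|
true | true | The plugin is consulted and agrees, so stemming delegates to the default stemmer. |
true | false | The plugin is consulted and declines, so stemming does not delegate. |
false | N/A | The plugin is not consulted, and stemming never delegates to the default stemmer. |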
The following table contains examples of the stemming result with various delegation and stem count combinations. The with Delegation column signifies the plugin was consulted and agreed to delegation. The without Delegation column signifies either the plugin was not consulted or the plugin did not agree to delegation.
A custom plugin determines its own delegation policy. For example, a plugin might choose among policies such as always (delegate regardless of the number of stems found), never (never delegate), or on empty (delegate only if the plugin found no stems).
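The two-level decision can be sketched as plain functions. This is an illustration under stated assumptions rather than the StemmerUDF interface: the policy names 'always', 'never', and 'on-empty' are labels borrowed from the prose above, and both function names are hypothetical. The sketch only decides whether the default stemmer is consulted; it does not model how stems are combined.

```javascript
// Plugin-level policy: does the plugin agree to delegate, given the stems it found?
function pluginAgreesToDelegate(policy, pluginStems) {
  switch (policy) {
    case 'always': return true;
    case 'never': return false;
    case 'on-empty': return pluginStems.length === 0;
    default: throw new Error(`unknown policy: ${policy}`);
  }
}

// Language-configuration level: the plugin is only asked when the
// configuration's delegate setting is true.
function consultsDefaultStemmer(configDelegate, policy, pluginStems) {
  if (!configDelegate) return false; // plugin is never consulted
  return pluginAgreesToDelegate(policy, pluginStems);
}

console.log(consultsDefaultStemmer(true, 'on-empty', []));
console.log(consultsDefaultStemmer(true, 'on-empty', ['chat']));
console.log(consultsDefaultStemmer(false, 'always', []));
```

Keeping the configuration check ahead of the plugin check mirrors the order described above: a false configuration setting short-circuits the decision regardless of the plugin's policy.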
When you configure a language to use a custom user-defined stemmer or lexer, and also associate a custom dictionary with the language, then you can create special security privileges to enable finer control over who can administer the dictionary.
A custom dictionary is associated with both a language and a specific stemmer or lexer plugin. The lexer or stemmer is implicit in the configuration of the language. Usually, any user with the custom-dictionary-admin role or equivalent privileges can add, update, or delete a custom dictionary for any language-stemmer or language-lexer configuration.
You can create privileges of the following form to make it possible to control dictionary management on a per stemmer/lexer basis.
    http://marklogic.com/xdmp/privileges/custom-dictionary-admin/library
    http://marklogic.com/xdmp/privileges/xdmp-write-cluster-config-file/library
Where library is of the form plugin_path/plugin_id and identifies a user-defined lexer or stemmer plugin.
For example, if you install a user-defined lexer plugin with the plugin path native and the plugin id sampleplugin, then you would create privileges of the following form:
    http://marklogic.com/xdmp/privileges/custom-dictionary-admin/native/sampleplugin
    http://marklogic.com/xdmp/privileges/xdmp-write-cluster-config-file/native/sampleplugin
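Because both privilege names follow the same pattern, they can be derived from the plugin path and id. The helper below is a hypothetical convenience for building the strings only; it does not create the privileges in MarkLogic.

```javascript
// The two privilege prefixes given in the text.
const PRIVILEGE_BASES = [
  'http://marklogic.com/xdmp/privileges/custom-dictionary-admin',
  'http://marklogic.com/xdmp/privileges/xdmp-write-cluster-config-file',
];

// Build the per-plugin privilege URIs, where library = plugin_path/plugin_id.
function dictionaryPrivilegeUris(pluginPath, pluginId) {
  const library = `${pluginPath}/${pluginId}`;
  return PRIVILEGE_BASES.map((base) => `${base}/${library}`);
}

console.log(dictionaryPrivilegeUris('native', 'sampleplugin').join('\n'));
```

Calling it with native and sampleplugin yields the two URIs listed above.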
MarkLogic will not create these privileges for you, but it will check for and enforce them if the privileges exist.
The table below lists the built-in lexers (tokenizers), which languages each one is configured for by default, and what configuration options (if any) are available for customization. Use the configuration lexer name and configuration options when calling the clang:lexer XQuery function or the clang.lexer JavaScript function; for details, see Customization Using a Built-In Lexer or Stemmer.
Lexer Name | Description |
---|---|
simple lexer | The default lexer for English, Norwegian, and languages without advanced support. You cannot specify this lexer as a custom plugin, and it has no configuration options. |
icu | Default tokenizer for most licensed languages. Users might want to switch to this tokenizer for English to pick up better apostrophe and contraction handling, or for languages without advanced support. This lexer accepts no extra arguments. |
kytea | Default tokenizer for Chinese. With the appropriate language model, this tokenizer could be used for other languages. You can customize the behavior of this lexer using the following argument. model_filename: Required. The name of a model file in the MARKLOGIC_DIR. KyTea offers Japanese models and alternative models for Chinese at http://www.phontron.com/kytea/model.html. |
atilika | Default tokenizer for Japanese. You can customize the behavior of this lexer using the following arguments. search-mode, normal-mode: Specify the handling of compound words: search-mode breaks up compound words, while normal-mode does not. Default: search-mode. |
The tables below list the built-in stemmers. Use the stemmer name and options when constructing a stemmer configuration item using the clang:stemmer XQuery function or the clang.stemmer JavaScript function.
MarkLogic uses the following built-in stemmers by default:
See the following topics for configuration options.
The Bitext stemmer supports the following options that you can specify in the args parameter of clang:stemmer or clang.stemmer.
The Snowball stemmer supports the following options that you can specify in the args parameter of clang:stemmer or clang.stemmer.
The Atilika stemmer supports the following options that you can specify in the arguments parameter of clang:stemmer or clang.stemmer.