In addition to the language support described in Language Support in MarkLogic Server, MarkLogic Server also supports many character encodings and has the ability to sort the content in a variety of collations. This chapter describes the MarkLogic Server support of encodings and collations, and includes the following sections:
MarkLogic Server stores all content in the UTF-8 encoding. If you try to load non-UTF-8 content into MarkLogic Server without translating it to UTF-8, the server throws an exception. If you have non-UTF-8 content, you can specify its encoding during ingestion, and MarkLogic Server will translate it to UTF-8. To specify an encoding, use the encoding option to xdmp:document-load, xdmp:document-get, and xdmp:zip-get. This option tells MarkLogic Server which encoding your content is in, and MarkLogic Server attempts to translate the content from that encoding to UTF-8. If the content cannot be translated, MarkLogic Server throws an exception indicating that there is non-UTF-8 content.
The encoding option is available to xdmp:document-load, xdmp:document-get, and xdmp:zip-get.
For details on the syntax of the encoding option, see the MarkLogic XQuery and XSLT Function Reference.
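The translation step can be illustrated outside MarkLogic Server. The following is a minimal Python sketch, not MarkLogic code: the function name to_utf8 and the sample bytes are hypothetical. It only mirrors the idea that content is decoded using a declared encoding and stored as UTF-8, and that non-UTF-8 bytes with no declared encoding are rejected.

```python
def to_utf8(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode raw bytes using the declared encoding and re-encode as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    return raw.decode(encoding).encode("utf-8")


# "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
latin1_bytes = "café".encode("iso-8859-1")

# With the encoding declared, the content is translated to UTF-8.
utf8_bytes = to_utf8(latin1_bytes, encoding="iso-8859-1")

# Without it, the bytes are treated as UTF-8 and rejected.
try:
    to_utf8(latin1_bytes)
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

In this sketch the exception plays the role of the error the server raises when content cannot be translated. The actual server-side behavior and option syntax are described in the function reference.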
A collation specifies the order for sorting strings. The collation settings determine the order for operations where the order is specified (either implicitly or explicitly) and for operations that use Range Indexes. Examples of operations that specify the order are XQuery statements with an order by clause, XQuery standard functions that compare order (for example, fn:compare, fn:substring-after, fn:substring-before, and so on), and lexicon functions (for example, cts:words, cts:element-word-match, cts:element-values, and so on). Additionally, collations determine uniqueness in string comparisons, so two strings that are equal according to one collation might not be equal according to another.
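To see how equality can depend on the comparison rules, consider the following Python sketch. It is an illustration only and does not use MarkLogic collations: the helper names are hypothetical, and stripping combining marks plus case folding is a rough stand-in for a case- and diacritic-insensitive comparison.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_sensitive(a: str, b: str) -> bool:
    """Case- and diacritic-sensitive comparison on normalized text."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def equal_insensitive(a: str, b: str) -> bool:
    """Rough case- and diacritic-insensitive comparison."""
    return strip_diacritics(a).casefold() == strip_diacritics(b).casefold()


# The same pair of strings is unequal under one rule set and equal under the other.
sensitive_result = equal_sensitive("Résumé", "resume")
insensitive_result = equal_insensitive("Résumé", "resume")
```

A function such as a distinct-values operation would therefore return a different number of items depending on which of the two comparisons is in effect.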
The codepoint-order collation sorts according to the Unicode codepoint order, which does not take into account any language-specific information. Other collations are often used to specify language-specific sorting differences. For example, a codepoint sort puts all uppercase letters before lowercase letters, so the word Zounds sorts before the word abracadabra. If you use a collation that sorts uppercase and lowercase letters together (for example, the order A a B b C c, and so on), then abracadabra sorts before Zounds.
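The difference between the two orderings can be reproduced in a few lines of Python. This is only a sketch of the idea, not MarkLogic code: Python's default string comparison happens to be codepoint order, and the case-folding sort key is an assumed stand-in for a collation that sorts uppercase and lowercase letters together.

```python
words = ["abracadabra", "Zounds", "banana", "Apple"]

# Python compares strings by code point, so uppercase sorts before lowercase.
codepoint_order = sorted(words)

# Stand-in for a collation that sorts upper and lowercase together:
# compare case-folded text first, and break ties with the original string.
mixed_case_order = sorted(words, key=lambda w: (w.casefold(), w))

print(codepoint_order)   # ['Apple', 'Zounds', 'abracadabra', 'banana']
print(mixed_case_order)  # ['abracadabra', 'Apple', 'banana', 'Zounds']
```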
Collations are specified with a URI (for example, http://marklogic.com/collation/). The collation URIs are specific to MarkLogic Server, but they specify collations according to the Unicode collation standards. There are many variations to collations, and many sort orders that are based on preferences and traditions in various languages. The following section describes the syntax of collation URIs. Although a huge number of collation URIs are possible, most applications use only a small number of collations. For more information about collations, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.
The first one is the UCA Root Collation (see UCA Root Collation), and is the system default. The second is the codepoint order collation, and was the default in pre-3.2 releases of MarkLogic Server.
Some languages (for example, German and Chinese) have multiple collations you can specify in the locale. To specify one of these language-specific collation variants, use the @collation=<value> portion of the syntax.
If you do not specify a locale in the collation URI, the UCA Root Collation is used by default (for details, see UCA Root Collation).
While you can specify many valid language, script, or region codes, MarkLogic Server only fully supports those that are relevant to and most commonly used with the supported languages. For a list of supported languages along with their common collations, see Collations and Character Sets By Language.
|English language with United States region|
|German language with the |
There can be zero or more <attribute> portions of the collation URI. Attributes further specify characteristics such as which collation to use, whether to be case sensitive or case insensitive, and so on. You only need to specify attributes if they differ from the defaults for the specified locale. Attributes have the following syntax:
<attribute> ::= <strength> | <case-level> | <case-first> | <alternate> | <numeric-collation> | <variable-top> | <normalization-checking> | <french> | <hiragana>
The following table describes the various attributes. For simplicity, terms like case-sensitive, diacritic-sensitive, and others are used. In actuality, the definitions of these terms for use in collations are somewhat more complicated. For the exact technical meaning of each attribute, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.
|The level of comparison to use.||Specifies case and diacritic insensitive.|
|Specifies diacritic sensitive and case insensitive.|
|Specifies case and diacritic sensitive.|
|Specifies punctuation sensitive.|
|Specifies identity (codepoint differentiated).|
|Specifies enable case-level.|
|Specifies disable case-level.|
|Specifies that uppercase sorts first.|
|Specifies that lowercase sorts first.|
|Specifies that all characters are non-ignorable; that is, include all spaces and punctuation characters when sorting characters.|
|Specifies that variable characters are shifted (ignored) according to the |
|Specifies numeric ordering.|
|Specifies non-numeric ordering (order according to the collation).|
|Specifies that all variable characters (typically whitespace and punctuation) are ignored for sorting variable characters.|
|Specifies that whitespace is ignorable when sorting characters. For example, |
|Specifies that most punctuation and space characters are ignorable when sorting characters. Specifically, characters whose sort key is less than or equal to |
|Specifies normalize Unicode.|
|Specifies do not normalize Unicode.|
|Specifies French accent ordering.|
|Specifies normal ordering (according to the collation).|
|Hiragana mode on.|
|Hiragana mode off.|
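As an illustration of what numeric ordering means in the table above, the following Python sketch compares a plain codepoint sort with a sort that treats runs of digits as numbers. It is an assumption-laden approximation rather than the server's algorithm: the function name numeric_key and the sample values are hypothetical.

```python
import re


def numeric_key(text: str) -> list:
    """Build a sort key in which each run of digits compares as an integer."""
    # re.split with a capturing group alternates: text, digits, text, ...
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


items = ["item10", "item9", "item2"]

# Non-numeric ordering: digits compare character by character.
non_numeric_order = sorted(items)

# Numeric ordering: 2 < 9 < 10.
numeric_order = sorted(items, key=numeric_key)
```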
Range Indexes and lexicons that were created in MarkLogic Server 3.1 use the Unicode codepoint collation order. If you want any of these indexes or lexicons to use a different collation, you must change the collation, re-create the index, and then reindex the database (if reindex enable is set to true, reindexing begins automatically).
The Unicode collation algorithm (UCA) root collation in MarkLogic Server is used when no default exists. It uses the Unicode codepoint collation with S3 (case and diacritic sensitive) strength, and it has the following URI:
The UCA root collation adds case and diacritic sensitivity to the Unicode codepoint order, so it produces more sensible sort orders when case and diacritics matter. For more details about the UCA, see http://www.unicode.org/unicode/reports/tr10/.
The collation used for requests in MarkLogic Server is based on the settings of various parameters in the Admin Interface and on what is specified in your XQuery code. Each App Server has a default collation specified, and that is used in the absence of anything else that overrides it. Note the following about collations and their defaults.
The xdmp:collation-canonical-uri built-in XQuery function returns the canonical URI of any valid collation URI.
declare default collation expression in the prolog), but it will default to the context from the calling module.
order by clause of a FLWOR expression.
fn:deep-equal, fn:distinct-values, fn:index-of, fn:max, fn:min).
|Language||Base Collations||Character Sets|
|alternate German collation|
|Treats ll and ch as distinct characters|
|Chinese (Simplified and Traditional)||case/diacritic sensitive|
|Persian (Farsi)||case/diacritic sensitive|
|Hiragana mode off|
|Norwegian (Nynorsk and Bokmål)||case/diacritic sensitive|
All of the languages except English require a license key to enable. If you do not have the license key for one of the supported languages, it is treated as a generic language, and each word is stemmed to itself and it is tokenized in a generic way (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters). For more information, see Generic Language Support. The language-specific collations are available to all languages, regardless of what languages are enabled in the license key.