In addition to the language support described in Language Support in MarkLogic Server, MarkLogic Server also supports many character encodings and has the ability to sort the content in a variety of collations. This chapter describes the MarkLogic Server support of encodings and collations, and includes the following sections:
MarkLogic Server stores all content in the UTF-8 encoding. If you try to load non-UTF-8 content into MarkLogic Server without translating it to UTF-8, the server throws an exception. If you have non-UTF-8 content, then you can specify the encoding for the content during ingestion, and MarkLogic Server will translate it to UTF-8. If the content cannot be translated, MarkLogic Server throws an exception indicating that there is non-UTF-8 content.
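The translate-or-reject behavior described above can be illustrated outside MarkLogic. The following is a minimal sketch in plain JavaScript (Node.js), not MarkLogic's own implementation: `toUtf8String` is a hypothetical helper, and the standard `TextDecoder` and `Buffer` APIs stand in for the server's conversion step.

```javascript
// Hypothetical helper: decode raw bytes using a declared encoding.
// With 'utf-8' it rejects malformed byte sequences instead of guessing.
function toUtf8String(bytes, encoding = 'utf-8') {
  const name = encoding.toLowerCase();
  if (name === 'utf-8') {
    // fatal: true makes the decoder throw a TypeError on malformed UTF-8.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  }
  if (name === 'iso-8859-1') {
    // Buffer's 'latin1' maps each byte to the code point of the same value.
    return Buffer.from(bytes).toString('latin1');
  }
  throw new Error(`Unsupported encoding in this sketch: ${encoding}`);
}

// "café" encoded as ISO-8859-1: the lone byte 0xE9 is not valid UTF-8.
const latin1Bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xe9]);

let rejected = false;
try {
  toUtf8String(latin1Bytes); // no encoding given, so treated as UTF-8
} catch (err) {
  rejected = true; // the malformed input is refused
}

// Declaring the real encoding lets the bytes be translated correctly.
const translated = toUtf8String(latin1Bytes, 'iso-8859-1');
console.log(rejected, translated);
```

The point of the sketch is the contrast: the same bytes are refused when treated as UTF-8 and translated cleanly when the actual encoding is declared.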
You can specify an explicit encoding in the following ways:
- The charset parameter of the Content-type header.
- The encoding option of the functions listed in the following table.

XQuery | JavaScript |
---|---|
xdmp:document-load | xdmp.documentLoad |
xdmp:document-get | xdmp.documentGet |
xdmp:zip-get | xdmp.zipGet |
xdmp:gunzip | xdmp.gunzip |
xdmp:xslt-invoke | xdmp.xsltInvoke |
Encoding is determined using the following precedence, from highest to lowest:
If you set the encoding option to auto, then MarkLogic tries to determine the encoding from the document content.
If the encoding is UTF-8 and any non-UTF-8 characters are found, an exception is thrown indicating the content contains non-UTF-8 characters.
MarkLogic Server assumes the character set you specify is actually the character set of the content. If you specify an encoding that is different from the actual content encoding, the result can be unpredictable: You might get an exception in some situations, but you might end up with the wrong characters in other situations.
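The "wrong characters" outcome can be reproduced with a few bytes. This is a minimal sketch in plain JavaScript (Node.js), assuming only the standard `Buffer` API; it does not call MarkLogic, and the variable names are illustrative.

```javascript
// "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
const utf8Bytes = Buffer.from('\u00e9', 'utf8');

// Correct label: the two bytes decode back to a single character.
const correct = utf8Bytes.toString('utf8');

// Wrong label: each byte is read as its own ISO-8859-1 character.
// No error is raised; the result is silently wrong.
const mislabeled = utf8Bytes.toString('latin1');

console.log(correct.length, mislabeled.length);
```

Because a mismatch like this does not always raise an error, it is worth confirming the source encoding before ingestion rather than relying on an exception to catch it.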
For details on the syntax of the encoding option, see the MarkLogic XQuery and XSLT Function Reference.
This section describes collations in MarkLogic Server. Collations specify the order in which strings are sorted and how they are compared. The section includes the following parts:
A collation specifies the order for sorting strings. The collation settings determine the order for operations where the order is specified (either implicitly or explicitly) and for operations that use Range Indexes. Examples of operations that specify the order are XQuery statements with an order by clause, XQuery standard functions that compare order (for example, fn:compare, fn:substring-after, fn:substring-before, and so on), and lexicon functions (for example, cts:words, cts:element-word-match, cts:element-values, and so on). Additionally, collations determine uniqueness in string comparisons, so two strings that are equal according to one collation might not be equal according to another.
The codepoint-order collation sorts according to the Unicode codepoint order, which does not take into account any language-specific information. There are other collations that are often used to specify language-specific sorting differences. For example, a code point sort puts all uppercase letters before lowercase letters, so the word Zounds sorts before the word abracadabra. If you use a collation that sorts uppercase and lowercase letters together (for example, the order A a B b C c, and so on), then abracadabra sorts before Zounds.
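The Zounds and abracadabra ordering can be checked with a short script. This is a sketch in plain JavaScript (Node.js), not MarkLogic code: the default `Array.prototype.sort` stands in for a code point sort, and `Intl.Collator('en')` stands in for a collation that sorts uppercase and lowercase letters together.

```javascript
const words = ['abracadabra', 'Zounds'];

// Default sort compares UTF-16 code units, which matches code point order
// for these ASCII strings: 'Z' (U+005A) comes before 'a' (U+0061).
const byCodePoint = [...words].sort();

// A language-aware comparison (here, English rules) interleaves the cases.
const english = new Intl.Collator('en');
const byCollation = [...words].sort(english.compare);

console.log(byCodePoint); // [ 'Zounds', 'abracadabra' ]
console.log(byCollation); // [ 'abracadabra', 'Zounds' ]
```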
Collations are specified with a URI (for example, http://marklogic.com/collation/). The collation URIs are specific to MarkLogic Server, but they specify collations according to the Unicode collation standards. There are many variations to collations, and many sort orders that are based on preferences and traditions in various languages. The following section describes the syntax of collation URIs. Although there are a huge number of collation URIs possible, most applications will use only a small number of collations. For more information about collations, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.
The following are two very common collation URIs used in MarkLogic Server:
http://marklogic.com/collation/
http://marklogic.com/collation/codepoint
The first one is the UCA Root Collation (see UCA Root Collation), and is the system default. The second is the codepoint order collation, and was the default in pre-3.2 releases of MarkLogic Server.
Collations in MarkLogic Server are specified by a URI. All collations begin with the string http://marklogic.com/collation/. The syntax for collations is as follows:
http://marklogic.com/collation/<locale>[/<attribute>]*
This section describes the following parts of the syntax:
The <locale> portion of the collation URI must be a valid locale, and is defined as follows:
<locale> ::= <language>[-<script>][_<region>][@(collation=<value>;)+]
For a list of valid language codes, see the following:
http://www.loc.gov/standards/iso639-2/php/code_list.php
For a list of valid script codes, see the following:
http://www.unicode.org/iso15924/iso15924-codes.html
For a list of valid region codes, see the following:
http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html
Some languages (for example, German and Chinese) have multiple collations you can specify in the locale. To specify one of these language-specific collation variants, use the @collation=<value> portion of the syntax.
If you do not specify a locale in the collation URI, the UCA Root Collation is used by default (for details, see UCA Root Collation).
While you can specify many valid language, script, or region codes, MarkLogic Server only fully supports those that are relevant to and most commonly used with the supported languages. For a list of supported languages along with their common collations, see Collations and Character Sets By Language.
The following table lists some typical locales, along with a brief description:
There can be zero or more <attribute> portions of the collation URI. Attributes further specify characteristics such as which collation to use, whether to be case sensitive or case insensitive, and so on. You only need to specify attributes if they differ from the defaults for the specified locale. Attributes have the following syntax:
<attribute> ::= <strength> | <case-level> | <case-first> | <alternate> | <numeric-collation> | <variable-top> | <normalization-checking> | <french> | <hiragana>
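The URI grammar above lends itself to a small string-building helper. The following is a minimal sketch in plain JavaScript: `buildCollationUri`, `COLLATION_BASE`, and `ATTRIBUTE_PATTERN` are hypothetical names, the attribute codes are the ones listed in this section's attribute table, and the helper checks only the shape of each attribute, not whether the server accepts the resulting collation.

```javascript
const COLLATION_BASE = 'http://marklogic.com/collation/';

// Attribute codes as listed in the attribute table of this section.
const ATTRIBUTE_PATTERN =
  /^(S[1-4I]|E[OX]|C[ULX]|A[NS]|M[OX]|T[0-9A-F]{4}|N[OX]|F[OX]|H[OX])$/;

// Hypothetical helper: join a locale and attribute codes into a URI,
// following http://marklogic.com/collation/<locale>[/<attribute>]*.
function buildCollationUri(locale = '', attributes = []) {
  for (const attr of attributes) {
    if (!ATTRIBUTE_PATTERN.test(attr)) {
      throw new Error(`Unrecognized collation attribute: ${attr}`);
    }
  }
  // An empty locale with attributes yields a double slash, which is what
  // the grammar produces when it is applied literally.
  return COLLATION_BASE + [locale, ...attributes].join('/');
}

console.log(buildCollationUri());                   // root collation URI
console.log(buildCollationUri('en', ['S1']));       // English, strength S1
console.log(buildCollationUri('fr', ['S2', 'FO'])); // French, S2, accent rule
```

Validating attribute codes up front is a design choice for the sketch: it surfaces a typo such as `S5` at build time rather than when the URI is first used.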
The following table describes the various attributes. For simplicity, terms like case-sensitive, diacritic-sensitive, and others are used. In actuality, the definitions of these terms for use in collations are somewhat more complicated. For the exact technical meaning of each attribute, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.
Attribute | Legal Values | Descriptions |
---|---|---|
<strength> The level of comparison to use. | S1 | Specifies case and diacritic insensitive. |
<strength> | S2 | Specifies diacritic sensitive and case insensitive. |
<strength> | S3 | Specifies case and diacritic sensitive. |
<strength> | S4 | Specifies punctuation sensitive. |
<strength> | SI | Specifies identity (codepoint differentiated). |
<case-level> Enable or disable the case sensitive level, skipping the diacritic sensitive level. So diacritic insensitive, case sensitive is | EO | Specifies enable case-level. |
<case-level> | EX | Specifies disable case-level. |
<case-first> Specifies whether uppercase sorts before or after lowercase. | CU | Specifies that uppercase sorts first. |
<case-first> | CL | Specifies that lowercase sorts first. |
<case-first> | CX | Off. |
<alternate> Specifies how to handle variable characters. (As completely ignorable or as normal characters.) | AN | Specifies that all characters are non-ignorable; that is, include all spaces and punctuation characters when sorting characters. |
<alternate> | AS | Specifies that variable characters are shifted (ignored) according to the variable-top setting. |
<numeric-collation> Order numbers as numbers rather than collation order (for example, 20 < 100). | MO | Specifies numeric ordering. |
<numeric-collation> | MX | Specifies non-numeric ordering (order according to the collation). |
<variable-top> Used with | T0000 | Specifies that all variable characters (typically whitespace and punctuation) are ignored for sorting variable characters. |
<variable-top> | T0020 | Specifies that whitespace is ignorable when sorting characters. For example, /T0020/AS means that period (a variable character) would be treated as a regular character but space would be ignorable. Therefore: A B = AB and AB < A.B. |
<variable-top> | T00BB | Specifies that most punctuation and space characters are ignorable when sorting characters. Specifically, characters whose sort key is less than or equal to 00BB are ignorable. |
<normalization-checking> Specifies whether to perform Unicode normalization on the input string. | NO | Specifies normalize Unicode. |
<normalization-checking> | NX | Specifies do not normalize Unicode. |
<french> Specifies whether to apply the French accent ordering rule (that is, to reverse the ordering at the | FO | Specifies French accent ordering. |
<french> | FX | Specifies normal ordering (according to the collation). |
<hiragana> Specifies whether to add an additional level to distinguish Hiragana from Katakana. | HO | Hiragana mode on. |
<hiragana> | HX | Hiragana mode off. |
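The effect of the strength and numeric-collation attributes can be approximated outside MarkLogic. This sketch uses the standard JavaScript `Intl.Collator` in Node.js as an analogy only; the mapping of its `sensitivity` values to S1, S2, and S3, and of `numeric` to MO and MX, is an assumption made for illustration, not a statement about how MarkLogic implements these attributes.

```javascript
// Rough analogues of the strength attribute (an assumption of this sketch):
// base ~ S1, accent ~ S2, variant ~ S3.
const s1 = new Intl.Collator('en', { sensitivity: 'base' });
const s2 = new Intl.Collator('en', { sensitivity: 'accent' });
const s3 = new Intl.Collator('en', { sensitivity: 'variant' });

// compare() returns 0 when the collator treats two strings as equal.
const s1IgnoresBoth = s1.compare('resume', 'R\u00e9sum\u00e9') === 0;
const s2IgnoresCase = s2.compare('resume', 'Resume') === 0;
const s2IgnoresAccents = s2.compare('resume', 'r\u00e9sum\u00e9') === 0;
const s3IgnoresCase = s3.compare('resume', 'Resume') === 0;

// Rough analogues of numeric collation: numeric: true ~ MO, default ~ MX.
const numeric = new Intl.Collator('en', { numeric: true });
const plain = new Intl.Collator('en');
const numericOrder = numeric.compare('20', '100') < 0; // 20 before 100
const textOrder = plain.compare('20', '100') > 0;      // "2" after "1"

console.log({ s1IgnoresBoth, s2IgnoresCase, s2IgnoresAccents, s3IgnoresCase });
console.log({ numericOrder, textOrder });
```

The same pair of strings is equal under one setting and distinct under another, which is why the collation in use affects uniqueness as well as sort order.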
Range Indexes and lexicons that were created in MarkLogic Server 3.1 use the Unicode codepoint collation order. If you want them to use a different collation in any of these indexes and/or lexicons, you must change the collation and re-create the index, and then reindex the database (if reindex enable is set to true, it will automatically begin reindexing).
The Unicode collation algorithm (UCA) root collation in MarkLogic Server is used when no default exists. It uses the Unicode codepoint collation with S3 (case and diacritic sensitive) strength, and it has the following URI:
http://marklogic.com/collation/
The UCA root collation adds more useful case and diacritic sensitivity to the Unicode codepoint order, so it will make more sensible sort orders when you take case sensitivity and diacritic sensitivity into consideration. For more details about the UCA, see http://www.unicode.org/unicode/reports/tr10/.
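The difference between raw codepoint order and a UCA-style order shows up clearly with accented words. The following sketch is plain JavaScript (Node.js), not MarkLogic code; `Intl.Collator('en')` is used as a stand-in for the root collation on the assumption that English rules do not reorder these particular characters.

```javascript
// Escapes are used so the strings are unambiguous precomposed characters.
const accented = ['c\u00f4te', 'cote', 'cot\u00e9', 'c\u00f4t\u00e9', 'cpte'];

// Code point order: 'p' (U+0070) sorts before 'ô' (U+00F4), so "cpte"
// lands in the middle of the "cote" variants.
const codePointOrder = [...accented].sort();

// UCA-style order: accents are a secondary difference, so every variant
// of "cote" sorts before "cpte".
const ucaOrder = [...accented].sort(new Intl.Collator('en').compare);

console.log(codePointOrder.join(', '));
console.log(ucaOrder.join(', '));
```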
The collation used for requests in MarkLogic Server is based on the settings of various parameters in the Admin Interface and on what is specified in your XQuery code. Each App Server has a default collation specified, and that is used in the absence of anything else that overrides it. Note the following about collations and their defaults.
http://marklogic.com/collation/codepoint
).http://marklogic.com/collation/
).http://marklogic.com/collation/
).
xquery version "1.0-ml";
declare default collation "http://marklogic.com/collation/fr";
for $x in ("côte", "cote", "coté", "côté", "cpte")
order by $x
return $x
http://marklogic.com/collation/codepoint
The following is an alias to the codepoint collation URI (used with the 1.0 strict XQuery dialect):
http://www.w3.org/2005/xpath-functions/collation/codepoint
The xdmp:collation-canonical-uri built-in XQuery function returns the canonical URI of any valid collation URI.
xdmp:collation-canonical-uri("") => http://marklogic.com/collation/codepoint
declare default collation expression in the prolog), but it will default to the context from the calling module.
You can specify collations in many places. Some common places to specify collations are:
order by clause of a FLWOR expression.
fn:deep-equal, fn:distinct-values, fn:index-of, fn:max, fn:min).
The following table lists the languages for which MarkLogic Server supports language-specific tokenization and stemming. It also lists some common collations and character sets for each language.
Note that some of the listed character set names can be ambiguous. MarkLogic uses the International Components for Unicode (ICU) library for character encoding and conversion. For best accuracy, refer to the ICU converter alias mapping at http://demo.icu-project.org/icu-bin/convexp.
All of the languages except English require a license key to enable. If you do not have the license key for one of the supported languages, it is treated as a generic language, and each word is stemmed to itself and it is tokenized in a generic way (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters). For more information, see Generic Language Support. The language-specific collations are available to all languages, regardless of what languages are enabled in the license key.