Search Developer's Guide — Chapter 30

Encodings and Collations

In addition to the language support described in Language Support in MarkLogic Server, MarkLogic Server supports many character encodings and can sort content in a variety of collations. This chapter describes the MarkLogic Server support of encodings and collations, and includes the following sections:

  • Character Encoding
  • Collations
  • Collations and Character Sets By Language

Character Encoding

MarkLogic Server stores all content in the UTF-8 encoding. If you try to load non-UTF-8 content into MarkLogic Server without translating it to UTF-8, the server throws an exception. If you have non-UTF-8 content, then you can specify the encoding for the content during ingestion, and MarkLogic Server will translate it to UTF-8. If the content cannot be translated, MarkLogic Server throws an exception indicating that there is non-UTF-8 content.

You can specify an explicit encoding when you ingest content. The encoding is determined using the following precedence, from highest to lowest:

  • The encoding option of the ingestion function, if set.
  • The encoding specified by the HTTP headers, if present.
  • Otherwise, assume UTF-8.
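The precedence above can be sketched as a small function. This is a minimal illustration in Python, not MarkLogic code: `resolve_encoding` and its parameter names are hypothetical, and the special 'auto' value is not handled.

```python
from typing import Optional


def resolve_encoding(option: Optional[str], http_charset: Optional[str]) -> str:
    """Pick an encoding using the documented precedence (hypothetical helper)."""
    if option:
        # 1. The encoding option of the ingestion function wins if it is set.
        return option
    if http_charset:
        # 2. Otherwise use the encoding named in the HTTP headers, if present.
        return http_charset
    # 3. Otherwise assume UTF-8.
    return "UTF-8"


print(resolve_encoding("ISO-8859-1", "cp1252"))
print(resolve_encoding(None, "cp1252"))
print(resolve_encoding(None, None))
```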

If you set the encoding option to 'auto', then MarkLogic tries to determine the encoding from the document content.

If the encoding is UTF-8 and any non-UTF-8 characters are found, an exception is thrown indicating the content contains non-UTF-8 characters.

MarkLogic Server assumes the character set you specify is actually the character set of the content. If you specify an encoding that is different from the actual content encoding, the result can be unpredictable: You might get an exception in some situations, but you might end up with the wrong characters in other situations.
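The following Python sketch shows the three outcomes described above, using the standard library codecs rather than MarkLogic itself: a correct encoding translates cleanly, bytes treated as UTF-8 when they are not raise an error, and a wrong single-byte encoding silently produces different characters. The sample string and the choice of ISO-8859-1 and ISO-8859-5 are assumptions made for illustration.

```python
text = "côté"
latin1_bytes = text.encode("iso-8859-1")  # content that is not UTF-8

# Naming the correct encoding lets the bytes be translated to UTF-8.
translated = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# Treating the same bytes as UTF-8 fails, because they are not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Naming the wrong single-byte encoding raises no error,
# but the resulting characters are not the original ones.
wrong = latin1_bytes.decode("iso-8859-5")

print(translated.decode("utf-8") == text)
print(decoded_as_utf8)
print(wrong == text)
```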

For details on the syntax of the encoding option, see the MarkLogic XQuery and XSLT Function Reference.

Collations

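This section describes collations in MarkLogic Server. Collations specify the order in which strings are sorted and how they are compared. The section includes the following parts:

  • Overview of Collations
  • Two Common Collation URIs
  • Collation URI Syntax
  • Backward Compatibility with 3.1 Range Indexes and Lexicons
  • UCA Root Collation
  • How Collation Defaults are Determined
  • Specifying Collations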

Overview of Collations

A collation specifies the order for sorting strings. The collation settings determine the order for operations where the order is specified (either implicitly or explicitly) and for operations that use Range Indexes. Examples of operations that specify the order are XQuery statements with an order by clause, XQuery standard functions that compare order (for example, fn:compare, fn:substring-after, fn:substring-before, and so on), and lexicon functions (for example, cts:words, cts:element-word-match, cts:element-values, and so on). Additionally, collations determine uniqueness in string comparisons, so two strings that are equal according to one collation might not be equal according to another.
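To make the point about uniqueness concrete, the following Python sketch compares the same strings under three different comparison rules. It only approximates the idea with the standard unicodedata module; it does not implement the Unicode collation algorithm, and the helper names are hypothetical.

```python
import unicodedata


def strip_diacritics(s: str) -> str:
    """Decompose the string and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_sensitive(s: str) -> str:
    """Case and diacritic sensitive: compare the normalized string as is."""
    return unicodedata.normalize("NFC", s)


def key_case_insensitive(s: str) -> str:
    """Diacritic sensitive, case insensitive."""
    return key_sensitive(s).casefold()


def key_case_diacritic_insensitive(s: str) -> str:
    """Case and diacritic insensitive."""
    return strip_diacritics(s).casefold()


def equal_under(a: str, b: str, key) -> bool:
    """Two strings are 'equal' if the chosen rule maps them to the same key."""
    return key(a) == key(b)


print(equal_under("Côte", "cote", key_sensitive))
print(equal_under("Cote", "cote", key_case_insensitive))
print(equal_under("Côte", "cote", key_case_diacritic_insensitive))
```

The same pair of strings can therefore count as one value or two, depending on which rule is in effect.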

The codepoint-order collation sorts according to the Unicode codepoint order, which does not take into account any language-specific information. Other collations are often used to specify language-specific sorting differences. For example, a codepoint sort puts all uppercase letters before lowercase letters, so the word Zounds sorts before the word abracadabra. If you use a collation that sorts uppercase and lowercase letters together (for example, the order A a B b C c, and so on), then abracadabra sorts before Zounds.
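The difference can be reproduced in a few lines of Python, whose default string comparison is by code point. The second sort key is a simplified stand-in for a collation that interleaves uppercase and lowercase letters; it is an assumption for illustration, not the algorithm MarkLogic uses.

```python
words = ["abracadabra", "Zounds"]

# Python compares strings code point by code point, so "Z" (U+005A)
# sorts before "a" (U+0061).
codepoint_order = sorted(words)


def interleaved_key(word: str):
    """Order letters as A a B b C c: letter first, uppercase before lowercase."""
    return [(ch.lower(), ch.islower()) for ch in word]


interleaved_order = sorted(words, key=interleaved_key)

print(codepoint_order)
print(interleaved_order)
print(sorted(["b", "A", "a", "B"], key=interleaved_key))
```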

Collations are specified with a URI (for example, http://marklogic.com/collation/). The collation URIs are specific to MarkLogic Server, but they specify collations according to the Unicode collation standards. There are many variations to collations, and many sort orders that are based on preferences and traditions in various languages. The following section describes the syntax of collation URIs. Although there are a huge number of collation URIs possible, most applications will use only a small number of collations. For more information about collations, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.

Two Common Collation URIs

The following are two very common collation URIs used in MarkLogic Server:

  • http://marklogic.com/collation/
  • http://marklogic.com/collation/codepoint

The first one is the UCA Root Collation (see UCA Root Collation), and is the system default. The second is the codepoint order collation, and was the default in pre-3.2 releases of MarkLogic Server.

Collation URI Syntax

Collations in MarkLogic Server are specified by a URI. All collations begin with the string http://marklogic.com/collation/. The syntax for collations is as follows:

http://marklogic.com/collation/<locale>[/<attribute>]*

This section describes the following parts of the syntax:

  • Locale Portion of the Collation URI
  • Attribute Portion of the Collation URI

Locale Portion of the Collation URI

The <locale> portion of the collation URI must be a valid locale, and is defined as follows:

<locale> ::= <language>[-<script>][_<region>][@(collation=<value>;)+]

For a list of valid language codes, see the following:

http://www.loc.gov/standards/iso639-2/php/code_list.php

For a list of valid script codes, see the following:

http://www.unicode.org/iso15924/iso15924-codes.html

For a list of valid region codes, see the following:

http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html

Some languages (for example, German and Chinese) have multiple collations you can specify in the locale. To specify one of these language-specific collation variants, use the @collation=<value> portion of the syntax.

If you do not specify a locale in the collation URI, the UCA Root Collation is used by default (for details, see UCA Root Collation).

While you can specify many valid language, script, or region codes, MarkLogic Server only fully supports those that are relevant to and most commonly used with the supported languages. For a list of supported languages along with their common collations, see Collations and Character Sets By Language.
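The locale grammar above can be sketched as a regular expression. This Python example is illustrative only: the code lengths (two or three letters for the language, four for the script, two for the region) and the optional trailing semicolon are assumptions, and `parse_locale` is a hypothetical helper, not a MarkLogic function. It checks the shape of a locale, not whether the codes are supported.

```python
import re

# <language>[-<script>][_<region>][@(collation=<value>;)+]
LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:_(?P<region>[A-Za-z]{2}))?"
    r"(?:@(?P<keywords>(?:collation=[A-Za-z0-9]+;?)+))?$"
)


def parse_locale(locale: str) -> dict:
    """Split a locale string into its parts, or raise ValueError."""
    match = LOCALE_RE.match(locale)
    if match is None:
        raise ValueError(f"not a well-formed locale: {locale!r}")
    values = re.findall(r"collation=([A-Za-z0-9]+)", match.group("keywords") or "")
    return {
        "language": match.group("language"),
        "script": match.group("script"),
        "region": match.group("region"),
        # If several collation keywords are given, keep the last one.
        "collation": values[-1] if values else None,
    }


print(parse_locale("en_US"))
print(parse_locale("zh-Hant"))
print(parse_locale("de@collation=phonebook"))
```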

The following table lists some typical locales, along with a brief description:

Locale | Description | Collation URI
en | English language | http://marklogic.com/collation/en
en_US | English language with United States region | http://marklogic.com/collation/en_US
zh | Chinese language | http://marklogic.com/collation/zh
de@collation=phonebook | German language with the phonebook collation | http://marklogic.com/collation/de@collation=phonebook

Attribute Portion of the Collation URI

There can be zero or more <attribute> portions of the collation URI. Attributes further specify characteristics such as which collation to use, whether to be case sensitive or case insensitive, and so on. You only need to specify attributes if they differ from the defaults for the specified locale. Attributes have the following syntax:

<attribute> ::= <strength> | <case-level> | <case-first> | 
                <alternate> | <numeric-collation> | 
                <variable-top> | <normalization-checking> | 
                <french> | <hiragana>

The following table describes the various attributes. For simplicity, terms like case-sensitive, diacritic-sensitive, and others are used. In actuality, the definitions of these terms for use in collations are somewhat more complicated. For the exact technical meaning of each attribute, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.

Attribute | Legal Value | Description
<strength> | | The level of comparison to use.
 | S1 | Specifies case and diacritic insensitive.
 | S2 | Specifies diacritic sensitive and case insensitive.
 | S3 | Specifies case and diacritic sensitive.
 | S4 | Specifies punctuation sensitive.
 | SI | Specifies identity (codepoint differentiated).
<case-level> | | Enables or disables the case-sensitive level, skipping the diacritic-sensitive level. For example, diacritic insensitive, case sensitive is /S1/EO. Default: EX
 | EO | Enables the case level.
 | EX | Disables the case level.
<case-first> | | Specifies whether uppercase sorts before or after lowercase. Default: CX
 | CU | Specifies that uppercase sorts first.
 | CL | Specifies that lowercase sorts first.
 | CX | Off.
<alternate> | | Specifies how to handle variable characters (as completely ignorable or as normal characters). Default: AN
 | AN | Specifies that all characters are non-ignorable; that is, all spaces and punctuation characters are included when sorting.
 | AS | Specifies that variable characters are shifted (ignored) according to the variable-top setting.
<numeric-collation> | | Orders numbers as numbers rather than in collation order (for example, 20 < 100). Default: MX
 | MO | Specifies numeric ordering.
 | MX | Specifies non-numeric ordering (order according to the collation).
<variable-top> | | Used with alternate to specify which variable characters are ignorable. Any character that is primary-less-than the cutoff character is treated as ignorable (for details on this concept, see the Unicode link in UCA Root Collation). Only meaningful in combination with AS. Default: T0000
 | T0000 | Specifies that all variable characters (typically whitespace and punctuation) are ignorable when sorting.
 | T0020 | Specifies that whitespace is ignorable when sorting. For example, /T0020/AS means that period (a variable character) is treated as a regular character but space is ignorable. Therefore: A B = AB and AB < A.B.
 | T00BB | Specifies that most punctuation and space characters are ignorable when sorting. Specifically, characters whose sort key is less than or equal to 00BB are ignorable.
<normalization-checking> | | Specifies whether to perform Unicode normalization on the input string. Default: NX
 | NO | Specifies normalize Unicode.
 | NX | Specifies do not normalize Unicode.
<french> | | Specifies whether to apply the French accent ordering rule (that is, to reverse the ordering at the S2 level). Default: FX
 | FO | Specifies French accent ordering.
 | FX | Specifies normal ordering (according to the collation).
<hiragana> | | Specifies whether to add an additional level to distinguish Hiragana from Katakana. Default: HX
 | HO | Hiragana mode on.
 | HX | Hiragana mode off.
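As a way of reading collation URIs against the attribute table, the following Python sketch splits a URI into its locale and attribute codes and names each attribute. It is a hypothetical helper for illustration, not part of MarkLogic. It assumes that attribute codes are separated by slashes as in the syntax shown earlier, that a variable-top value is the letter T followed by four hexadecimal digits, and it does not special-case the codepoint collation URI.

```python
import re

PREFIX = "http://marklogic.com/collation/"

# Attribute codes taken from the table above, mapped to their attribute names.
ATTRIBUTE_CODES = {
    "S1": "strength", "S2": "strength", "S3": "strength",
    "S4": "strength", "SI": "strength",
    "EO": "case-level", "EX": "case-level",
    "CU": "case-first", "CL": "case-first", "CX": "case-first",
    "AN": "alternate", "AS": "alternate",
    "MO": "numeric-collation", "MX": "numeric-collation",
    "NO": "normalization-checking", "NX": "normalization-checking",
    "FO": "french", "FX": "french",
    "HO": "hiragana", "HX": "hiragana",
}


def split_collation_uri(uri: str):
    """Return (locale, {attribute name: code}) for a collation URI."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a collation URI: {uri!r}")
    locale, *codes = uri[len(PREFIX):].split("/")
    attributes = {}
    for code in codes:
        if re.fullmatch(r"T[0-9A-F]{4}", code):
            name = "variable-top"
        elif code in ATTRIBUTE_CODES:
            name = ATTRIBUTE_CODES[code]
        else:
            raise ValueError(f"unknown attribute code: {code!r}")
        attributes[name] = code
    return locale, attributes


print(split_collation_uri("http://marklogic.com/collation/en/S1/EO"))
print(split_collation_uri("http://marklogic.com/collation/en/T0020/AS"))
```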

Backward Compatibility with 3.1 Range Indexes and Lexicons

Range Indexes and lexicons that were created in MarkLogic Server 3.1 use the Unicode codepoint collation order. If you want any of these indexes or lexicons to use a different collation, you must change the collation, re-create the index, and then reindex the database (if reindex enable is set to true, reindexing begins automatically).

UCA Root Collation

The Unicode collation algorithm (UCA) root collation in MarkLogic Server is used when no default exists. It uses the Unicode codepoint collation with S3 (case and diacritic sensitive) strength, and it has the following URI:

http://marklogic.com/collation/

The UCA root collation adds case and diacritic sensitivity to the Unicode codepoint order, so it produces more sensible sort orders when case and diacritics matter. For more details about the UCA, see http://www.unicode.org/unicode/reports/tr10/.

How Collation Defaults are Determined

The collation used for requests in MarkLogic Server is based on the settings of various parameters in the Admin Interface and on what is specified in your XQuery code. Each App Server has a default collation specified, and that is used in the absence of anything else that overrides it. Note the following about collations and their defaults.

  • Collations are specified at the App Server level, on Range Indexes, and on lexicons.
  • App Servers, Range Indexes, and lexicons upgraded from 3.1 remain in codepoint order (http://marklogic.com/collation/codepoint).
  • New App Servers default to the UCA Root Collation (http://marklogic.com/collation/).
  • New Range Indexes and lexicons default to UCA Root Collation (http://marklogic.com/collation/).
  • You can specify a default collation in an XQuery prolog, which overrides the App Server default. For example, the following query will use the French collation:
    xquery version "1.0-ml";
    declare default collation "http://marklogic.com/collation/fr";
    
    for $x in ("côte", "cote", "coté", "côté", "cpte" )
    order by $x 
    return $x
  • The codepoint collation URI is as follows:
    http://marklogic.com/collation/codepoint

    The following is an alias to the codepoint collation URI (used with the 1.0 strict XQuery dialect):

    http://www.w3.org/2005/xpath-functions/collation/codepoint
  • Collation URIs displayed in the Admin Interface are stored and displayed as the canonical representation of the URI entered. The canonical representation is equivalent to the URI entered, but it puts the portions of the collation URI string into a predetermined order and simplifies them. The xdmp:collation-canonical-uri built-in XQuery function returns the canonical URI of any valid collation URI.
  • The empty string URI becomes codepoint collation. Therefore, the following returns as shown:
    xdmp:collation-canonical-uri("")
    => http://marklogic.com/collation/codepoint
  • The collation used in an XQuery module is determined on a per-module basis. Therefore, a module might call another module that uses a different collation, as each module determines its collation independent of the module that called it (based on the App Server defaults, collation prolog declaration, and so on).
  • When a module is invoked or spawned from another module, or when a request is submitted via an xdmp:eval call from another module, the new request inherits the collation context of the calling module. That context can be overridden in the query (for example, with a declare default collation expression in the prolog), but it will default to the context from the calling module.
  • If no other collations are in effect (for example, for scheduled tasks), the codepoint collation is used.
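One way to read the rules above is as a fallback chain. The Python sketch below is only an interpretation of these bullets, not MarkLogic behavior verified against the server: `effective_collation` and its parameters are hypothetical, and placing the inherited calling context ahead of the App Server default is an assumption.

```python
from typing import Optional

CODEPOINT = "http://marklogic.com/collation/codepoint"


def effective_collation(
    prolog_declaration: Optional[str] = None,
    calling_context: Optional[str] = None,
    app_server_default: Optional[str] = None,
) -> str:
    """Return the first collation that is set, falling back to codepoint."""
    for candidate in (prolog_declaration, calling_context, app_server_default):
        if candidate is not None:
            return candidate
    # Nothing else is in effect (for example, a scheduled task).
    return CODEPOINT


# A prolog declaration overrides the App Server default.
print(effective_collation(
    prolog_declaration="http://marklogic.com/collation/fr",
    app_server_default="http://marklogic.com/collation/",
))
# With nothing set, the codepoint collation is used.
print(effective_collation())
```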

Specifying Collations

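You can specify collations in many places. Some common places to specify collations are:

  • The default collation of an App Server.
  • The definition of a Range Index or lexicon.
  • A declare default collation declaration in an XQuery prolog.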

Collations and Character Sets By Language

The following table lists the languages in which MarkLogic Server supports language-specific tokenization and stemming. It also lists some common collations and character sets for each language.

Language | Base Collation | Description | Character Sets
English | http://marklogic.com/collation/en | case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/en/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/en/S2 | diacritic sensitive |
 | http://marklogic.com/collation/en/S1/EO | case sensitive |
French | http://marklogic.com/collation/fr | case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/fr/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/fr/S2 | diacritic sensitive |
 | http://marklogic.com/collation/fr/S1/EO | case sensitive |
Italian | http://marklogic.com/collation/it | case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/it/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/it/S2 | diacritic sensitive |
 | http://marklogic.com/collation/it/S1/EO | case sensitive |
German | http://marklogic.com/collation/de | case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/de/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/de/S2 | diacritic sensitive |
 | http://marklogic.com/collation/de/S1/EO | case sensitive |
 | http://marklogic.com/collation/de@collation=phonebook | alternate German collation |
Spanish | http://marklogic.com/collation/es | case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/es/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/es/S2 | diacritic sensitive |
 | http://marklogic.com/collation/es/S1/EO | case sensitive |
 | http://marklogic.com/collation/es@collation=traditional | treats ll and ch as distinct characters |
Russian | http://marklogic.com/collation/ru | case/diacritic sensitive | cp1251, KOI8-R, ISO-8859-5
 | http://marklogic.com/collation/ru/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/ru/S2 | diacritic sensitive |
 | http://marklogic.com/collation/ru/S1/EO | case sensitive |
Arabic | http://marklogic.com/collation/ar | form-variant sensitive | cp1256, ISO-8859-6
 | http://marklogic.com/collation/ar/S1 | form-variant insensitive |
Chinese (Simplified and Traditional) | http://marklogic.com/collation/zh | (simplified) case/diacritic sensitive | Simplified: GB18030, GB2312, EUC-CN, hz-gb-2312, cp936. Traditional: Big5, Big5-HKSCS, cp950, GB18030
 | http://marklogic.com/collation/zh-Hant | (traditional) case/diacritic sensitive |
 | http://marklogic.com/collation/zh-Hant@collation=pinyin | (traditional with simplified order) locale-specific variant |
 | http://marklogic.com/collation/zh@collation=stroke | (simplified with traditional order) locale-specific variant |
Korean | http://marklogic.com/collation/ko | case/diacritic sensitive | ISO 2022-KR, EUC-KR, KS X 1001, cp949, GB12052, KSC 5636
 | http://marklogic.com/collation/ko/S1 | case/diacritic insensitive |
Persian (Farsi) | http://marklogic.com/collation/fa | case/diacritic sensitive | cp1256, ISO-8859-6
 | http://marklogic.com/collation/fa/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/fa/S2 | diacritic sensitive |
 | http://marklogic.com/collation/fa/NX | disable normalization |
Dutch | http://marklogic.com/collation/nl | case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/nl/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/nl/S2 | diacritic sensitive |
 | http://marklogic.com/collation/nl/S1/EO | case sensitive |
Japanese | http://marklogic.com/collation/ja | | Shift JIS: cp932, ibm-942, ibm-943. EUC-JP: EUC-JISX0213, ibm-954. ISO-2022-JP: ISO-2022-JP-1, ISO-2022-JP-2, ISO-2022-JP-3, ISO-2022-JP-2004
 | http://marklogic.com/collation/ja/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/ja/S2 | diacritic sensitive |
 | http://marklogic.com/collation/ja/S1/EO | case sensitive |
 | http://marklogic.com/collation/ja/S4/HX | Hiragana mode off |
Portuguese | http://marklogic.com/collation/pt | case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/pt/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/pt/S2 | diacritic sensitive |
 | http://marklogic.com/collation/pt/S1/EO | case sensitive |
Norwegian (Nynorsk and Bokmål) | http://marklogic.com/collation/nn | (Nynorsk) case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/nn/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/nn/S2 | diacritic sensitive |
 | http://marklogic.com/collation/nn/S1/EO | case sensitive |
 | http://marklogic.com/collation/nb | (Bokmål) case/diacritic sensitive |
 | http://marklogic.com/collation/nb/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/nb/S2 | diacritic sensitive |
 | http://marklogic.com/collation/nb/S1/EO | case sensitive |
Swedish | http://marklogic.com/collation/sv | case/diacritic sensitive | ISO-8859-1, cp1252
 | http://marklogic.com/collation/sv/S1 | case/diacritic insensitive |
 | http://marklogic.com/collation/sv/S2 | diacritic sensitive |
 | http://marklogic.com/collation/sv/S1/EO | case sensitive |

All of the languages except English require a license key. If you do not have the license key for one of the supported languages, that language is treated as a generic language: each word is stemmed to itself, and text is tokenized generically (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters). For more information, see Generic Language Support. The language-specific collations are available for all languages, regardless of which languages are enabled in the license key.
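The generic behavior described above can be approximated in a few lines of Python. This is a rough sketch, not the MarkLogic tokenizer: the test for "Asian characters" (CJK ideographs, Hiragana, Katakana, and Hangul, detected by Unicode character name) is an assumption, the function names are hypothetical, and only word tokens are returned, with whitespace and punctuation discarded.

```python
import unicodedata

ASIAN_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")


def is_asian(ch: str) -> bool:
    """Assumed definition of an 'Asian character', based on its Unicode name."""
    return unicodedata.name(ch, "").startswith(ASIAN_NAME_PREFIXES)


def generic_tokenize(text: str) -> list:
    """Split on whitespace and punctuation; emit each Asian character alone."""
    tokens, word = [], []
    for ch in text:
        if is_asian(ch):
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:
            # Whitespace or punctuation ends the current word.
            if word:
                tokens.append("".join(word))
                word = []
    if word:
        tokens.append("".join(word))
    return tokens


def generic_stem(word: str) -> str:
    """In a generic language, each word is stemmed to itself."""
    return word


print(generic_tokenize("Hello, world!"))
print(generic_tokenize("日本語 text"))
```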
