Search Developer's Guide — Chapter 30

Encodings and Collations

In addition to the language support described in Language Support in MarkLogic Server, MarkLogic Server also supports many character encodings and has the ability to sort the content in a variety of collations. This chapter describes the MarkLogic Server support of encodings and collations, and includes the following sections:

Character Encoding
Collations
Collations and Character Sets By Language

Character Encoding

MarkLogic Server stores all content in the UTF-8 encoding. If you try to load non-UTF-8 content into MarkLogic Server without translating it to UTF-8, the server throws an exception. If you have non-UTF-8 content, then you can specify the encoding for the content during ingestion, and MarkLogic Server will translate it to UTF-8. If the content cannot be translated, MarkLogic Server throws an exception indicating that there is non-UTF-8 content.

You can specify an explicit encoding in the following ways:

If your content is ingested on behalf of an HTTP request, you can specify an encoding in the HTTP headers, such as setting the charset parameter of the Content-type header.

Set the encoding option of the functions listed in the following table.

XQuery	JavaScript
xdmp:document-load	xdmp.documentLoad
xdmp:document-get	xdmp.documentGet
xdmp:zip-get	xdmp.zipGet
xdmp:gunzip	xdmp.gunzip
xdmp:xslt-invoke	xdmp.xsltInvoke
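
As a minimal sketch of the encoding option, the following query loads a file that is assumed to be encoded as ISO-8859-1. The file path and document URI are hypothetical placeholders; the options element uses the xdmp:document-load options namespace.

```xquery
xquery version "1.0-ml";

(: Load a file from the server's filesystem, declaring that the bytes on
   disk are ISO-8859-1 so that they are translated to UTF-8 on ingestion.
   The path and URI below are placeholders, not real locations. :)
xdmp:document-load(
  "/tmp/legacy-latin1.xml",
  <options xmlns="xdmp:document-load">
    <uri>/docs/legacy-latin1.xml</uri>
    <encoding>ISO-8859-1</encoding>
  </options>
)
```

If the file is not actually ISO-8859-1, the caveat about mismatched encodings applies: the load might fail, or it might succeed with the wrong characters.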

Encoding is determined using the following precedence, from highest to lowest:

The encoding option of the ingestion function, if set.
The encoding specified by the HTTP headers, if present.
Otherwise, assume UTF-8.

If you set the encoding option to 'auto', then MarkLogic tries to determine the encoding from the document content.
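
The following sketch shows one way the 'auto' setting might be used, here with xdmp:document-get so that nothing is written to the database. The file path is a hypothetical placeholder.

```xquery
xquery version "1.0-ml";

(: Read a file without inserting it, asking MarkLogic to determine the
   encoding from the content. The path is a placeholder. :)
xdmp:document-get(
  "/tmp/unknown-encoding.txt",
  <options xmlns="xdmp:document-get">
    <encoding>auto</encoding>
    <format>text</format>
  </options>
)
```

Detection is a best-effort guess from the content, so when you know the encoding it is usually safer to name it explicitly.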

If the encoding is UTF-8 and any non-UTF-8 characters are found, an exception is thrown indicating the content contains non-UTF-8 characters.

MarkLogic Server assumes the character set you specify is actually the character set of the content. If you specify an encoding that is different from the actual content encoding, the result can be unpredictable: You might get an exception in some situations, but you might end up with the wrong characters in other situations.

For details on the syntax of the encoding option, see the MarkLogic XQuery and XSLT Function Reference.

Collations

This section describes collations in MarkLogic Server. Collations specify the order in which strings are sorted and how they are compared. The section includes the following parts:

Overview of Collations
Two Common Collation URIs
Collation URI Syntax
Backward Compatibility with 3.1 Range Indexes and Lexicons
UCA Root Collation
How Collation Defaults are Determined
Specifying Collations

Overview of Collations

A collation specifies the order for sorting strings. The collation settings determine the order for operations where the order is specified (either implicitly or explicitly) and for operations that use Range Indexes. Examples of operations that specify the order are XQuery statements with an order by clause, XQuery standard functions that compare order (for example, fn:compare, fn:substring-after, fn:substring-before, and so on), and lexicon functions (for example, cts:words, cts:element-word-match, cts:element-values, and so on). Additionally, collations determine uniqueness in string comparisons, so two strings that are equal according to one collation might not be equal according to another.
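
To illustrate how equality depends on the collation, the following sketch compares the same pair of strings under two English collations. The example strings are arbitrary; the collation URIs are the case-sensitive and case-insensitive English collations listed later in this chapter.

```xquery
xquery version "1.0-ml";

(: The plain English collation is case and diacritic sensitive,
   so these two strings are expected to differ. :)
fn:compare("Zounds", "zounds", "http://marklogic.com/collation/en") ne 0,

(: The S1 strength is case and diacritic insensitive,
   so the same two strings are expected to compare as equal. :)
fn:compare("Zounds", "zounds", "http://marklogic.com/collation/en/S1") eq 0
```

Given the sensitivity descriptions in this chapter, both expressions should evaluate to true.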

The codepoint-order collation sorts according to the Unicode codepoint order, which does not take into account any language-specific information. Other collations are often used to specify language-specific sorting differences. For example, a codepoint sort puts all uppercase letters before lowercase letters, so the word Zounds sorts before the word abracadabra. If you use a collation that sorts uppercase and lowercase letters together (for example, the order A a B b C c, and so on), then abracadabra sorts before Zounds.
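
The difference can be sketched with two order by clauses over the same words, one using the codepoint collation and one using the UCA Root Collation. This is an illustration rather than a definitive test; the separator string is only there to make the two result sets easy to tell apart.

```xquery
xquery version "1.0-ml";

(
  (: Codepoint order: uppercase letters sort before lowercase letters. :)
  (for $w in ("abracadabra", "Zounds")
   order by $w collation "http://marklogic.com/collation/codepoint"
   return $w),

  "----",

  (: UCA Root Collation: uppercase and lowercase letters sort together. :)
  (for $w in ("abracadabra", "Zounds")
   order by $w collation "http://marklogic.com/collation/"
   return $w)
)
```

Following the description above, the first group should list Zounds before abracadabra, and the second group should list abracadabra before Zounds.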

Collations are specified with a URI (for example, http://marklogic.com/collation/). The collation URIs are specific to MarkLogic Server, but they specify collations according to the Unicode collation standards. There are many variations to collations, and many sort orders that are based on preferences and traditions in various languages. The following section describes the syntax of collation URIs. Although there are a huge number of collation URIs possible, most applications will use only a small number of collations. For more information about collations, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.

Two Common Collation URIs

The following are two very common collation URIs used in MarkLogic Server:

http://marklogic.com/collation/
http://marklogic.com/collation/codepoint

The first one is the UCA Root Collation (see UCA Root Collation), and is the system default. The second is the codepoint order collation, and was the default in pre-3.2 releases of MarkLogic Server.

Collation URI Syntax

Collations in MarkLogic Server are specified by a URI. All collations begin with the string http://marklogic.com/collation/. The syntax for collations is as follows:

http://marklogic.com/collation/<locale>[/<attribute>]*

This section describes the following parts of the syntax:

Locale Portion of the Collation URI
Attribute Portion of the Collation URI

Locale Portion of the Collation URI

The <locale> portion of the collation URI must be a valid locale, and is defined as follows:

<locale> ::= <language>[-<script>][_<region>][@(collation=<value>;)+]

For a list of valid language codes, see the following:

http://www.loc.gov/standards/iso639-2/php/code_list.php

For a list of valid script codes, see the following:

http://www.unicode.org/iso15924/iso15924-codes.html

For a list of valid region codes, see the following:

http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html

Some languages (for example, German and Chinese) have multiple collations you can specify in the locale. To specify one of these language-specific collation variants, use the @collation=<value> portion of the syntax.

If you do not specify a locale in the collation URI, the UCA Root Collation is used by default (for details, see UCA Root Collation).

While you can specify many valid language, script, or region codes, MarkLogic Server only fully supports those that are relevant to and most commonly used with the supported languages. For a list of supported languages along with their common collations, see Collations and Character Sets By Language.

The following table lists some typical locales, along with a brief description:

Locale	Description	Collation URI
en	English language	http://marklogic.com/collation/en
en_US	English language with United States region	http://marklogic.com/collation/en_US
zh	Chinese language	http://marklogic.com/collation/zh
de@collation=phonebook	German language with the `phonebook` collation	http://marklogic.com/collation/de@collation=phonebook
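
As a sketch of how a locale with a collation variant is used, the following query sorts a few German names with the phonebook variant. The names are arbitrary sample data, and no particular result order is asserted here.

```xquery
xquery version "1.0-ml";

(: Sort sample names using the German phonebook variant.
   Replace the URI with http://marklogic.com/collation/de to compare
   the result against the standard German collation. :)
for $name in ("Müller", "Mueller", "Muller", "Mutter")
order by $name collation "http://marklogic.com/collation/de@collation=phonebook"
return $name
```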

Attribute Portion of the Collation URI

There can be zero or more <attribute> portions of the collation URI. Attributes further specify characteristics such as which collation to use, whether to be case sensitive or case insensitive, and so on. You only need to specify attributes if they differ from the defaults for the specified locale. Attributes have the following syntax:

<attribute> ::= <strength> | <case-level> | <case-first> | 
                <alternate> | <numeric-collation> | 
                <variable-top> | <normalization-checking> | 
                <french> | <hiragana>

The following table describes the various attributes. For simplicity, terms like case-sensitive, diacritic-sensitive, and others are used. In actuality, the definitions of these terms for use in collations are somewhat more complicated. For the exact technical meaning of each attribute, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.

Attribute	Legal Values	Descriptions
<strength> The level of comparison to use.	S1	Specifies case and diacritic insensitive.
	S2	Specifies diacritic sensitive and case insensitive.
	S3	Specifies case and diacritic sensitive.
	S4	Specifies punctuation sensitive.
	SI	Specifies identity (codepoint differentiated).
<case-level> Enable or disable the case sensitive level, skipping the diacritic sensitive level. So diacritic insensitive, case sensitive is `/S1/EO` Default: `EX`	EO	Specifies enable case-level.
	EX	Specifies disable case-level.
<case-first> Specifies whether uppercase sorts before or after lowercase. Default: `CX`	CU	Specifies that uppercase sorts first.
	CL	Specifies that lowercase sorts first.
	CX	Off.
<alternate> Specifies how to handle variable characters. (As completely ignorable or as normal characters.) Default: `AN`	AN	Specifies that all characters are non-ignorable; that is, include all spaces and punctuation characters when sorting characters.
	AS	Specifies that variable characters are shifted (ignored) according to the `variable-top` setting.
<numeric-collation> Order numbers as numbers rather than collation order (for example, 20 < 100). Default: `MX`	MO	Specifies numeric ordering.
	MX	Specifies non-numeric ordering (order according to the collation).
<variable-top> Used with `alternate` to specify which variable characters are ignorable. Any character that is primary-less-than (for details on this concept, see the Unicode link in UCA Root Collation) the cutoff character will be treated as ignorable. Only meaningful in combination with `AS`. Default: `T0000`	T0000	Specifies that all variable characters (typically whitespace and punctuation) are ignored for sorting variable characters.
	T0020	Specifies that whitespace is ignorable when sorting characters. For example, `/T0020/AS` means that period (a variable character) would be treated as a regular character but space would be ignorable. Therefore: A B = AB and AB < A.B.
	T00BB	Specifies that most punctuation and space characters are ignorable when sorting characters. Specifically, characters whose sort key is less than or equal to `00BB` are ignorable.
<normalization-checking> Specifies whether to perform Unicode normalization on the input string. Default: `NX`	NO	Specifies normalize Unicode.
	NX	Specifies do not normalize Unicode.
<french> Specifies whether to apply the French accent ordering rule (that is, to reverse the ordering at the `S3` level). Default: `FX`	FO	Specifies French accent ordering.
	FX	Specifies normal ordering (according to the collation).
<hiragana> Specifies whether to add an additional level to distinguish Hiragana from Katakana. Default: `HX`	HO	Hiragana mode on.
	HX	Hiragana mode off.
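
The following sketch appends a single attribute to a locale, using the numeric-collation attribute from the table above. The labels are arbitrary sample data.

```xquery
xquery version "1.0-ml";

(: /MO enables numeric ordering, so runs of digits are compared by
   numeric value instead of character by character. :)
for $label in ("chapter 100", "chapter 20", "chapter 3")
order by $label collation "http://marklogic.com/collation/en/MO"
return $label
```

With numeric ordering on, the labels should come back in the order 3, 20, 100. Without the /MO attribute (the MX default), the digits are compared as characters, so "chapter 100" would be expected to sort first.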

Backward Compatibility with 3.1 Range Indexes and Lexicons

Range Indexes and lexicons that were created in MarkLogic Server 3.1 use the Unicode codepoint collation order. To use a different collation for any of these indexes or lexicons, you must re-create the index or lexicon with the new collation and then reindex the database (if reindex enable is set to true, reindexing begins automatically).

UCA Root Collation

The Unicode collation algorithm (UCA) root collation in MarkLogic Server is used when no default exists. It uses the Unicode codepoint collation with S3 (case and diacritic sensitive) strength, and it has the following URI:

http://marklogic.com/collation/

The UCA root collation adds case and diacritic sensitivity to the Unicode codepoint order, so it produces more sensible sort orders when case and diacritics matter. For more details about the UCA, see http://www.unicode.org/unicode/reports/tr10/.

How Collation Defaults are Determined

The collation used for requests in MarkLogic Server is based on the settings of various parameters in the Admin Interface and on what is specified in your XQuery code. Each App Server has a default collation specified, and that is used in the absence of anything else that overrides it. Note the following about collations and their defaults.

Collations are specified at the App Server level, on Range Indexes, and on lexicons.
App Servers, Range Indexes, and lexicons upgraded from 3.1 remain in codepoint order (http://marklogic.com/collation/codepoint).
New App Servers default to the UCA Root Collation (http://marklogic.com/collation/).
New Range Indexes and lexicons default to UCA Root Collation (http://marklogic.com/collation/).

You can specify a default collation in an XQuery prolog, which overrides the App Server default. For example, the following query will use the French collation:

```
xquery version "1.0-ml";
declare default collation "http://marklogic.com/collation/fr";

for $x in ("côte", "cote", "coté", "côté", "cpte")
order by $x
return $x
```

The codepoint collation URI is as follows:
```
http://marklogic.com/collation/codepoint
```
The following is an alias to the codepoint collation URI (used with the 1.0 strict XQuery dialect):
```
http://www.w3.org/2005/xpath-functions/collation/codepoint
```
Collation URIs entered in the Admin Interface are stored and displayed in their canonical representation. The canonical representation is equivalent to the URI entered, but its components are simplified and arranged in a predetermined order. The xdmp:collation-canonical-uri built-in XQuery function returns the canonical URI of any valid collation URI.
The empty string URI becomes codepoint collation. Therefore, the following returns as shown:
```
xdmp:collation-canonical-uri("")
=> http://marklogic.com/collation/codepoint
```
The collation used in an XQuery module is determined on a per-module basis. Therefore, a module might call another module that uses a different collation, as each module determines its collation independent of the module that called it (based on the App Server defaults, collation prolog declaration, and so on).
When a module is invoked or spawned from another module, or when a request is submitted via an xdmp:eval call from another module, the new request inherits the collation context of the calling module. That context can be overridden in the query (for example, with a declare default collation expression in the prolog), but it will default to the context from the calling module.
If no other collations are in effect (for example, for scheduled tasks), the codepoint collation is used.
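
To check which collation is in effect for a module, or to see how a URI is canonicalized, a query along the following lines can help. This is a sketch; the URI passed to xdmp:collation-canonical-uri is an arbitrary example, and the values returned depend on your App Server configuration.

```xquery
xquery version "1.0-ml";

(: The default collation in effect for this module, after App Server
   defaults and any prolog declaration have been applied. :)
fn:default-collation(),

(: The canonical form of an example collation URI. :)
xdmp:collation-canonical-uri("http://marklogic.com/collation/en/S1/EO")
```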

Specifying Collations

You can specify collations in many places. Some common places to specify collations are:

In the order by clause of a FLWOR expression.
In an App Server configuration in the Admin Interface.
In a lexicon or Range Index specification in the Admin Interface.
In many W3C standard XQuery functions (for example, fn:compare, fn:contains, fn:starts-with, fn:ends-with, fn:substring-after, fn:substring-before, fn:deep-equal, fn:distinct-values, fn:index-of, fn:max, fn:min).
In the lexicon APIs (cts:words, cts:word-match, cts:element-words, cts:element-values, and so on).
In the range query constructors (cts:element-range-query, cts:element-attribute-range-query).
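
As a sketch of the lexicon and range query cases, the following query passes a collation through the options argument. It assumes a string range index already exists on a hypothetical title element with the same collation; without that index, the calls raise an error.

```xquery
xquery version "1.0-ml";

(: Assumption: a string range index on the hypothetical element <title>
   has been configured with this collation. :)
let $collation := "collation=http://marklogic.com/collation/en/S1"
return (
  (: Lexicon call: values of <title>, ordered by the named collation. :)
  cts:element-values(xs:QName("title"), (), ($collation, "ascending")),

  (: Range query: equality is evaluated under the named collation. :)
  cts:search(fn:doc(),
    cts:element-range-query(xs:QName("title"), "=", "zounds", $collation))
)
```

The collation named in the option must match a configured index, because the index stores its values in that collation's order.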

Collations and Character Sets By Language

The following table lists the languages in which MarkLogic Server supports language-specific tokenization and stemming. It also lists some common collations and character sets for each language.

Language	Base Collations		Character Sets
English	http://marklogic.com/collation/en	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/en/S1	case/diacritic insensitive
	http://marklogic.com/collation/en/S2	diacritic sensitive
	http://marklogic.com/collation/en/S1/EO	case sensitive
French	http://marklogic.com/collation/fr	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/fr/S1	case/diacritic insensitive
	http://marklogic.com/collation/fr/S2	diacritic sensitive
	http://marklogic.com/collation/fr/S1/EO	case sensitive
Italian	http://marklogic.com/collation/it	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/it/S1	case/diacritic insensitive
	http://marklogic.com/collation/it/S2	diacritic sensitive
	http://marklogic.com/collation/it/S1/EO	case sensitive
German	http://marklogic.com/collation/de	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/de/S1	case/diacritic insensitive
	http://marklogic.com/collation/de/S2	diacritic sensitive
	http://marklogic.com/collation/de/S1/EO	case sensitive
	http://marklogic.com/collation/de@collation=phonebook	alternate German collation
Spanish	http://marklogic.com/collation/es	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/es/S1	case/diacritic insensitive
	http://marklogic.com/collation/es/S2	diacritic sensitive
	http://marklogic.com/collation/es/S1/EO	case sensitive
	http://marklogic.com/collation/es@collation=traditional	Treats ll and ch as distinct characters
Russian	http://marklogic.com/collation/ru	case/diacritic sensitive	cp1251 KOI8-R ISO-8859-5
	http://marklogic.com/collation/ru/S1	case/diacritic insensitive
	http://marklogic.com/collation/ru/S2	diacritic sensitive
	http://marklogic.com/collation/ru/S1/EO	case sensitive
Arabic	http://marklogic.com/collation/ar	form-variant sensitive	cp1256 ISO-8859-6
	http://marklogic.com/collation/ar/S1	form-variant insensitive
Chinese (Simplified and Traditional)	http://marklogic.com/collation/zh (simplified)	case/diacritic sensitive	Simplified: GB18030 GB2312 EUC-CN hz-gb-2312 cp936 Traditional: Big5 Big5-HKSCS cp950 GB18030
	http://marklogic.com/collation/zh-Hant (traditional)	case/diacritic sensitive
	http://marklogic.com/collation/zh-Hant@collation=stroke (traditional with simplified order)	locale-specific variant
	http://marklogic.com/collation/zh@collation=pinyin (simplified with traditional order)	locale-specific variant
Korean	http://marklogic.com/collation/ko	case/diacritic sensitive	ISO 2022-KR EUC-KR KS X 1001 cp949 GB12052 KSC 5636
	http://marklogic.com/collation/ko/S1	case/diacritic insensitive
Persian (Farsi)	http://marklogic.com/collation/fa	case/diacritic sensitive	cp1256 ISO-8859-6
	http://marklogic.com/collation/fa/S1	case/diacritic insensitive
	http://marklogic.com/collation/fa/S2	diacritic sensitive
	http://marklogic.com/collation/fa/NX	disable normalization
Dutch	http://marklogic.com/collation/nl	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/nl/S1	case/diacritic insensitive
	http://marklogic.com/collation/nl/S2	diacritic sensitive
	http://marklogic.com/collation/nl/S1/EO	case sensitive
Japanese	http://marklogic.com/collation/ja	case/diacritic sensitive	Shift JIS: cp932 ibm-942 ibm-943 EUC-JP: EUC-JISX0213 ibm-954 ISO-2022-JP: ISO-2022-JP-1 ISO-2022-JP-2 ISO-2022-JP-3 ISO-2022-JP-2004
	http://marklogic.com/collation/ja/S1	case/diacritic insensitive
	http://marklogic.com/collation/ja/S2	diacritic sensitive
	http://marklogic.com/collation/ja/S1/EO	case sensitive
	http://marklogic.com/collation/ja/S4/HX	Hiragana mode off
Portuguese	http://marklogic.com/collation/pt	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/pt/S1	case/diacritic insensitive
	http://marklogic.com/collation/pt/S2	diacritic sensitive
	http://marklogic.com/collation/pt/S1/EO	case sensitive
Norwegian (Nynorsk and Bokmål)	http://marklogic.com/collation/nn (Nynorsk)	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/nn/S1	case/diacritic insensitive
	http://marklogic.com/collation/nn/S2	diacritic sensitive
	http://marklogic.com/collation/nn/S1/EO	case sensitive
	http://marklogic.com/collation/nb (Bokmål)	case/diacritic sensitive
	http://marklogic.com/collation/nb/S1	case/diacritic insensitive
	http://marklogic.com/collation/nb/S2	diacritic sensitive
	http://marklogic.com/collation/nb/S1/EO	case sensitive
Swedish	http://marklogic.com/collation/sv	case/diacritic sensitive	ISO-8859-1 cp1252
	http://marklogic.com/collation/sv/S1	case/diacritic insensitive
	http://marklogic.com/collation/sv/S2	diacritic sensitive
	http://marklogic.com/collation/sv/S1/EO	case sensitive

All of the languages except English require a license key to enable. If you do not have the license key for one of the supported languages, content in that language is treated as a generic language: each word stems to itself, and text is tokenized generically (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters). For more information, see Generic Language Support. The language-specific collations are available for all languages, regardless of which languages are enabled in the license key.
