Loading Content Into MarkLogic Server — Chapter 3

Specifying Encoding and Language

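You can specify the encoding and default language while loading a document. You can also automatically detect the encoding or manually detect the language (for example, using xdmp:encoding-language-detect). This section describes how to load documents with a specific encoding or language, and includes the following parts:

  • Understanding Character Encoding
  • Explicitly Specifying Character Encoding While Loading
  • Automatically Detecting the Encoding
  • Inferring the Language and Encoding of a Node in XQuery with xdmp:encoding-language-detect
  • Specifying the Default Language for XML Documents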

For more information about languages, see Language Support in MarkLogic Server in the Search Developer's Guide.

Understanding Character Encoding

MarkLogic Server stores non-binary content in the UTF-8 encoding. MarkLogic Server automatically transcodes the content from the input encoding to UTF-8 during loading. All MarkLogic Server interfaces for ingestion support an encoding option or configuration parameter.

When loading non-binary content with an encoding other than UTF-8, use the option or configuration parameter to specify the input encoding. If the content cannot be converted to UTF-8, MarkLogic Server throws an exception. Specifying an encoding that does not match the input documents can lead to unexpected results.

Note the following about character encodings and conversion:

  • MarkLogic Server always converts non-binary content into UTF-8.
  • If no explicit encoding is specified through options or HTTP headers, the encoding defaults to UTF-8.
  • If no explicit encoding option is specified, but there is an HTTP header specifying the encoding, then that encoding is used.
  • If the encoding is UTF-8, an exception is thrown if the content contains byte sequences that are not valid UTF-8.
  • MarkLogic Server assumes the character set you specify is actually the character set of the content. If you specify an encoding that differs from the actual encoding of the content, the result is undefined: an exception might be thrown during loading, or the characters might be transcoded incorrectly.

Explicitly Specifying Character Encoding While Loading

The table below summarizes the mechanisms available for explicitly specifying character encoding. See the interface-specific documentation for details. If no encoding is specified, MarkLogic Server defaults to UTF-8 for all non-binary documents.

| Interface | Method | For Details See |
|---|---|---|
| MarkLogic Connector for Hadoop | The configuration property mapreduce.marklogic.output.content.encoding. | Configuring a MapReduce Job in the MarkLogic Connector for Hadoop Developer's Guide; MarkLogicConstants in the MarkLogic Hadoop MapReduce Connector API |
| MarkLogic Content Pump (mlcp) | Character encoding cannot be controlled. Only UTF-8 is supported. | Loading Content Using MarkLogic Content Pump |
| MarkLogic Java API | Various handles. | Conversion of Document Encoding in the Java Application Developer's Guide |
| REST Client API | The charset parameter of the HTTP Content-Type header. However, text, XML, and JSON content must be UTF-8 encoded. | REST Application Developer's Guide |
| XCC | Encoding property of the ContentCreateOptions class. | Javadoc for XCC; dotnet for XCC (C# API) |
| XQuery | The encoding element of the options parameter to xdmp:document-load, xdmp:document-get, xdmp:zip-get, and xdmp:http-get. | XQuery and XSLT Reference Guide |

The following XQuery example loads the document using the ISO-8859-1 encoding, transcoding the content from ISO-8859-1 to UTF-8 during the load:

  <options xmlns="xdmp:document-load">
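
Only the opening tag of the options node survives in the example above. As a minimal sketch of what a complete call might look like, the following passes an encoding element in the options node of xdmp:document-load. The file path and the URI are hypothetical placeholders, not values taken from the original example.

```xquery
xquery version "1.0-ml";

(: Hypothetical path and URI; substitute your own.
   The file on disk is assumed to be ISO-8859-1 encoded. :)
xdmp:document-load("/tmp/my-document.xml",
  <options xmlns="xdmp:document-load">
    <uri>/my-document.xml</uri>
    <encoding>ISO-8859-1</encoding>
  </options>)
```

The document is stored as UTF-8 regardless of the value given in the encoding element; the option only tells the server how to interpret the incoming bytes.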

Automatically Detecting the Encoding

For those interfaces that support auto-detection of encoding, MarkLogic Server attempts to automatically detect the encoding of non-binary content during loading if the explicitly specified encoding is auto.

Automatic encoding detection chooses the encoding of the first result returned by the xdmp:encoding-language-detect XQuery function. Detection is heuristic: the encoding of some content is ambiguous, and very small documents give the detector little to work with, so auto-detection can choose the wrong encoding. For documents that are not too small, detection is fairly accurate.

The following XQuery example demonstrates using automatic character encoding detection when loading a document using xdmp:document-load:

  <options xmlns="xdmp:document-load">
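
The example above is cut off after the opening options tag. A minimal sketch of a complete call, assuming a hypothetical file path and URI, sets the encoding element to auto as described at the start of this section:

```xquery
xquery version "1.0-ml";

(: Hypothetical path and URI; substitute your own.
   The value "auto" asks the server to detect the input encoding. :)
xdmp:document-load("/tmp/my-document.xml",
  <options xmlns="xdmp:document-load">
    <uri>/my-document.xml</uri>
    <encoding>auto</encoding>
  </options>)
```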

For details, see the interface-specific documentation or Explicitly Specifying Character Encoding While Loading.

Inferring the Language and Encoding of a Node in XQuery with xdmp:encoding-language-detect

If you do not want to rely on the automatic detection for the encoding or if you want to detect the language, you can use the xdmp:encoding-language-detect function. The xdmp:encoding-language-detect function returns XML elements, each of which specifies a possible encoding and language for the specified node. Each element also has a score, and the one with the highest score (the first element returned) has the most likely encoding and language.

<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding-language xmlns="xdmp:encoding-language-detect">
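
The fragments above show only the opening tag of each returned element. The following is a minimal sketch of calling the function and reading the top candidate. The file path is hypothetical, the file is fetched as binary on the assumption that this avoids any transcoding before detection, and the child element names (encoding, language, score) are assumed from the description above.

```xquery
xquery version "1.0-ml";
declare namespace eld = "xdmp:encoding-language-detect";

(: Hypothetical path. Fetch as binary so the raw bytes reach the detector. :)
let $node := xdmp:document-get("/tmp/unknown.txt",
  <options xmlns="xdmp:document-get">
    <format>binary</format>
  </options>)
(: Candidates are returned most likely first, so take the first one. :)
let $best := xdmp:encoding-language-detect($node)[1]
return (
  fn:string($best/eld:encoding),
  fn:string($best/eld:language),
  fn:string($best/eld:score)
)
```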

The encoding detection is typically fairly accurate when the score is greater than 10. The language detection tends to be less accurate, however, because some languages are difficult to tell apart. Because xdmp:encoding-language-detect returns the raw candidates, you can apply whatever logic you want to its output to determine the language. For example, if you know from your knowledge of the content that the language is either Italian or Spanish, you can ignore entries for other languages.
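
One way to apply such knowledge is to filter the candidates before choosing one. The sketch below is illustrative only: local:best-for-languages is a hypothetical helper, the file path is a placeholder, and the language values are assumed to be two-letter codes such as "it" and "es".

```xquery
xquery version "1.0-ml";
declare namespace eld = "xdmp:encoding-language-detect";

(: Hypothetical helper: return the first candidate whose language is in
   $allowed, or the empty sequence if none matches. Relies on the candidates
   arriving in descending score order. :)
declare function local:best-for-languages(
  $candidates as element(eld:encoding-language)*,
  $allowed as xs:string*
) as element(eld:encoding-language)?
{
  ($candidates[eld:language = $allowed])[1]
};

(: Hypothetical path; fetched as binary so no transcoding happens first. :)
let $node := xdmp:document-get("/tmp/unknown.txt",
  <options xmlns="xdmp:document-get">
    <format>binary</format>
  </options>)
let $candidates := xdmp:encoding-language-detect($node)
return local:best-for-languages($candidates, ("it", "es"))
```

Returning the empty sequence when nothing matches leaves the fallback decision, such as using the database default language, to the caller.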

The language or encoding of a block of text is sometimes ambiguous, so detection is prone to error. As a rule, the larger the block of text, the more accurate the detection. If the block you pass to xdmp:encoding-language-detect is more than a few paragraphs long (several hundred bytes), the detection is typically fairly accurate.

Specifying the Default Language for XML Documents

The formal or natural language of XML content is determined by the element attribute xml:lang. The language affects how MarkLogic Server tokenizes content, and therefore affects searching and indexing.
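
As an illustration, the made-up document below declares English on its root element and overrides it with French on one child. Under the XML specification, an xml:lang value applies to an element and its descendants until another xml:lang overrides it. The element names are invented for this sketch.

```xquery
xquery version "1.0-ml";

(: Illustrative document only: the root declares English,
   and the quote element overrides it with French. :)
<article xml:lang="en">
  <title>Loading content</title>
  <quote xml:lang="fr">Chargement du contenu</quote>
</article>
```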

When an XML document has no explicit xml:lang attribute at load time, MarkLogic Server uses the default language configured for the database. Set the database-wide default language through the language setting in the Admin UI.

You can override the configured default language for the database using load options, as shown by the table below:

| Interface | Method | For Details See |
|---|---|---|
| MarkLogic Connector for Hadoop | The configuration property mapreduce.marklogic.output.content.language. | Configuring a MapReduce Job in the MarkLogic Connector for Hadoop Developer's Guide; MarkLogicConstants in the MarkLogic Hadoop MapReduce Connector API |
| MarkLogic Content Pump | The -output_language command line option. | Importing Content Into MarkLogic Server in the mlcp User Guide |
| XCC | ContentCreateOptions.setLanguage | Javadoc for XCC |
| XQuery | Set the <default-language> element of the <options> node passed to xdmp:document-load, xdmp:document-get, xdmp:http-get, or xdmp:zip-get. | XQuery and XSLT Reference Guide |
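
For the XQuery case, a minimal sketch might look like the following. The file path and URI are hypothetical, and "fr" is just an example language code.

```xquery
xquery version "1.0-ml";

(: Hypothetical path and URI; substitute your own.
   Content without an explicit xml:lang is treated as French for this load,
   overriding the database default language. :)
xdmp:document-load("/tmp/mon-document.xml",
  <options xmlns="xdmp:document-load">
    <uri>/mon-document.xml</uri>
    <default-language>fr</default-language>
  </options>)
```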

For details on languages, see Language Support in MarkLogic Server in the Search Developer's Guide.
