Loading Content Into MarkLogic Server — Chapter 3

Specifying Encoding and Language

You can specify the encoding and default language while loading a document. You can also have the encoding detected automatically, or detect the language yourself (for example, using xdmp:encoding-language-detect). This section describes how to load documents with a specific encoding or language.

For more information about languages, see Language Support in MarkLogic Server in the Search Developer's Guide.

Understanding Character Encoding

MarkLogic Server stores all content in the UTF-8 encoding. If you try to load non-UTF-8 content into MarkLogic Server without translating it to UTF-8, the server throws an exception. If you have non-UTF-8 content, then you can specify the encoding for the content during ingestion, and MarkLogic Server will translate it to UTF-8. If the content cannot be translated, MarkLogic Server throws an exception indicating that there is non-UTF-8 content.

You can specify the encoding for content using either an encoding option on the ingestion function or via HTTP headers. For details, see Character Encoding in the Search Developer's Guide.

Explicitly Specifying Character Encoding While Loading

The table below summarizes the mechanisms available for explicitly specifying character encoding. See the interface-specific documentation for details. If no encoding is specified, MarkLogic Server defaults to UTF-8 for all non-binary documents.

| Interface | Method | For Details See |
|---|---|---|
| MarkLogic Connector for Hadoop | The configuration property mapreduce.marklogic.output.content.encoding. | MarkLogicConstants in the MarkLogic Hadoop MapReduce Connector API |
| MarkLogic Content Pump (mlcp) | Character encoding cannot be controlled. Only UTF-8 is supported. | Loading Content Using MarkLogic Content Pump |
| MarkLogic Java API | Various handles. | Conversion of Document Encoding in the Java Application Developer's Guide |
| REST Client API | The charset parameter of the HTTP Content-type header. However, text, XML, and JSON content must be UTF-8 encoded. | REST Application Developer's Guide |
| XCC | Java: ContentCreateOptions.setEncoding; .NET: Encoding property of the ContentCreateOptions class | Javadoc for XCC; dotnet for XCC (C# API) |
| XQuery | The encoding element of the options parameter to xdmp:document-load, xdmp:document-get, xdmp:zip-get, and xdmp:http-get. | XQuery and XSLT Reference Guide |

The following XQuery example loads the document using the ISO-8859-1 encoding, transcoding the content from ISO-8859-1 to UTF-8 during the load:

xdmp:document-load("c:/tmp/my-document.xml",
  <options xmlns="xdmp:document-load">
    <uri>/my-document.xml</uri>
    <encoding>ISO-8859-1</encoding>
  </options>)

Automatically Detecting the Encoding

For those interfaces that support auto-detection of encoding, MarkLogic Server attempts to automatically detect the encoding of non-binary content during loading if the explicitly specified encoding is auto.

Automatic encoding detection chooses the encoding equivalent to the first encoding returned by the xdmp:encoding-language-detect XQuery function. Encoding detection is not an exact science: the encoding of some content is ambiguous, so auto-detection can choose the wrong encoding. As long as the document is not too small, however, detection is fairly accurate.

The following XQuery example demonstrates using automatic character encoding detection when loading a document using xdmp:document-load:

xdmp:document-load("c:/tmp/my-document.xml",
  <options xmlns="xdmp:document-load">
    <uri>/my-document.xml</uri>
    <encoding>auto</encoding>
  </options>)

For details, see the interface-specific documentation or Explicitly Specifying Character Encoding While Loading.

Inferring the Language and Encoding of a Node in XQuery with xdmp:encoding-language-detect

If you do not want to rely on the automatic detection for the encoding or if you want to detect the language, you can use the xdmp:encoding-language-detect function. The xdmp:encoding-language-detect function returns XML elements, each of which specifies a possible encoding and language for the specified node. Each element also has a score, and the one with the highest score (the first element returned) has the most likely encoding and language.

xdmp:encoding-language-detect(
  xdmp:document-get("c:/tmp/session-login.css"))
=>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>en</language>
  <score>14.91</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>ro</language>
  <score>13.47</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>it</language>
  <score>12.84</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>fr</language>
  <score>12.71</score>
</encoding-language>
...

The encoding detection is typically fairly accurate when the score is greater than 10. The language detection tends to be less accurate, however, because some languages are difficult to tell apart. Because xdmp:encoding-language-detect gives you the raw data, you can apply whatever logic you want to its output to determine the language. For example, if you know from your knowledge of the content that the language is either Italian or Spanish, you can ignore entries for other languages.
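The Italian-or-Spanish case can be sketched in XQuery as follows. This is a minimal illustration, not a recommended implementation: the file path is reused from the example above, and the list of allowed languages is an assumption made for the sketch. It relies only on the result order described earlier (highest score first).

```xquery
xquery version "1.0-ml";
declare namespace eld = "xdmp:encoding-language-detect";

(: Candidates are returned with the highest score first. :)
let $candidates := xdmp:encoding-language-detect(
  xdmp:document-get("c:/tmp/session-login.css"))
(: Assumption for this sketch: the content is known to be Italian or Spanish. :)
let $allowed := ("it", "es")
(: Keep only the allowed languages, then take the best-scoring candidate. :)
return ($candidates[eld:language = $allowed])[1]
```

The expression returns a single encoding-language element, or the empty sequence if neither language appears among the candidates.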

The language or encoding of a block of text is sometimes ambiguous, so detection is prone to error. As a rule, the larger the block of text, the higher the accuracy of the detection. If the block of text you pass to xdmp:encoding-language-detect is more than a few paragraphs (several hundred bytes), the detection is typically fairly accurate.
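One way to act on the score guidance is to load a document with the detected encoding only when the top candidate scores above 10. The sketch below is hedged: the path, target URI, and error name are hypothetical, and it assumes the file can be read as binary with xdmp:document-get so that non-UTF-8 bytes reach the detector without being decoded first.

```xquery
xquery version "1.0-ml";
declare namespace eld = "xdmp:encoding-language-detect";

(: Hypothetical file path, used only for illustration. :)
let $path := "c:/tmp/my-document.xml"
(: Assumption: read the raw bytes as binary so no decoding happens yet. :)
let $raw := xdmp:document-get($path,
  <options xmlns="xdmp:document-get">
    <format>binary</format>
  </options>)
(: The first candidate has the highest score. :)
let $best := xdmp:encoding-language-detect($raw)[1]
return
  if (xs:double($best/eld:score) gt 10)
  then
    (: Confident enough: transcode from the detected encoding to UTF-8. :)
    xdmp:document-load($path,
      <options xmlns="xdmp:document-load">
        <uri>/my-document.xml</uri>
        <encoding>{fn:string($best/eld:encoding)}</encoding>
      </options>)
  else
    (: Hypothetical error name; handle low-confidence results as you see fit. :)
    fn:error(xs:QName("LOW-CONFIDENCE"),
      "Encoding detection score is too low to trust")
```

The threshold of 10 comes from the guidance above; whether to reject, log, or fall back to a known encoding on a low score is an application decision.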

Specifying the Default Language for XML Documents

The formal or natural language of XML content is determined by the element attribute xml:lang. The language affects how MarkLogic Server tokenizes content, and therefore affects searching and indexing.
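As an illustration of how xml:lang scopes language, the following sketch inserts a small document in which the root element declares English and one child overrides it with French. The URI and element names are hypothetical; the inheritance of xml:lang by descendant elements is standard XML behavior.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI and content, used only for illustration. :)
xdmp:document-insert("/greetings.xml",
  <greetings xml:lang="en">
    <!-- Inherits xml:lang="en" from the root element. -->
    <greeting>Hello</greeting>
    <!-- Overrides the inherited language for this element only. -->
    <greeting xml:lang="fr">Bonjour</greeting>
  </greetings>)
```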

If an XML document has no explicit xml:lang attribute when it is loaded, MarkLogic Server uses the configured default language for the database. Set the database-wide default language through the language setting in the Admin UI.

You can override the configured default language for the database using load options, as shown by the table below:

| Interface | Method | For Details See |
|---|---|---|
| MarkLogic Connector for Hadoop | The configuration property mapreduce.marklogic.output.content.language. | MarkLogicConstants in the MarkLogic Hadoop MapReduce Connector API |
| MarkLogic Content Pump | -output_language command line option | Importing Content Into MarkLogic Server in the mlcp User Guide |
| XCC | ContentCreateOptions.setLanguage | Javadoc for XCC |
| XQuery | Set the <default-language> element of the <options> node passed to xdmp:document-load, xdmp:document-get, xdmp:http-get, or xdmp:zip-get. | XQuery and XSLT Reference Guide |
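For the XQuery row above, a minimal sketch of overriding the database default language at load time might look like the following. The path and URI are reused from the earlier encoding examples, and the language code fr is an arbitrary choice for illustration.

```xquery
xdmp:document-load("c:/tmp/my-document.xml",
  <options xmlns="xdmp:document-load">
    <uri>/my-document.xml</uri>
    <default-language>fr</default-language>
  </options>)
```

Content in the document that carries its own xml:lang attribute keeps that language; the default-language option applies only where no xml:lang is in effect.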

For details on languages, see Language Support in MarkLogic Server in the Search Developer's Guide.
