MarkLogic Server 11.0 Product Documentation
mlcp User Guide
— Chapter 4

Importing Content Into MarkLogic Server

You can use mlcp to insert content into a MarkLogic Server database from flat files, compressed ZIP and GZIP files, aggregate XML files, and MarkLogic Server database archives. The input data can be accessed from the native filesystem.

For a list of import related options, see Import Command Line Options.

Supported Input Format Summary

Use the -input_file_type option to tell mlcp the format of the data in each input file (or each entry inside a compressed file). This option controls if/how mlcp converts the content into database documents.

The default input type is documents, which means each input file or ZIP file entry creates one database document. All other input file types represent composite input formats which can yield multiple database documents per input file.

The following table provides a quick reference of the supported input file types, along with the allowed document types for each, and whether or not they can be passed to mlcp as compressed files.

-input_file_type | Document Type | -input_compressed permitted
documents | XML, JSON, text, or binary; controlled with -document_type. | Yes
archive | As in the database: XML, JSON, text, and/or binary documents, plus metadata. The type is not under user control. | No (archives are already in compressed format)
delimited_text | XML or JSON | Yes
delimited_json | JSON | Yes
sequencefile | XML, text or binary; controlled with the options -input_sequencefile_value_class and -input_sequencefile_value_type. | No. However, the contents can be compressed when you create the sequence file. Compression is bound up with the value class you use to generate and import the file.
aggregates | XML | Yes
rdf | Serialized RDF triples, in one of several formats. For details, see Supported RDF Triple Formats in the Semantic Graph Developer's Guide. RDF/JSON is not supported. | Yes
forest | As in the database: XML, JSON, text, and/or binary documents. The type is not under user control. | No

When the input file type is documents or sequencefile you must consider both the input format (-input_file_type) and the output document format (-document_type). In addition, for some input formats, input can come from either compressed or uncompressed files (-input_compressed).

The -document_type option controls the database document format when -input_file_type is documents or sequencefile. MarkLogic Server supports text, JSON, XML, and binary documents. If the document type is not explicitly set with these input file types, mlcp uses the input file suffix to determine the type. For details, see How mlcp Determines Document Type.

You cannot use mlcp to perform document conversions. Your input data should match the stated document type. For example, you cannot convert XML input into a JSON document just by setting -document_type json.

Understanding Input File Path Resolution

If you do not explicitly include a URI scheme prefix such as file: on the input file path, mlcp uses the following rules to locate the input path:

  • In local mode, mlcp defaults to the local file system (file).

The following example loads files from the local filesystem directory /space/bill/data:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -input_file_path /space/bill/data -mode local

Controlling Database URIs During Ingestion

By default, the document URIs created by mlcp during ingestion are determined by the input source. The tool supports several command line options for modifying this default behavior.

Default Document URI Construction

The default database URI assigned to ingested documents depends on the input source. Loading content from the local filesystem can create different URIs than loading the same content from a ZIP file or archive. To change this behavior with command line options, see Transforming the Default URI.

The following table summarizes the default behavior with several input sources:

Input Source | Default URI | Example
documents in a native directory | /path/filename. Note that on Windows, the device (c:) becomes a path step, so c:\path\file becomes /c:/path/file. | /space/data/bill/dream.xml or /c:/data/bill/dream.xml
documents in a ZIP or GZIP file | /compressed-file-path/path/inside/zip/filename | If the input file is /space/data/big.zip and it contains a directory entry bill/, then the document URI for dream.xml in that directory is: /space/data/big.zip/bill/dream.xml
a GZIP compressed document | /path/filename-without-gzip-suffix | If the input is /space/data/big.xml.gz, the result is /space/data/big.xml.
delimited text file | The value in the column used as the id. (The first column, by default). | For a record of the form first,second,third where Column 1 is the id: first
archive or forest | The document URI from the source database. |
sequence file | The key in a key-value pair |
aggregate XML, line delimited JSON | /path/filename-split_start-seqnum, where /path/filename is the full path to the input file, split_start is the byte position from the beginning of the split, and seqnum begins with 1 and increments for each document created. | For input file /space/data/big.xml: /space/data/big.xml-0-1, /space/data/big.xml-0-2. For input file /space/data/big.json: /space/data/big.json-0-1, /space/data/big.json-0-2.
RDF | A generated unique name | c7f92bccb4e2bfdc-0-100.xml

For example, the following command loads all files from the file system directory /space/bill/data into the database attached to the App Server on port 8000. The documents inserted into the database have URIs of the form /space/bill/data/filename.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -input_file_path /space/bill/data -mode local

If the /space/bill directory is zipped up into /space/bill.zip, such that bill/ is the root directory in the ZIP file, then the following command inserts documents with URIs of the form /space/bill.zip/bill/data/filename:

# Windows users, see Modifying the Example Commands for Windows 
$ cd /space; zip -r bill.zip bill
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -input_file_path /space/bill.zip \
    -mode local -input_compressed true

When you use the -generate_uri option to have mlcp generate URIs for you, the generated URIs follow the same pattern as for aggregate XML and line delimited JSON:

/path/filename-split_start-seqnum

The generated URIs are unique across a single import operation, but they are not globally unique. For example, if you repeatedly import data from some file /tmp/data.csv, the generated URIs will be the same each time (modulo differences in the number of documents inserted by the job).

Transforming the Default URI

Use the following options to tailor the database URI of inserted documents:

  • -output_uri_replace performs one or more string substitutions on the default URI.
  • -output_uri_prefix prepends a string to the URI after substitution.
  • -output_uri_suffix appends a string to the URI after substitution.

The -output_uri_replace option accepts a comma delimited list of regular expression and replacement string pairs. The string portion must be enclosed in single quotes:

-output_uri_replace pattern,'string',pattern,'string'

For details on the regular expression language supported by -output_uri_replace, see Regular Expression Syntax.

These options are applied after the default URI is constructed and encoded, so if the option values contain characters not allowed in a URI, you must encode them yourself. See Character Encoding of URIs.

The following example loads documents from the filesystem directory /space/bill/data. The default output URIs would be of the form /space/bill/data/filename. The example uses -output_uri_replace to replace bill/data with will and strip off /space/, and then adds a /plays prefix using -output_uri_prefix. The end result is output URIs of the form /plays/will/filename.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -input_file_path /space/bill/data -mode local \
    -output_uri_replace "/space,'',/bill/data/,'/will/'" \
    -output_uri_prefix /plays
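
To check a substitution list before running an import, you can approximate the rewrite with standard shell tools. The sketch below is not part of mlcp: preview_uri is a hypothetical helper, hamlet.xml is an assumed file name, and sed stands in for the regular expression engine mlcp uses. That stand-in is reasonable here only because both patterns in the example are literal strings.

```shell
# Hypothetical helper (not an mlcp feature): approximates the URI rewrite
# performed by the options in the example above.
preview_uri() {
    # Step 1: -output_uri_replace "/space,'',/bill/data/,'/will/'"
    #         (pairs applied in the order listed; assumes every occurrence
    #         of a pattern is replaced)
    # Step 2: -output_uri_prefix /plays (prepended after substitution)
    printf '%s\n' "$1" |
        sed -e 's#/space##g' -e 's#/bill/data/#/will/#g' -e 's#^#/plays#'
}

preview_uri /space/bill/data/hamlet.xml   # prints /plays/will/hamlet.xml
```

Running candidate default URIs through a helper like this makes it easier to spot a pattern that matches more or less than intended before any documents are inserted.
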

Character Encoding of URIs

If a URI constructed by mlcp contains special characters that are not allowed in URIs, mlcp automatically encodes them. This applies to the following special characters: space, %, ?, and #. For example, foo bar.xml becomes foo%20bar.xml.

If you supply a URI or URI component, you are responsible for ensuring the result is a legitimate URI. No automatic encoding takes place. This applies to -output_uri_replace, -output_uri_prefix, and -output_uri_suffix. The changes implied by these options are applied after mlcp encodes the default URI.
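
Because mlcp does not encode the values you pass to the URI options, you may want to encode them yourself before building the command line. The sketch below is an illustration only: encode_uri is a hypothetical helper, and it assumes the standard percent-encoded forms (%25, %20, %3F, %23) for the four characters listed above. This guide shows only the space case.

```shell
# Hypothetical helper: percent-encode the four characters named above.
# "%" is handled first so that escapes added by later rules are not
# encoded a second time.
encode_uri() {
    printf '%s\n' "$1" |
        sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/?/%3F/g' -e 's/#/%23/g'
}

encode_uri 'foo bar.xml'    # prints foo%20bar.xml
encode_uri '/my plays/'     # prints /my%20plays/
```

The encoded result could then be supplied as the value of an option such as -output_uri_prefix.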

When mlcp exports documents from the database to the file system such that the output directory and/or file names are derived from the document URI, the special symbols are decoded. That is, foo%20bar.xml becomes foo bar.xml when exported. For details, see How URI Decoding Affects Output File Names.

How mlcp Determines Document Type

The document type determines what kind of database document mlcp inserts from input content: Text, XML, JSON, or binary. Document type is determined in the following ways:

  • Document type can be inherent in the input file type. For example, aggregates and rdf input files always insert XML documents. For details, see Supported Input Format Summary.
  • You can specify a document type explicitly with -document_type. For example, to load documents as XML, use -input_file_type documents -document_type xml. You cannot set an explicit type for all input file types.
  • mlcp can determine document type dynamically from the output document URI and the MarkLogic Server MIME type mappings when you use -input_file_type documents -document_type mixed.

If you set -document_type to an explicit type such as -document_type json, then mlcp inserts all documents as that type.

If you use -document_type mixed, then mlcp determines the document type from the output URI suffix and the MIME type mapping configured into MarkLogic Server. Mixed is the default behavior for -input_file_type documents.

You can only use -document_type mixed when the input file type is documents.

If an unrecognized or unmapped file extension is encountered when loading mixed documents, mlcp creates a binary document.

The following table contains examples of applying the default MIME type mappings to output URIs with various file extensions, an unknown extension, and no extension. The default mapping includes many additional suffixes. You can examine and create MIME type mappings under the Mimetypes section of the Admin Interface. For more information, see Implicitly Setting the Format Based on the MIME Type in the Loading Content Into MarkLogic Server Guide.

URI | Document Type
/path/doc.xml | XML
/path/doc.json | JSON
/path/doc.jpg | binary
/path/doc.txt | text
/path/doc.unknown | binary
/path/doc-nosuffix | binary

The MIME type mapping is applied to the final output URI. That is, the URI that results from applying the URI transformation options described in Controlling Database URIs During Ingestion. The following table contains examples of how URI transformations can affect the output document type in mixed mode, assuming the default MIME type mappings.

Input Filename | URI Options | Output URI | Doc Type
/path/doc.1 | None | /path/doc.1 | binary
/path/doc.1 | Add a .xml suffix: -output_uri_suffix ".xml" | /path/doc.1.xml | XML
/path/doc.1 | Replace the unmapped suffix with .txt: -output_uri_replace "\.\d+,'.txt'" | /path/doc.txt | text
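
The suffix rule can be sketched as a small shell function. This is an illustration, not mlcp code: doc_type_for_uri is a hypothetical helper that knows only the suffixes shown in the tables above, whereas the real decision uses the MIME type mappings configured in MarkLogic Server, which cover many more suffixes.

```shell
# Hypothetical helper: mimics mixed-mode type selection for the suffixes
# shown in the tables above. It inspects the final output URI only.
doc_type_for_uri() {
    case "$1" in
        *.xml)  echo XML ;;
        *.json) echo JSON ;;
        *.txt)  echo text ;;
        *)      echo binary ;;   # .jpg, an unmapped suffix, or no suffix
    esac
}

doc_type_for_uri /path/doc.1        # prints binary
doc_type_for_uri /path/doc.1.xml    # prints XML (".xml" appended to /path/doc.1)
doc_type_for_uri /path/doc.txt      # prints text
```

Note that the function never looks at the input file name, which is why a URI option that changes the suffix also changes the resulting document type.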

Document type determination is completed prior to invoking server side transformations. If you change the document type in a transformation function, you are responsible for changing the output document to match. For details, see Transforming Content During Ingestion.

Loading Documents from a Directory

This section discusses importing documents stored as flat files on the native filesystem.

Loading a Single File

Use the following procedure to load a single file from the native filesystem. To load every file in a directory, see Loading All the Files in a Directory.

  1. Set -input_file_path to the path to the input file.
  2. Set -input_file_type if your input file is not of type documents; for example, if you are loading a delimited text file, sequence file, aggregate XML file, RDF triples file, or database archive.
  3. Set -document_type if -input_file_type is documents and the content type cannot be accurately deduced from the file suffix as described in How mlcp Determines Document Type.
  4. Set -mode:
    • To perform the work locally, set -mode to local.

By default, the imported document has a database URI based on the input file path. For details, see Controlling Database URIs During Ingestion.

The following example command loads a single XML file:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_path /space/bill/data/hamlet.xml

Loading All the Files in a Directory

Use the following procedure to load all the files in a native directory and its sub-directories. To load selected files, see Filtering Documents Loaded From a Directory.

  1. Set -input_file_path to the input directory.
  2. Set -input_file_type if your input files are not of type documents; for example, if you are loading delimited text files, sequence files, aggregate XML files, or database archives.
  3. Set -document_type if -input_file_type is documents and the content type cannot be accurately deduced from the file suffixes as described in How mlcp Determines Document Type.
  4. Set -mode:
    • To perform the work locally, set -mode to local.

By default, the imported documents have database URIs based on the input file path. For details, see Controlling Database URIs During Ingestion.

The following example command loads all the files in /space/bill/data:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_path /space/bill/data

Filtering Documents Loaded From a Directory

If -input_file_path names a directory, mlcp loads all the documents in the input directory and subdirectories by default. Use the -input_file_pattern option to filter the loaded documents based on a regular expression.

Input document filtering is handled differently for -input_file_type forest. For details, see Filtering Forest Contents.

For example, the following command loads only files with a .xml suffix from the directory /space/bill/data:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_path /space/bill/data \
    -mode local -input_file_pattern '.*\.xml'

The mlcp tool uses Java regular expression syntax. For details, see Regular Expression Syntax.

Loading Documents From Compressed Files

You can load content from one or more compressed files. Filtering of compressed file content is not supported; mlcp loads all documents in a compressed file.

Follow this procedure to load content from one or more ZIP or GZIP compressed files.

  1. Set -input_file_path:
    • To load from a single file, set -input_file_path to the path to the compressed file.
    • To load from multiple files, set -input_file_path to a directory containing the compressed files.
  2. If the content type cannot be accurately deduced from suffixes of the files inside the compressed file as described in How mlcp Determines Document Type, set -document_type appropriately.
  3. Set -input_compressed to true.
  4. If the compressed file suffix is not .zip or .gzip, specify the compressed file format by setting -input_compression_codec to zip or gzip.

If you set -document_type to anything but mixed, then the contents of the compressed file must be homogeneous: for example, all XML, all JSON, or all binary.

The following example command loads binary documents from the compressed file /space/images.zip on the local filesystem.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -document_type binary \
    -input_file_path /space/images.zip -input_compressed true

The following example loads all the files in the compressed file /space/example.jar, using -input_compression_codec to tell mlcp the compression format because of the .jar suffix:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -mode local -input_file_path /space/example.jar \
    -input_compressed true -input_compression_codec zip

If -input_file_path is a directory, mlcp loads contents from all compressed files in the input directory, recursing through subdirectories. The input directory must not contain other kinds of files.

By default, the URI prefix on documents loaded from a compressed file includes the full path to the input compressed file and mirrors the directory hierarchy inside the compressed file. For example, if a ZIP file /space/shakespeare.zip contains bill/data/dream.xml then the ingested document URI is /space/shakespeare.zip/bill/data/dream.xml. To override this behavior, see Controlling Database URIs During Ingestion.

Loading Content and Metadata From an Archive

Follow this procedure to import content and metadata from a database archive created by the mlcp export command. A database archive is stored in one or more compressed files that contain documents and metadata.

  1. Set -input_file_path:
    • To load a single archive file, set -input_file_path to that file.
    • To load multiple archive files, set -input_file_path to a directory containing the compressed archive files.
  2. Set -document_type to mixed, or leave it unset since mixed is the default setting.
  3. Set -input_compressed to true.
  4. Set -input_file_type to archive.
  5. If the input archive was created without any metadata, set -archive_metadata_optional to true. If this is not set, an exception is thrown if the archive contains no metadata.
  6. If you want to exclude some or all of the document metadata in the archive:
    • Set -copy_collections to false to exclude document collections metadata.
    • Set -copy_permissions to false to exclude document permissions metadata.
    • Set -copy_properties to false to exclude document properties.
    • Set -copy_quality to false to exclude document quality metadata.
    • Set -copy_metadata to false to exclude key-value metadata.

An archive is assumed to contain metadata. However, it is possible to create archives without metadata by setting all the metadata copying options (-copy_collections, -copy_permissions, etc.) to false during export. If an archive does not contain metadata, you must set -archive_metadata_optional to tell mlcp to proceed in the absence of metadata.

When you import properties from an archive, you should disable the maintain last modified configuration option on the destination database during the import. Otherwise, you can get an XDMP-SPECIALPROP error if the import operation tries to update the last modified property. To disable this setting, use the Admin Interface or the library function admin:set-maintain-last-modified.

The following example command loads the database archive in /space/archive_dir:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -input_file_type archive \
    -input_file_path /space/archive_dir

Splitting Large XML Files Into Multiple Documents

Very large XML files often contain aggregate data that can be disaggregated by splitting it into multiple smaller documents rooted at a recurring element. Disaggregating large XML files consumes fewer resources during loading and improves performance when searching and retrieving content. For aggregate JSON handling, see Creating Documents from Line-Delimited JSON Files.

The following mlcp options support creating multiple documents from aggregate data:

  • -aggregate_record_element
  • -uri_id
  • -aggregate_record_namespace

You can disaggregate XML when loading from either flat or compressed files. For more information about working with compressed files, see Loading Documents From Compressed Files.

Follow this procedure to create documents from aggregate XML input:

  1. Set -input_file_path:
    • To load from a single file, set -input_file_path to the path to the aggregate XML file.
    • To load from multiple files, set -input_file_path to a directory containing the aggregate files. The directory must not contain other kinds of files.
  2. If you are loading from a compressed file, set -input_compressed to true.
  3. Set -input_file_type to aggregates.
  4. Set -aggregate_record_element to the element QName of the node to use as the root for all inserted documents. See the example below. The default is the first child element under the root element.

    The element QName should appear at only one level. You cannot specify the element name using a path, so disaggregation occurs everywhere that name is found.

  5. Optionally, override the default document URI by setting -uri_id to the name of the element from which to derive the document URI.
  6. If the aggregate record element is in a namespace, set -aggregate_record_namespace to the input namespace.

The default URI is hashcode-seqnum in local mode. If multiple elements within a record match the name given in -uri_id, the first match is used.

If your aggregate URI ids are not unique, one document in your input set can overwrite another. Importing documents with non-unique URI ids from multiple threads can also cause deadlocks.

The example below uses the following input data:

$ cat > example.xml
<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person>
    <first>George</first>
    <last>Washington</last>
  </person>
  <person>
    <first>Betsy</first>
    <last>Ross</last>
  </person>
</people>

The following command breaks the input data into a document for each <person> element. The -uri_id and other URI options give the inserted documents meaningful names. The command creates URIs of the form /people/lastname.xml by using the <last/> element as the aggregate URI id, along with an output prefix and suffix:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -input_file_path example.xml \
    -input_file_type aggregates -aggregate_record_element person \
    -uri_id last -output_uri_prefix /people/ \
    -output_uri_suffix .xml

The command creates two documents: /people/Washington.xml and /people/Ross.xml. For example, /people/Washington.xml contains:

<?xml version="1.0" encoding="UTF-8"?>
<person>
    <first>George</first>
    <last>Washington</last>
</person>

If the input data is in a namespace, set -aggregate_record_namespace to that namespace. For example, if the input data is modified to include a namespace:

$ cat > example.xml
<?xml version="1.0" encoding="UTF-8"?>
<people xmlns="http://marklogic.com/examples">...</people>

Then mlcp ingests no documents unless you set -aggregate_record_namespace. Setting the namespace creates two documents in the namespace http://marklogic.com/examples. For example, after running the following command:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -input_file_path example.xml \
    -input_file_type aggregates -aggregate_record_element person \
    -uri_id last -output_uri_prefix /people/ \
    -output_uri_suffix .xml \
    -aggregate_record_namespace "http://marklogic.com/examples"

The document with URI /people/Washington.xml contains:

<?xml version="1.0" encoding="UTF-8"?>
<person xmlns="http://marklogic.com/examples">
    <first>George</first>
    <last>Washington</last>
</person>

Creating Documents from Delimited Text Files

Use the delimited_text input file type to import content from a delimited text file and create an XML or JSON document corresponding to each line. For line-delimited JSON data, see Creating Documents from Line-Delimited JSON Files.

The following options are commonly used in the generation of documents from delimited text files:

  • -input_file_type delimited_text
  • -document_type xml or -document_type json
  • -delimiter
  • -uri_id
  • -delimited_root_name (XML output only)
  • -data_type (JSON output only)

Example: Generating Documents From a CSV File

When you import content from delimited text files, mlcp creates an XML or JSON document for each line of input after the initial header line.

The default document type is XML. To create JSON documents, use -document_type json.

When creating XML documents, each document has a root node of <root> and child elements with names corresponding to each column title. You can override the default root element name using the -delimited_root_name option; for details, see Customizing XML Output.

When creating JSON documents, each document is rooted at an unnamed object containing JSON properties with names corresponding to each column title. By default, the values for JSON are always strings. Use -data_type to override this behavior; for details, see Controlling Data Type in JSON Output.

For example, if you have the following data and mlcp command:

# Windows users, see Modifying the Example Commands for Windows 
$ cat example.csv
first,last
george,washington
betsy,ross
$ mlcp.sh ... -mode local -input_file_path /space/mlcp/data \
    -input_file_type delimited_text ...

Then mlcp creates the XML output shown in the table below. To generate the JSON output, add -document_type json to the mlcp command line.

XML Output:

<root>
  <first>george</first>
  <last>washington</last>
</root>

<root>
  <first>betsy</first>
  <last>ross</last>
</root>

JSON Output:

{
  "first": "george",
  "last": "washington"
}

{
  "first": "betsy",
  "last": "ross"
}

Expected Input Format

A delimited text input file must have the following characteristics:

  • The first line in the input file contains column names that are used to create the XML element or JSON property names of each document created from the file.
  • The same delimiter is used to separate each value, as well as the column names. The default separator is a comma. Use -delimiter to override it; for details, see Specifying the Field Delimiter.
  • Every line has the same number of fields (values). Empty fields are represented as two delimiters in a row, such as a,b,,d.

For example, the following data meets the input format requirements:

first,last
george,washington
betsy,ross

This data produces documents with XML elements or JSON properties named first and last.
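
The relationship between the header line and the generated documents can be previewed without a server. The following sketch is not mlcp itself: preview_xml is a hypothetical helper built on awk, and it handles only simple comma-separated values with no quoted fields and no escaping of XML special characters.

```shell
cd "$(mktemp -d)"    # work in a scratch directory

# Sample input that meets the format requirements above.
cat > example.csv <<'EOF'
first,last
george,washington
betsy,ross
EOF

# Hypothetical helper: the header row supplies the element names, and
# every later row becomes one <root> document.
preview_xml() {
    awk -F, '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        {
            print "<root>"
            for (i = 1; i <= NF; i++)
                printf "  <%s>%s</%s>\n", name[i], $i, name[i]
            print "</root>"
        }
    ' "$1"
}

preview_xml example.csv
```

The helper prints one <root> element per data row, each with <first> and <last> children taken from the header names.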

Customizing XML Output

When creating XML documents, each document has a root node of <root> and child elements with names corresponding to each column title. You can override the default root element name using the -delimited_root_name option. You can use the -namespace option to specify a root namespace.

The following example produces documents with root element <person> in the namespace http://my.namespace.

$ mlcp.sh ... -mode local -input_file_path /space/mlcp/data \
    -input_file_type delimited_text -namespace http://my.namespace \
    -delimited_root_name person
...
<person xmlns="http://my.namespace">
  <first>george</first>
  <last>washington</last>
</person>
...

Controlling Data Type in JSON Output

When creating JSON documents, the default value type is string. You can use the -data_type option to specify explicit data types for some or all columns. The option accepts a comma-separated list of columnName,typeName pairs, where typeName is one of number, boolean, or string.

For example, if you have an input file called catalog.csv that looks like the following:

id,price,in-stock
12345,8.99,true
67890,2.00,false

Then the default output documents look similar to the following. Notice that all the property values are strings.

{ "id": "12345",
  "price": "8.99",
  "in-stock": "true"
}

The following example command uses the -data_type option to make the price property a number value and the in-stock property a boolean value. Since the id field is not specified in the -data_type option, it remains a string.

$ mlcp.sh ... -mode local -input_file_path catalog.csv \
    -input_file_type delimited_text -document_type json \
    -data_type "price,number,in-stock,boolean"
...
{ "id": "12345",
  "price": 8.99,
  "in-stock": true
}
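
The effect of the columnName,typeName pairs can be sketched with awk. This is a rough local preview under stated assumptions, not mlcp's implementation: preview_json is a hypothetical helper, the input has no spaces around the commas, values are not validated or escaped, and each document is printed on one line.

```shell
cd "$(mktemp -d)"    # work in a scratch directory

cat > catalog.csv <<'EOF'
id,price,in-stock
12345,8.99,true
67890,2.00,false
EOF

# Hypothetical helper. Usage: preview_json FILE "col,type,col,type"
# Columns typed number or boolean are emitted unquoted; all other
# columns stay strings, which matches the default described above.
preview_json() {
    awk -F, -v types="$2" '
        BEGIN {
            n = split(types, t, ",")
            for (i = 1; i < n; i += 2) kind[t[i]] = t[i + 1]
        }
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        {
            line = "{"
            for (i = 1; i <= NF; i++) {
                k = kind[name[i]]
                v = (k == "number" || k == "boolean") ? $i : "\"" $i "\""
                line = line sprintf("%s\"%s\": %s", (i > 1 ? ", " : " "), name[i], v)
            }
            print line " }"
        }
    ' "$1"
}

preview_json catalog.csv "price,number,in-stock,boolean"
```

Passing an empty type list, as in preview_json catalog.csv "", leaves every value quoted, which corresponds to the default all-string output.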

Controlling the Output Document URI

By default, the document URIs use the value in the first column. For example, if your input data looks like the following:

first,last
george,washington
betsy,ross

Then importing this data with no URI related options creates two documents whose URIs are the values in the first column: george and betsy.

Use -uri_id to choose a different column, or -generate_uri to have mlcp automatically generate a unique URI for each document. For example, the following command creates documents with the URIs washington and ross:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh ... -mode local -input_file_path /space/mlcp/data \
    -input_file_type delimited_text -uri_id last

Note that URIs generated with -generate_uri are only guaranteed to be unique across your import operation. For details, see Default Document URI Construction.

You can further tailor the URIs using -output_uri_prefix and -output_uri_suffix. These options apply even when you use -generate_uri. For details, see Controlling Database URIs During Ingestion.

If your URI ids are not unique, one document in your input set can overwrite another. Importing documents with non-unique URI ids from multiple threads can also cause deadlocks.

Specifying the Field Delimiter

The default delimiter between fields is a comma (,). You can override this using the -delimiter option. If the delimiter is a character that is difficult to specify on the command line, specify the delimiter in an options file instead. For details, see Options File Syntax.

For example, the Linux bash shell parser makes it difficult to specify a tab delimiter on the command line, so you can put the options in a file instead. In the example options file below, the string literal after -delimiter should contain a tab character.

$ cat delim.opt
-input_file_type
delimited_text
-delimiter
"tab"
$ mlcp.sh import ... -mode local -input_file_path /space/mlcp/data \
    -options_file delim.opt
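
One way to get a literal tab into the options file without relying on your editor is to generate the file with printf. This is a sketch: the quoted-delimiter layout simply follows the example above, and the scratch directory is an assumption made so the example does not overwrite an existing delim.opt.

```shell
cd "$(mktemp -d)"    # work in a scratch directory

# Write the option names and values one per line, as in the example above.
printf '%s\n' '-input_file_type' 'delimited_text' '-delimiter' > delim.opt

# printf expands \t in its format string, so this line holds a real tab
# character between the double quotes.
printf '"\t"\n' >> delim.opt

# Confirm that the file contains a literal tab.
grep -q "$(printf '\t')" delim.opt && echo "tab delimiter written"
```

The resulting file can then be passed to mlcp with -options_file delim.opt, as in the command shown above.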

Optimizing Ingestion of Large Files

If your delimited text files are very large, consider using the -split_input option. When this option is true, mlcp attempts to break each input file into multiple splits, enabling more documents to be loaded in parallel. For details, see Improving Throughput with -split_input.

Creating Documents from Line-Delimited JSON Files

Use the delimited_json input file type to import content from a line-delimited JSON file and create a JSON document corresponding to each line.

To create JSON documents from delimited text files such as CSV files, see Creating Documents from Delimited Text Files. For aggregate XML input, see Splitting Large XML Files Into Multiple Documents.

Line-Delimited JSON Overview

A line-delimited JSON file is a type of aggregate file where each line is a self-contained piece of JSON data, such as an object or array.

Usually, each line of input has similar structure, such as the following:

{"id": "12345","price":8.99, "in-stock": true}
{"id": "67890","price":2.00, "in-stock": false}

However, the JSON data on each line is independent of the other lines, so the lines do not have to contain JSON data of the same shape. For example, the following is a valid input file:

{"first": "george", "last": "washington"}
{"id": 12345, "price": 8.99, "in-stock": true}

Given the input shown below, the following command creates 2 JSON documents. Each document contains the data from a single line of input.

$ cat example.json
{"id": "12345","price":8.99, "in-stock": true}
{"id": "67890","price":2.00, "in-stock": false}

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -input_file_path example.json \
    -input_file_type delimited_json

The example command creates documents whose contents precisely mirror each line of input:

{"id": "12345","price":8.99, "in-stock": true}

{"id": "67890","price":2.00, "in-stock": false}
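
The one-document-per-line behavior can be pictured with a few lines of JavaScript. This is only an illustration of the input format, not how mlcp itself parses files; the function name is hypothetical.

```javascript
// Split line-delimited JSON text into one parsed value per non-empty line.
// Each line is parsed on its own, so lines need not share a shape.
function parseLineDelimitedJson(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

const input = [
  '{"id": "12345","price":8.99, "in-stock": true}',
  '{"id": "67890","price":2.00, "in-stock": false}',
].join("\n");

const docs = parseLineDelimitedJson(input);
console.log(docs.length); // 2
console.log(docs[0].id);  // 12345
```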

Controlling the Output Document URI

The default document URI is generated from the input file name, the split number, and a sequence number within the split, as described in Default Document URI Construction. For example, if the input file absolute path is /space/data/example.json, then the default output document URIs have the following form:

/space/data/example.json-0-1
/space/data/example.json-0-2
...

You can base the URI on values in the content instead by using the -uri_id option to specify the name of a property found in the data. You can further tailor the URIs using -output_uri_prefix and -output_uri_suffix. For details, see Controlling Database URIs During Ingestion.

For example, the following command uses the value in the id field as the base of the URI and uses -output_uri_suffix to add a .json suffix to the URIs:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh ... -mode local -input_file_path /space/data/example.json \
    -input_file_type delimited_json \
    -uri_id id -output_uri_suffix ".json"

Given these options, an input line of the form shown below produces a document with the URI 12345.json instead of /space/data/example.json-0-1.

{"id": "12345","price":8.99, "in-stock": true}

If the property name specified with -uri_id occurs more than once in a record, mlcp uses the first occurrence found in a breadth-first search. The value of the specified property should be a valid number or string.

If you use -uri_id, any records (lines) that do not contain the named property are skipped. If the property is found but the value is null or not a number or string, the record is skipped.
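
The URI rules in this section can be summarized in a small model. The sketch below is not mlcp code: the function names and the option object are hypothetical, and applying the prefix and suffix to default URIs as well as to -uri_id URIs is an assumption.

```javascript
// Find the first property called `name` using a breadth-first search.
// Returns undefined when the property does not occur anywhere in the record.
function findFirstBreadthFirst(root, name) {
  const queue = [root];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === null || typeof node !== "object") continue;
    if (!Array.isArray(node) && Object.prototype.hasOwnProperty.call(node, name)) {
      return node[name];
    }
    for (const child of Object.values(node)) queue.push(child);
  }
  return undefined;
}

// Model of URI selection for one record. Returns null when the record is skipped.
function documentUri(record, opts) {
  const { inputPath, split, seq, uriId, prefix = "", suffix = "" } = opts;
  if (uriId === undefined) {
    // Default form: input path, split number, sequence number within the split.
    return prefix + `${inputPath}-${split}-${seq}` + suffix;
  }
  const value = findFirstBreadthFirst(record, uriId);
  // Missing, null, or non-string/non-number values cause the record to be skipped.
  if (typeof value !== "string" && typeof value !== "number") return null;
  return prefix + String(value) + suffix;
}

const record = { id: "12345", price: 8.99, "in-stock": true };
const base = { inputPath: "/space/data/example.json", split: 0, seq: 1 };

console.log(documentUri(record, base));                                    // /space/data/example.json-0-1
console.log(documentUri(record, { ...base, uriId: "id", suffix: ".json" })); // 12345.json
console.log(documentUri({ id: null }, { ...base, uriId: "id" }));          // null
```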

Loading Triples

This section provides a brief overview of loading semantic data into MarkLogic Server. For more details, see the Semantic Graph Developer's Guide.

Basics of Triple Loading

To load semantic triples, use -input_file_type rdf and follow the instructions for loading a single file, all files in a directory, or a compressed file. For example, the following command loads triples files from the directory /my/data.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -username user -password password -host localhost \
    -port 8000 -input_file_path /my/data -mode local \
    -input_file_type rdf 

You can use mlcp to load triples files in several formats, including RDF/XML, Turtle, and N-Quads. For a full list of supported formats, see Supported RDF Triple Formats in the Semantic Graph Developer's Guide.

Each time you load triples from a file, mlcp inserts new documents into the database. That is, multiple loads of the same input insert new triples each time, rather than overwriting existing ones. Only the XQuery and REST APIs allow you to replace triples.

Load triples data embedded within other content according to the instructions for the enclosing input file type, rather than with -input_file_type rdf. For example, if you have an XML input document that happens to have some triples embedded in it, load the document using -input_file_type documents.

You cannot combine loading triples files with other input file types.

If you do not include any graph selection options in your mlcp command, Quads are loaded into the graph specified in the data. Quads with no explicit graph specification and other kinds of triple data are loaded into the default graph. You can change this behavior with options. For details, see Graph Selection When Loading Quads or Graph Selection for Other Triple Types.

For details, see Loading Triples with mlcp in the Semantic Graph Developer's Guide.

Graph Selection When Loading Quads

When loading quads, you can use the following command line options to control the graph into which your quads are loaded:

  • -output_graph
  • -output_override_graph
  • -output_collections

You can use -output_collections by itself or with the other two options. You cannot use -output_graph and -output_override_graph together.

If your semantic data is not in a quad format like N-Quads, see Graph Selection for Other Triple Types.

Quads interact with these options differently than other triple formats because quads can include a graph IRI in each quad. The following table summarizes the effect of various option combinations when importing quads with mlcp:

Graph Options Description
none For quads that contain an explicit graph IRI, load the triple into that graph. For quads with no explicit graph IRI, load the triple into the default graph. The default graph URI is http://marklogic.com/semantics#default-graph.
-output_graph For quads that contain an explicit graph IRI, load the triple into that graph. For quads with no explicit graph IRI, load the triple into the graph specified by -output_graph.
-output_override_graph Load all triples into the graph specified by -output_override_graph. This graph overrides any graph IRIs contained in the quads.
-output_collections Similar to -output_override_graph, but you can specify multiple collections. Load triples into the graph specified as the first (or only) collection; also add triples to any additional collections on the list. This overrides any graph IRIs contained in the quads.
-output_graph with -output_collections For quads that contain an explicit graph IRI, load the triple into that graph. For quads with no explicit graph IRI, load the triple into the graph specified by -output_graph. Also add triples to the specified collections.
-output_override_graph with -output_collections Load all triples into the graph specified by -output_override_graph. This graph overrides any graph IRIs contained in the quads. Also add triples to the specified collections.

For more details, see Loading Triples with mlcp in the Semantic Graph Developer's Guide.

For example, suppose you load the following N-Quad data with mlcp. There are 3 quads in the data set. The first and last quads include a graph IRI; the second quad does not.

<http://one.example/subject1> <http://one.example/predicate1>
    <http://one.example/object1> <http://example.org/graph3> .
_:subject1 <http://an.example/predicate1> "object1"  .
_:subject2 <http://an.example/predicate2> "object2"
    <http://example.org/graph5> .

If you use a command similar to the following to load the data:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -username user -password password -host localhost \
    -port 8000 -input_file_path /my/data.nq -mode local \
    -input_file_type rdf 

Then the table below illustrates how the various graph related options affect how the triples are loaded into the database:

Graph Options Result
none Graphs: http://example.org/graph3, http://marklogic.com/semantics#default-graph, http://example.org/graph5
-output_graph /my/graph Graphs: http://example.org/graph3, /my/graph, http://example.org/graph5
-output_override_graph /my/graph Graphs: /my/graph for all triples
-output_collections "aa,bb,cc" Graphs: aa for all triples; all triples are also added to collections bb and cc
-output_graph /my/graph -output_collections "bb,cc" Graphs: http://example.org/graph3, /my/graph, http://example.org/graph5; all triples are also added to collections bb and cc
-output_override_graph /my/graph -output_collections "bb,cc" Graphs: /my/graph for all triples; all triples are also added to collections bb and cc
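
The option combinations for quads can also be read as a decision procedure. The following JavaScript sketch models the tables in this section; it is not mlcp code, and the function and property names are hypothetical stand-ins for the command line options.

```javascript
const DEFAULT_GRAPH = "http://marklogic.com/semantics#default-graph";

// quadGraph: the graph IRI carried by the quad, or undefined if it has none.
// opts: { outputGraph, outputOverrideGraph, outputCollections } stand in for
// -output_graph, -output_override_graph, and -output_collections.
function selectGraph(quadGraph, opts = {}) {
  const { outputGraph, outputOverrideGraph, outputCollections = [] } = opts;
  if (outputGraph !== undefined && outputOverrideGraph !== undefined) {
    throw new Error("-output_graph and -output_override_graph cannot be used together");
  }
  if (outputOverrideGraph !== undefined) {
    // The override graph wins over any graph IRI in the quad.
    return { graph: outputOverrideGraph, collections: outputCollections };
  }
  if (outputGraph !== undefined) {
    // Only quads without their own graph IRI fall back to -output_graph.
    return { graph: quadGraph ?? outputGraph, collections: outputCollections };
  }
  if (outputCollections.length > 0) {
    // With collections alone, the first collection acts as the graph.
    const [first, ...rest] = outputCollections;
    return { graph: first, collections: rest };
  }
  return { graph: quadGraph ?? DEFAULT_GRAPH, collections: [] };
}

// Graph IRIs of the three example quads; the second quad has none.
const quadGraphs = ["http://example.org/graph3", undefined, "http://example.org/graph5"];

console.log(quadGraphs.map((g) => selectGraph(g, { outputGraph: "/my/graph" }).graph));
// [ 'http://example.org/graph3', '/my/graph', 'http://example.org/graph5' ]
console.log(selectGraph(quadGraphs[0], { outputCollections: ["aa", "bb", "cc"] }));
// { graph: 'aa', collections: [ 'bb', 'cc' ] }
```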

Graph Selection for Other Triple Types

When loading triples (rather than quads), you can use the following command line options to control the graph into which your triples are loaded:

  • -output_graph
  • -output_collections

The following table summarizes the effect of various option combinations when importing triples with mlcp. For quads, see Graph Selection When Loading Quads.

Graph Options Description
none Load triples into the default graph (http://marklogic.com/semantics#default-graph).
-output_graph Load triples into the specified graph.
-output_collections Load triples into the graph specified as the first (or only) collection; also add triples to any additional collections on the list.
-output_graph with -output_collections Load triples into the graph specified by -output_graph and also add them to the specified collections.

For more details, see Loading Triples with mlcp in the Semantic Graph Developer's Guide.

For example, if you use a command similar to the following to load triples data:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -username user -password password -host localhost \
    -port 8000 -input_file_path /my/data.nt -mode local \
    -input_file_type rdf

Then the table below illustrates how the various graph related options affect how the triples are loaded into the database:

Graph Options Result
none Graph: http://marklogic.com/semantics#default-graph
-output_graph /my/graph Graph: /my/graph
-output_collections "aa,bb,cc" Graph: aa; all triples are also added to collections bb and cc
-output_graph /my/graph -output_collections "bb,cc" Graph: /my/graph; all triples are also added to collections bb and cc

Loading Documents from a Forest With Direct Access

Direct Access enables you to extract documents directly from an offline or read-only forest without using a MarkLogic Server instance for input. Direct Access is primarily intended for accessing archived data that is part of a tiered storage deployment.

For details, see Importing Documents from a Forest into a Database.

Performance Considerations for Loading Documents

MarkLogic Content Pump comes configured with defaults that should provide good performance under most circumstances. This section presents some performance tradeoffs to consider if you want to try to optimize throughput for your workload.

Time vs. Space: Configuring Batch and Transaction Size

You can tune the document insertion throughput and memory requirements of your job by configuring the batch size and transaction size of the job.

  • -batch_size controls the number of updates per request to the server.
  • -transaction_size controls the number of requests to the server per transaction.

The default batch size is 100 and the maximum batch size is 200, although some options can affect the default. The default transaction size is 1 and the maximum transaction size is 4000/actualBatchSize. The number of updates per transaction is the batch size multiplied by the transaction size, so a job that uses both defaults performs 100 updates per transaction, and no combination of settings can exceed 4000 updates per transaction.
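
The relationship between the two options can be worked through with a short calculation. This JavaScript sketch assumes that updates per transaction equal batch size times transaction size and that 4000/actualBatchSize rounds down; the function names are hypothetical, and rejecting out-of-range values with an exception is a modeling choice, not a description of how mlcp reacts.

```javascript
const MAX_BATCH_SIZE = 200;
const MAX_UPDATES_PER_TRANSACTION = 4000;

// Largest transaction size allowed for a given batch size.
function maxTransactionSize(batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1 || batchSize > MAX_BATCH_SIZE) {
    throw new RangeError(`batch size must be an integer from 1 to ${MAX_BATCH_SIZE}`);
  }
  // Assumption: the division rounds down to a whole number of requests.
  return Math.floor(MAX_UPDATES_PER_TRANSACTION / batchSize);
}

// Updates carried by one transaction; defaults mirror the documented defaults.
function updatesPerTransaction(batchSize = 100, transactionSize = 1) {
  const limit = maxTransactionSize(batchSize);
  if (!Number.isInteger(transactionSize) || transactionSize < 1 || transactionSize > limit) {
    throw new RangeError(`transaction size must be an integer from 1 to ${limit}`);
  }
  return batchSize * transactionSize;
}

console.log(maxTransactionSize(100));        // 40
console.log(maxTransactionSize(200));        // 20
console.log(updatesPerTransaction());        // 100
console.log(updatesPerTransaction(200, 20)); // 4000
```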

Selecting a batch size is a speed vs. memory tradeoff. Each request to the server introduces overhead because extra work must be done. However, unless you use -streaming or -document_type mixed, all the updates in a batch stay in memory until a request is sent, so larger batches consume more memory.

Transactions introduce overhead on MarkLogic Server, so performing multiple updates per transaction can improve insertion throughput. However, an open transaction holds locks on fragments with pending updates, potentially increasing lock contention and affecting overall application performance.

It is also possible to overwhelm MarkLogic Server if you have too many concurrent sessions active.

Time vs. Correctness: Understanding -fastload Tradeoffs

The -fastload option can significantly speed up ingestion during import and copy operations, but it can also cause problems if not used properly. This section describes how -fastload affects the behavior of mlcp and some of the tradeoffs associated with enabling it.

The optimizations described by this section are only enabled if you explicitly specify the -fastload or -output_directory options. (The -output_directory option implies -fastload).

The -fastload option works slightly differently when used with -restrict_hosts. For details, see How -restrict_hosts Affects -fastload. The limitations of -fastload described in this section still apply.

By default, mlcp inserts documents into the database by distributing work across the e-nodes in your MarkLogic cluster. Each e-node inserts documents into the database according to the configured document assignment policy.

This means the default insertion process for a document is similar to the following:

  1. mlcp selects Host A from the available e-nodes in the cluster and sends it the document to be inserted.
  2. Using the document assignment policy configured for the database, Host A determines the document should be inserted into Forest F on Host B.
  3. Host A sends the document to Host B for insertion.

When you use -fastload (or -output_directory), mlcp attempts to cut out the middle step by applying the document assignment policy on the client. The interaction becomes similar to the following:

  1. Using the document assignment policy, mlcp determines the document should be inserted into Forest F on Host B.
  2. mlcp sends the document to Host B for insertion, with instructions to insert it into a specific forest.

Pre-determining the destination host and forest can always be done safely and consistently if all of the following conditions are met:

  • Your forest topology is stable.
  • You are creating rather than updating documents.

To make forest assignment decisions locally, mlcp gathers information about the database assignment policy and forest topology at the beginning of a job. If you change the assignment policy or forest topology while an mlcp import or copy operation is running, mlcp might make forest placement decisions inconsistent with those MarkLogic Server would make. This can cause problems such as duplicate document URIs and unbalanced forests.

Similar problems can occur if mlcp attempts to update a document already in the database, and the forest topology or assignment policy changes between the time the document was originally inserted and the time mlcp updates the document. Using user-specified forest placement when initially inserting a document creates the same conflict.

Therefore, it is not safe to enable -fastload optimizations in the following situations:

  • A document mlcp inserts already exists in the database and any of the following conditions are true:
    • The forest topology has changed since the document was originally inserted.
    • The assignment policy has changed since the document was originally inserted.
    • The assignment policy is not Legacy (default) or Bucket. For details, see How Assignment Policy Affects Optimization.
    • The document was originally inserted using user-specified forest placement.
  • A document mlcp inserts does not already exist in the database and any of the following conditions are true:
    • The forest topology changes while mlcp is running.
    • The assignment policy changes while mlcp is running.
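
The conditions above can be expressed as a checklist function. The sketch below is only a restatement of the list in JavaScript; the function and the property names on the input object are hypothetical, and mlcp does not expose such an API.

```javascript
// Returns the reasons -fastload would be unsafe for one document.
// An empty array means none of the listed conditions apply.
function fastloadUnsafeReasons(doc) {
  const reasons = [];
  if (doc.existsInDatabase) {
    if (doc.topologyChangedSinceInsert) {
      reasons.push("forest topology changed since the document was inserted");
    }
    if (doc.policyChangedSinceInsert) {
      reasons.push("assignment policy changed since the document was inserted");
    }
    if (!["legacy", "bucket"].includes(doc.assignmentPolicy)) {
      reasons.push("assignment policy is not Legacy or Bucket");
    }
    if (doc.insertedWithForestPlacement) {
      reasons.push("document was inserted using user-specified forest placement");
    }
  } else {
    if (doc.topologyChangesDuringJob) {
      reasons.push("forest topology changes while mlcp is running");
    }
    if (doc.policyChangesDuringJob) {
      reasons.push("assignment policy changes while mlcp is running");
    }
  }
  return reasons;
}

// A new document loaded into a stable cluster: no reasons reported.
console.log(fastloadUnsafeReasons({ existsInDatabase: false }));

// An update to an existing document under the statistical policy.
console.log(fastloadUnsafeReasons({ existsInDatabase: true, assignmentPolicy: "statistical" }));
```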

Assignment policy is a database configuration setting that affects how MarkLogic Server selects what forest to insert a document into or move a document into during rebalancing. For details, see Rebalancer Document Assignment Policies in Administrator's Guide.

Assignment policy was introduced with MarkLogic 7 and mlcp v1.2. If you use an earlier version of mlcp with MarkLogic 7 or later, the database you import data into with -fastload or -output_directory must be using the legacy assignment policy.

Any operation that changes the forests available for updates changes your forest topology, including the following:

  • Adding or employing a new forest
  • Removing or retiring an existing forest
  • Changing the updates-allowed state of a forest. For example, calling admin:forest-set-updates-allowed
  • Changing the database assignment policy

In most cases, it is your responsibility to determine whether or not you can safely use -fastload (or -output_directory, which implies -fastload). In cases where mlcp can detect -fastload is unsafe, it will disable it or give you an error.

How Assignment Policy Affects Optimization

This section describes how your choice of document assignment policy can introduce additional limitations and risks. Assignment policy is a database configuration setting that affects how MarkLogic Server selects what forest to insert a document into or move a document into during rebalancing. For details, see Rebalancer Document Assignment Policies in Administrator's Guide.

Assignment policy was introduced with MarkLogic 7 and mlcp v1.2. If you use an earlier version of mlcp with MarkLogic 7 or later, the database you import data into with -fastload or -output_directory must be using the legacy assignment policy.

The following table summarizes the limitations imposed by each assignment policy. If you do not explicitly set assignment policy, the default is Legacy or Bucket.

Assignment Policy Notes

Legacy (default)

Bucket

You can safely use -fastload if:
  • there are no pre-existing documents in the database with the same URIs; or
  • you use -output_directory; or
  • the URIs may be in use, but the forest topology has not changed since the documents were created, and the documents were not initially inserted using user-specified forest placement.
Statistical

You can only use -fastload to create new documents; updates are not supported. You should use -output_directory to ensure there are no updates.

All documents in a batch are inserted into the same forest. The rebalancer may subsequently move the documents if the batch size is large enough to cause the forest to become unbalanced.

If you set -fastload to true and mlcp determines database rebalancing is occurring or needs to be done at the start of a job, an error occurs.

Range

You can only use -fastload to create new documents; updates are not supported. You should use -output_directory to ensure there are no updates.

You should use -output_partition to tell mlcp which partition to insert documents into. The partition you specify is used even if it is not the correct partition according to your configured partition policy.

You can only use -fastload optimizations with range policy if you are licensed for Tiered Storage.

If you set -fastload to true and mlcp determines database rebalancing is occurring or needs to be done at the start of a job, an error occurs.

Query

You can only use -fastload to create new documents; updates are not supported. You should use -output_directory to ensure there are no updates.

You should use -output_partition to tell mlcp which partition to insert documents into. The partition you specify is used even if it is not the correct partition according to your configured partition policy.

You can only use -fastload optimizations with range policy if you are licensed for Tiered Storage.

If you set -fastload to true and mlcp determines database rebalancing is occurring or needs to be done at the start of a job, an error occurs.

Tuning Split Size and Thread Count for Local Mode

You can tune split size only when importing documents in local mode from one of the following input file types:

  • Whole documents (-input_file_type documents), whether from flat or compressed files.
  • Composite file types that support -split_input, such as delimited_text.

You cannot tune split size when creating documents from composite files that do not support -split_input, such as sequence files and aggregate XML files.

You can tune thread count for both whole documents and all composite file types. Thread count and split size can interact to affect job performance.

In local mode, a split defines the unit of work per thread devoted to a session with MarkLogic Server. The ideal split size is one that keeps all mlcp session threads busy. The default split size is 32M for local mode. Use the -max_split_size, -thread_count, and -thread_count_per_split options to tune your load.

By default, threads are assigned to splits in a round-robin fashion. For example, consider loading 120 small documents of length 1M. Since the default split size is 32M, the load is broken into 4 splits. If -thread_count is 10, each split is assigned at least 2 threads (10 / 4 = 2, with a remainder of 2). The remaining 2 threads are each assigned to a split, so the threads are distributed across the splits as follows:

Split 1: 3 threads
Split 2: 3 threads
Split 3: 2 threads
Split 4: 2 threads

This distribution could result in two of the splits completing faster, leaving some threads idle. If you set -max_split_size to 12M, the load has 10 splits, which can be evenly distributed across the threads and may result in better thread utilization.
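
The arithmetic in this example can be reproduced with a short model. The JavaScript below is a simplified sketch, not mlcp's scheduler: it assumes the number of splits is the total input size divided by the split size, rounded up, and the function names are hypothetical.

```javascript
// Number of splits for a load of the given total size (sizes in megabytes).
function countSplits(totalSizeMb, maxSplitSizeMb = 32) {
  return Math.max(1, Math.ceil(totalSizeMb / maxSplitSizeMb));
}

// Hand out threads to splits round-robin and report threads per split.
// When there are fewer threads than splits, trailing splits get 0 in this model.
function threadsPerSplit(threadCount, splitCount) {
  const base = Math.floor(threadCount / splitCount);
  const extra = threadCount % splitCount;
  return Array.from({ length: splitCount }, (_, i) => base + (i < extra ? 1 : 0));
}

// 120 documents of 1M each with the default 32M split size.
const defaultSplits = countSplits(120);
console.log(defaultSplits, threadsPerSplit(10, defaultSplits));
// 4 [ 3, 3, 2, 2 ]

// The same load with the split size lowered to 12M.
const smallerSplits = countSplits(120, 12);
console.log(smallerSplits, threadsPerSplit(10, smallerSplits));
// 10 [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
```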

Prior to 10.0-4.2, mlcp used 4 as the default thread count. Beginning with 10.0-4.2, mlcp polls the server at the start of a job to identify the available server threads on the port that handles mlcp requests, and uses this value as the default thread count. You can override the default by specifying -thread_count on the command line.

If -thread_count is less than the number of splits, the default behavior is one thread per split, up to the total number of threads. The remaining splits must wait until a thread becomes available.

If you specify -thread_count_per_split, each input split runs with the specified number of threads. The total number of threads, however, is still controlled by -thread_count if it is specified, or otherwise by the calculated default thread count.

If MarkLogic Server is not I/O bound, then raising the thread count, and possibly threads per split, can improve throughput when the number of splits is small but each split is very large. This is often applicable to loading from zip files, aggregate files, and delimited text files. Note that if MarkLogic Server is already I/O bound in your environment, increasing the concurrency of writes will not necessarily improve performance.

Reducing Memory Consumption With Streaming

The streaming protocol allows you to insert a large document into the database without holding the entire document in memory. Streaming uploads documents to MarkLogic Server in 128k chunks.

Streaming content into the database usually requires less memory on the host running mlcp, but ingestion can be slower because it introduces additional network overhead. Streaming also does not take advantage of mlcp's built-in retry mechanism. If an error occurs that is normally retryable, the job will fail.

Streaming is only usable when -input_file_type is documents. You cannot use streaming with delimited text files, sequence files, or archives.

To use streaming, enable the -streaming option. For example:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -username user -password password -host localhost \
    -port 8000 -input_file_path /my/dir -streaming

Improving Throughput with -split_input

If you are loading documents from very large files, you might be able to improve throughput using the -split_input option. When -split_input is true, mlcp attempts to break large input files that would otherwise be processed in a single split into multiple splits. In local mode, this enables portions of the input file to be loaded by separate threads.

This option can only be applied to composite input file types that logically produce multiple documents and for which mlcp can efficiently identify document boundaries, such as delimited_text. Not all composite file types are supported and files containing multi-byte characters must be UTF-8-encoded; for details, see Import Command Line Options.

In local mode, -split_input is false by default.

The -split_input option affects local mode as follows: Suppose you are importing a very large delimited text file in local mode with -split_input set to false and the data processed as a single split. The work might be performed by multiple threads (depending on the job configuration), but these threads read records from the input file synchronously. This can cause some read contention. If you set -split_input to true, then each thread is assigned its own chunk of input, resulting in less contention and greater concurrency.

The number of subdivisions is determined by the formula file-size / max-split-size, so you should also consider tuning split size to match your input data characteristics. For example, if your data consists of 1 delimited text file containing 16M of data, you can observe the following interactions between -split_input and -max_split_size:

Input File Size -split_input Split Size Number of Splits
16M false 32M 1
16M true 32M 1
16M true 1M 16

Tuning the split size in this case potentially enables greater concurrency because the multiple splits can be assigned to different threads or tasks.
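
The rows of the table follow from the file-size / max-split-size formula. The following JavaScript sketch reproduces them, assuming the result is rounded up and never drops below one split; the function name is hypothetical.

```javascript
// Number of splits for a single input file (sizes in megabytes).
function splitsForFile(fileSizeMb, splitInput, maxSplitSizeMb = 32) {
  // Without -split_input the whole file is processed as one split.
  if (!splitInput) return 1;
  return Math.max(1, Math.ceil(fileSizeMb / maxSplitSizeMb));
}

console.log(splitsForFile(16, false, 32)); // 1
console.log(splitsForFile(16, true, 32));  // 1
console.log(splitsForFile(16, true, 1));   // 16
```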

Split size is tunable using -max_split_size, -min_split_size, and block size. For details, see Tuning Split Size and Thread Count for Local Mode.

MLCP Concurrent Jobs

We do not recommend using concurrent mlcp jobs. Regardless of the version, mlcp doesn't support concurrent jobs if mlcp is importing from/exporting to the same data file. In addition, beginning in 10.0-4.2, each mlcp job uses the maximum number of threads available on the server as the default thread count (more about this can be found in the 10.0-4.2 release notes). Therefore, using concurrent mlcp jobs will not improve performance, as one job is already using full concurrent capacity.

Transforming Content During Ingestion

You can create an XQuery or Server-Side JavaScript function and install it on MarkLogic Server to transform and enrich content before inserting it into the database. Your function runs on MarkLogic Server. You can use such functions with the mlcp import and copy commands.

Creating a Custom XQuery Transformation

The following topics describe how to implement a server-side content transformation function in XQuery:

Function Signature

A custom transformation is an XQuery function module that conforms to the following interface. Your function receives a single input document, described by $content, and can generate zero, one, or many output documents.

declare function yourNamespace:transform(
  $content as map:map,
  $context as map:map)
as map:map*

Input Parameters

The table below describes the input parameters to a transform function:

Parameter Description
$content Data about the original input document. The map contains the following keys:
  • uri - The URI of the document being inserted into the database.
  • value - The contents of the input document, as a document node, binary node, or text node.
$context Additional context information about the insertion, such as transformation-specific parameter values. The map can contain the following keys when your transform function is invoked:
  • transform_param : The value passed by the client through the -transform_param option, if any. Your function is responsible for parsing and validation.
  • collections : Collection URIs specified by the -output_collections option. Value format: A sequence of strings.
  • permissions : Permissions specified by the -output_permissions option. Value format: A sequence of sec:permission elements, as produced by xdmp:permission.
  • quality : The document quality specified by the -output_quality parameter. Value format: An integer value.
  • temporalCollection : The temporal collection URI specified by the -temporal_collection option. Value format: A string.

The type of node your function receives in the value property of $content depends on the input document type, as determined by mlcp from the -document_type option or URI extension. For details, see How mlcp Determines Document Type. The type of node your function returns in the value property should follow the same guidelines.

The table below outlines the relationship between document type and the node type your transform function should expect.

Document Type value node type
XML document-node
JSON document-node
BINARY binary-node
TEXT text-node

The collections, permissions, quality, and temporal collection metadata from the mlcp command line is made available to your function so that you can modify or replace the values. If a given metadata category is not specified on the command line, the key will not be present in the input map.

Expected Output

Your function can produce more than one output document. For each document, your function should return a map:map. The map:map for an output document must use the same keys as the $content map (uri and value).

Modifying the document URI in a transformation can cause duplicate URIs when combined with the -fastload option, so you should not use -fastload or -output_directory with a transformation module that changes URIs. For details, see Time vs. Correctness: Understanding -fastload Tradeoffs.

The documents returned by your transformation should be exactly as you want to insert them into the database. No further transformations are applied by the mlcp infrastructure. For example, a transform function cannot affect document type just by changing the URI. Instead, it must convert the document node. For details, see Example: Changing the URI and Document Type.

You can use the context parameter to specify collections, permissions, quality, and key-value metadata for the documents returned by your transform. Use the following keys and data formats for specifying various categories of metadata:

Context Map Key Expected Value Format
collections A sequence of strings containing collection URIs.
permissions A sequence of sec:permission elements, each representing a capability and a role id. For details, see xdmp:permission.
quality An integer value (or a string that can be converted to an integer).
metadata A map:map containing key-value metadata.
temporalCollection A string containing a temporal document collection URI.

For a description of the meaning of the keys, see Input Parameters.

If your function returns multiple documents, they will all share the metadata settings from the context parameter.

Example Implementation

The following example adds an attribute to incoming XML documents and leaves non-XML documents unmodified. The attribute value is specified on the mlcp command line, using the -transform_param option.

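xquery version "1.0-ml";
(: The namespace URI is illustrative; use the same value for -transform_namespace. :)
module namespace example = "http://marklogic.com/example";

declare function example:transform(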
  $content as map:map,
  $context as map:map
) as map:map*
{
  let $attr-value := 
   (map:get($context, "transform_param"), "UNDEFINED")[1]
  let $the-doc := map:get($content, "value")
  return
    if (fn:empty($the-doc/element()))
    then $content
    else
      let $root := $the-doc/*
      return (
        map:put($content, "value",
          document {
            $root/preceding-sibling::node(),
            element {fn:name($root)} {
              attribute { fn:QName("", "NEWATTR") } {$attr-value},
              $root/@*,
              $root/node()
            },
            $root/following-sibling::node()
          }
        ), $content
      )
};

For an end-to-end example of using this transform, see Example: Server-Side Content Transformation.

Creating a Custom JavaScript Transformation

The following topics describe how to implement a server-side content transformation function in Server-Side JavaScript:

Function Signature

A custom transformation is a JavaScript function module that conforms to the following interface. Your function receives a single input document, described by content, and can generate zero, one, or many output documents.

function yourTransform(content, context)

Input Parameters

The content parameter is an object containing data about the original input document. The content parameter has the following form:

{ uri: string,
  value: node
}

The type of node your function receives in content.value depends on the input document type, as determined by mlcp from the -document_type option or URI extension. For details, see How mlcp Determines Document Type. The type of node your function returns in the value property should follow the same guidelines.

The table below outlines the relationship between document type and the node type your transform function should expect (or return).

Document Type value node type
XML document-node
JSON document-node
BINARY binary-node
TEXT text-node

The context parameter can contain context information about the insertion, such as any transform-specific parameters passed on the mlcp command line. The context parameter has the following form:

{ transform_param: string,
  collections: [ string, ... ],
  permissions: [ object, ... ],
  quality: number,
  temporalCollection: string
}

The following table describes the properties of the input parameters in more detail:

Parameter Description
content
  • uri - The URI of the document being inserted into the database.
  • value - The contents of the input document, as a document node, binary node, or text node; see the table above.
context
  • transform_param - The value passed by the client through the -transform_param option, if any. Your function is responsible for parsing and validating the input string.
  • collections - Collection URIs specified by the -output_collections option. Value format: An array of strings.
  • permissions - Permissions specified by the -output_permissions option. Value format: An array of permissions objects, as produced by xdmp.permission.
  • quality - The document quality specified by the -output_quality option. Value format: A number.
  • temporalCollection - The temporal collection URI specified by the -temporal_collection option. Value format: A string.

The collections, permissions, quality, and temporal collection metadata from the mlcp command line is made available to your function so that you can modify or replace the values. If a given metadata category is not specified on the command line, the property will not be present in the context object.
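Because transform_param reaches your function as a single string, one way to pass several settings is to encode them as a JSON object on the command line and decode them inside the transform. The following is a minimal sketch of that approach, not part of mlcp itself. The helper name parseTransformParam and the propName and propValue keys are hypothetical. It uses only standard JavaScript, so it can be exercised outside MarkLogic.

```javascript
// Hypothetical helper: decode a JSON-encoded -transform_param value.
// Returns a new object that layers the supplied values over defaults.
function parseTransformParam(raw, defaults) {
  // transform_param is absent when -transform_param was not given.
  if (raw === undefined || raw === null || raw === '') {
    return Object.assign({}, defaults);
  }
  let parsed;
  try {
    parsed = JSON.parse(String(raw));
  } catch (e) {
    throw new Error('transform_param is not valid JSON: ' + raw);
  }
  // Accept only a plain JSON object, not an array or a scalar.
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    throw new Error('transform_param must be a JSON object');
  }
  return Object.assign({}, defaults, parsed);
}

// Illustrative defaults and input (both are assumptions for this sketch).
const defaults = { propName: 'NEWPROP', propValue: 'UNDEFINED' };
const settings = parseTransformParam('{"propValue": "my-value"}', defaults);
console.log(settings.propName + '=' + settings.propValue);
```

Inside a transform you would call parseTransformParam(context.transform_param, defaults) and supply the JSON text as the value of -transform_param, quoted as your shell requires.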

Expected Output

Your function can produce more than one output document. For each document, your function should return a JavaScript object containing the same properties as the content input parameter (uri and value). When returning multiple document objects, put them in a Sequence.

The documents returned by your transformation should be exactly as you want to insert them into the database. No further transformations are applied by the mlcp infrastructure. For example, a transform function cannot affect document type just by changing the URI. Instead, it must convert the document node. For details, see Example: Changing the URI and Document Type.

You can modify the context input parameter to specify collections, permissions, quality, and key-value metadata for the documents returned by your transform. Use the following property names and data formats for specifying various categories of metadata:

Context Property Expected Value Format
collections An array of strings, each representing a collection URI.
permissions An array of permission objects, each containing a capability and a roleId property. For details, see xdmp.permission.
quality An integer value (or a string that can be converted to an integer).
metadata An object where each property represents a key-value metadata item.
temporalCollection A string containing a temporal document collection URI.

For a description of the meaning of the keys, see Input Parameters.

If your function returns multiple documents, they will all share the metadata settings from the context parameter.
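As a rough illustration of the property formats listed above, the following sketch shows a transform that leaves the document untouched and only adjusts insertion metadata through the context object. The function name setReviewMetadata, the collection URI /review/pending, and the metadata key source are hypothetical. The sketch assumes context.collections behaves like an ordinary JavaScript array, and it omits permissions because they require xdmp.permission. The sample call at the end uses plain objects so the logic can be tried outside MarkLogic.

```javascript
// Hypothetical transform: pass the document through unchanged and
// set collections, quality, and key-value metadata via the context.
function setReviewMetadata(content, context) {
  // collections is absent when -output_collections was not specified.
  const existing = context.collections || [];
  // "/review/pending" is an illustrative collection URI.
  context.collections = existing.concat(['/review/pending']);
  // quality takes an integer (or a string convertible to one).
  context.quality = 5;
  // metadata is an object whose properties are key-value metadata items.
  context.metadata = Object.assign({}, context.metadata, { source: 'mlcp' });
  return content;
}

// Stand-in objects for trying the function outside MarkLogic.
const sampleContext = { collections: ['/imports'] };
const sampleContent = { uri: '/doc.json', value: null };
setReviewMetadata(sampleContent, sampleContext);
console.log(JSON.stringify(sampleContext));
```

To use a function like this with mlcp, export it from your module in the same way as the other JavaScript examples in this chapter, for instance with exports.transform = setReviewMetadata.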

Example Implementation

The following example adds a property named NEWPROP to incoming JSON documents and leaves non-JSON documents unmodified. The property value is specified on the mlcp command line, using the -transform_param option.

// Add a property named "NEWPROP" to any JSON input document.
// Otherwise, input passes through unchanged.

function addProp(content, context)
{
  const propVal = (context.transform_param == undefined)
                 ? "UNDEFINED" : context.transform_param;

  if (xdmp.nodeKind(content.value) == 'document' &&
      content.value.documentFormat == 'JSON') {
    // Convert input to mutable object and add new property
    const newDoc = content.value.toObject();
    newDoc.NEWPROP = propVal;

    // Convert result back into a document
    content.value = xdmp.unquote(xdmp.quote(newDoc));
  }
  return content;
};

exports.addProp = addProp;

Implementation Guidelines

You should be aware of the following guidelines and limitations when implementing your transformation function:

  • If you use a server-side transform with -fastload (or -output_directory, which enables -fastload), your transformation function only has access to database content in the same forest as the input document. If your transformation function needs general access to the database, do not use -fastload or -output_directory.

Installing a Custom Transformation

Install the XQuery or JavaScript module containing your function into the modules database or modules root directory of the XDBC App Server associated with the destination database. For import operations, this is the App Server identified by the -host and -port mlcp command line options. For copy operations, this is the App Server identified by the -output_host and -output_port mlcp command line options.

Best practice is to install your libraries into the modules database of your XDBC App Server. If you install your module into the modules database, MarkLogic Server automatically makes the implementation available throughout your MarkLogic Server cluster. If you choose to install dependent libraries into the Modules directory of your MarkLogic Server installation, you must manually do so on each node in your cluster.

MarkLogic Server supports several methods for loading modules into the modules database:

If you use the filesystem instead of a modules database, you can manually install your module into the Modules directory. Copy the module into MARKLOGIC_INSTALL_DIR/Modules or into a subdirectory of this directory. The default location of this directory is:

  • Unix: /opt/MarkLogic/Modules
  • Windows: C:\Program Files\MarkLogic\Modules

If your transformation function requires other modules, you should also install the dependent libraries in the modules database or the modules directory.

For a complete example, see Example: Server-Side Content Transformation.

Using a Custom Transformation

Once you install a custom transformation function on MarkLogic Server, you can apply it to your mlcp import or copy job using the following options:

  • -transform_module - The path to the module containing your transformation. Required.
  • -transform_namespace - The namespace of your transformation function. If omitted, no namespace is assumed.
  • -transform_function - The local name of your transformation function. If omitted, the name transform is assumed.
  • -transform_param - Optional additional string data to be passed through to your transformation function.

Take note of the following limitations:

  • When -fastload is in effect, your transform function runs in the scope of a single forest (the forest mlcp determines is the appropriate destination for the file being inserted). This means if you change the document URI as part of your transform, you can end up creating documents with duplicate URIs.
  • When you use a transform function, all the documents in each batch are transformed and inserted into the database as a single statement. This means, for example, that if the (transformed) batch contains more than one document with the same URI, you will get an XDMP-CONFLICTINGUPDATES error.
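If your transform rewrites URIs, it can be worth checking ahead of time whether two inputs would end up with the same output URI. The following sketch is a client-side check that runs outside mlcp and MarkLogic. It is an illustration only, and the function name findUriCollisions and the sample rewrite rule are hypothetical.

```javascript
// Hypothetical pre-flight check: apply the same URI rewrite rule your
// transform uses and report output URIs produced by more than one input.
function findUriCollisions(inputUris, rewriteUri) {
  const sources = new Map();
  for (const uri of inputUris) {
    const out = rewriteUri(uri);
    if (!sources.has(out)) {
      sources.set(out, []);
    }
    sources.get(out).push(uri);
  }
  const collisions = {};
  for (const [out, from] of sources) {
    if (from.length > 1) {
      collisions[out] = from;
    }
  }
  return collisions;
}

// Illustrative rewrite rule: replace a trailing ".1" with ".json".
const toJson = (uri) => uri.replace(/\.1$/, '.json');
const collisions = findUriCollisions(
  ['/data/a.1', '/data/a.json', '/data/b.1'], toJson);
console.log(collisions);
```

In this sample, /data/a.1 and /data/a.json both map to /data/a.json, so the check reports that URI along with the two inputs that produce it.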

The following example command assumes you previously installed a transform module with path /example/mlcp-transform.xqy, and that the module implements a function named transform (the default function name) in the namespace http://marklogic.com/example. The function expects a user-defined parameter value, supplied using the -transform_param option.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -mode local -host mlhost -port 8000 \
    -username user -password password \
    -input_file_path /space/mlcp-test/data \
    -transform_module /example/mlcp-transform.xqy \
    -transform_namespace "http://marklogic.com/example" \
    -transform_param "my-value"

For a complete example, see Example: Server-Side Content Transformation.

Example: Server-Side Content Transformation

This example walks you through installing and using an XQuery or Server-Side JavaScript transform function to modify content ingested with mlcp. The example XQuery transform function modifies XML documents by adding an attribute named NEWATTR, with an attribute value specified on the mlcp command line. The example JavaScript transform function modifies JSON documents by adding a new property named NEWPROP, with a value specified on the mlcp command line.

This example assumes you have already created an XDBC App Server, configured to use "/" as the root and a modules database of Modules.

  1. Create the sample input files
  2. Create the XQuery transform module
  3. Create the JavaScript transform module
  4. Install the transformation module
  5. Apply the transformation

Create the sample input files

This section walks you through creating sample input data to be ingested by mlcp. You can use other data.

  1. Create a directory to hold the sample input data. For example:
    $ mkdir -p /space/mlcp/txform/data
  2. Create a file named txform.xml in the sample data directory with the following contents:
    <parent><child/></parent>
  3. Create a file named txform.json in the sample data directory with the following contents:
    { "key": "value" }

Create the XQuery transform module

If you prefer to work with a Server-Side JavaScript transform function, skip this section and go to Create the JavaScript transform module.

This example module modifies XML input documents by adding an attribute named NEWATTR. Other input document types pass through the transform unmodified.

In a location other than the sample input data directory, create a file named transform.xqy with the following contents. For example, copy the following into /space/mlcp/txform/transform.xqy.

xquery version "1.0-ml";
module namespace example = "http://marklogic.com/example";

(: If the input document is XML, insert @NEWATTR, with the value
 : specified in the input parameter. If the input document is not
 : XML, leave it as-is.
 :)
declare function example:transform(
  $content as map:map,
  $context as map:map
) as map:map*
{
  let $attr-value := 
    (map:get($context, "transform_param"), "UNDEFINED")[1]
  let $the-doc := map:get($content, "value")
  return
    if (fn:empty($the-doc/element()))
    then $content
    else
      let $root := $the-doc/*
      return (
        map:put($content, "value",
          document {
            $root/preceding-sibling::node(),
            element {fn:name($root)} {
              attribute { fn:QName("", "NEWATTR") } {$attr-value},
              $root/@*,
              $root/node()
            },
            $root/following-sibling::node()
          }
        ), $content
      )
};

Create the JavaScript transform module

If you prefer to work with an XQuery transform function, skip this section and go to Create the XQuery transform module.

This example module modifies JSON input documents by adding a property named NEWPROP. Other input document types pass through the transform unmodified.

In a location other than the sample input data directory, create a file named transform.sjs with the following contents. For example, copy the following into /space/mlcp/txform/transform.sjs.

// Add a property named "NEWPROP" to any JSON input document.
// Otherwise, input passes through unchanged.

function addProp(content, context)
{
  var propVal = (context.transform_param == undefined)
                 ? "UNDEFINED" : context.transform_param;

  // Only JSON document nodes are modified
  var docType = xdmp.nodeKind(content.value);
  if (docType == 'document' &&
      content.value.documentFormat == 'JSON') {
    // Convert input to mutable object and add new property
    var newDoc = content.value.toObject();
    newDoc.NEWPROP = propVal;

    // Convert result back into a document
    content.value = xdmp.unquote(xdmp.quote(newDoc));
  }
  return content;
};

exports.transform = addProp;

Install the transformation module

This section walks you through installing the transform module(s) created in Create the XQuery transform module or Create the JavaScript transform module.

These instructions assume you use the XDBC App Server and Documents database pre-configured on port 8000. This procedure installs the module using Query Console. You can use another method.

For more detailed instructions on using Query Console, see Query Console User Guide.

  1. Navigate to Query Console in your browser:
    http://yourhost:8000/qconsole/
  2. Create a new query by clicking the "+" at the top of the query editor.
  3. Select XQuery in the Query Type dropdown.
  4. Install the XQuery and/or JavaScript module by copying one of the following scripts into the new query. Modify the first parameter of xdmp:document-load to match the path to the transform module you previously created.
    1. To install the XQuery module, use the following script:
      xquery version "1.0-ml";
      xdmp:document-load("/space/mlcp/txform/transform.xqy",
          <options xmlns="xdmp:document-load">
            <uri>/example/mlcp-transform.xqy</uri>
            <repair>none</repair>
            <permissions>{xdmp:default-permissions()}</permissions>
          </options>)
    2. To install the JavaScript module, use the following script:
      xquery version "1.0-ml";
      xdmp:document-load("/space/mlcp/txform/transform.sjs",
          <options xmlns="xdmp:document-load">
            <uri>/example/mlcp-transform.sjs</uri>
            <repair>none</repair>
            <permissions>{xdmp:default-permissions()}</permissions>
          </options>)
  5. Select the modules database of your XDBC App Server in the Content Source dropdown at the top of the query editor. If you use the XDBC App Server on port 8000, this is the database named Modules.
  6. Click the Run button. Your module is installed in the modules database.
  7. To confirm installation of your module, click the Explore button at the top of the query editor and note your module installed with URI /example/mlcp-transform.xqy or /example/mlcp-transform.sjs.

Apply the transformation

To ingest the sample documents and apply the previously installed transformation, use a command similar to the following. Change the username, password, host, port, and input_file_path options to match your environment.

Use a command similar to the following if you installed the XQuery transform module:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -mode local -host mlhost -port 8000 \
    -username user -password password \
    -input_file_path /space/mlcp/txform/data \
    -transform_module /example/mlcp-transform.xqy \
    -transform_namespace "http://marklogic.com/example" \
    -transform_param "my-value"

Use a command similar to the following if you installed the JavaScript transform module:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -mode local -host mlhost -port 8000 \
    -username user -password password \
    -input_file_path /space/mlcp/txform/data \
    -transform_module /example/mlcp-transform.sjs \
    -transform_function transform \
    -transform_param "my-value"

mlcp should report creating two documents. Near the end of the mlcp output, you should see lines similar to the following:

... INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
... INFO contentpump.LocalJobRunner: Total execution time: 1 sec

Use Query Console to explore the content database associated with your XDBC App Server. Confirm that mlcp created 2 documents. If your input was in the directory /space/mlcp/txform/data, then the document URIs will be:

  • /space/mlcp/txform/data/txform.xml
  • /space/mlcp/txform/data/txform.json

If you use the XQuery transform, then exploring the contents of txform.xml in the database should show a NEWATTR attribute was inserted by the transform, with the value from -transform_param. The document contents should be as follows:

<parent NEWATTR="my-value">
  <child/>
</parent>

If you use the JavaScript transform, then exploring the contents of txform.json in the database should show a NEWPROP property was inserted by the transform, with the value from -transform_param. The document contents should be as follows:

{ "key": "value", "NEWPROP": "my-value"}

Example: Changing the URI and Document Type

This example demonstrates changing the type of a document from binary to XML and changing the document URI to match.

Transforms that change the document URI should not be combined with the -fastload or -output_directory options as they can cause duplicate document URIs. For details, see Time vs. Correctness: Understanding -fastload Tradeoffs.

As described in How mlcp Determines Document Type, the URI extension and MIME type mapping are used to determine document type when you use -document_type mixed. However, transform functions do not run until after document type selection is completed. Therefore, if you want to affect document type in a transform, you must convert the document node, as well as optionally changing the output URI.

Suppose your input document set generates an output document URI with the unmapped extension .1, such as /path/doc.1. Since 1 is not a recognized URI extension, mlcp creates a binary document node from this input file by default. The example transform function in this section intercepts such a document and transforms it into an XML document.

Note that if you define a MIME type mapping that maps the extension 1 to XML (or JSON) in your MarkLogic Server configuration, then mlcp creates a document of the appropriate type to begin with, and this conversion becomes unnecessary.

XQuery Implementation

This module detects input documents with URI suffixes of the form .1 and converts them into XML documents with a .xml URI extension. Note that the transform does not snoop the content to ensure it is actually XML.

xquery version "1.0-ml";
module namespace example = "http://marklogic.com/example";

declare function example:mod_doc_type(
  $content as map:map,
  $context as map:map
) as map:map*
{
  let $orig-uri := map:get($content, "uri")
  return
  (: match on the final extension only, so URIs with other dots still work :)
  if (fn:ends-with($orig-uri, ".1")) then
    let $doc-type := xdmp:node-kind(map:get($content, "value"))
    return (
      (: change the URI to an xml suffix :)
      map:put($content, "uri", 
        fn:replace($orig-uri, "\.1$", ".xml")
      ),
      (: convert the input from binary node to xml document node :)
      if ($doc-type = "binary") then
        map:put(
          $content, "value",
          xdmp:unquote(xdmp:quote(map:get($content, "value")))
        )
      else (),
      $content
    )
  else $content
};

JavaScript Implementation

This module detects input documents with URI suffixes of the form .1 and converts them into JSON documents with a .json URI extension. Note that the transform does not snoop the content to ensure it is actually JSON.

function modDocType(content, context)
{
  var uri = String(content.uri);
  var dot = uri.lastIndexOf('.');
  if (dot > 0) {
    var suffix = uri.slice(dot);
    if (suffix == '.1') {
      content.uri = uri.substring(0,dot+1) + 'json';
      if (xdmp.nodeKind(content.value) == 'binary') {
        // convert the content to a JSON document
        content.value = xdmp.unquote(xdmp.quote(content.value));
      }
    }
  }
  return content;
};

exports.transform = modDocType;

Controlling How mlcp Connects to MarkLogic

This section describes how mlcp connects to MarkLogic by default, and options you can use to modify this behavior. For example, you can force mlcp to only connect to MarkLogic through a load balancer host.

See the following topics for more details:

How mlcp Uses the Host List

You must specify at least one host with the -host command line option. You can specify multiple hosts.

If any hostname listed in the value of the -host option is not resolvable by mlcp at the beginning of a job, then mlcp will abort the job with an IllegalArgumentException.

Assuming all hostnames are resolvable, mlcp uses the first of these hosts to gather information about the target database. If mlcp is unable to connect to the first host in the -host list, then mlcp will move on to the next host in the list. If mlcp cannot connect to any of the listed hosts, then the job will fail with an IOException.

If mlcp successfully retrieves a list of forest hosts, then mlcp subsequently connects directly to these hosts when distributing work across the cluster, whether or not these hosts are specified in the -host option. In this way, your job does not need to be aware of the cluster topology.

This behavior applies to the import, export, and copy commands. (For a copy job, you specify hosts through -input_host and -output_host, rather than -host.)

You can also restrict mlcp to just the hosts listed by the -host option. For details, see Restricting the Hosts mlcp Uses to Connect to MarkLogic.

Restricting the Hosts mlcp Uses to Connect to MarkLogic

You can restrict the hosts to which mlcp distributes work using the -restrict_hosts and -host command line options. You might find this option combination useful in situations such as the following:

  • Limit the host working set to just the e-nodes in your cluster.
  • The public and private DNS names of a host differ, such as can occur for an AWS instance.

    MarkLogic automatically sets -restrict_hosts to true when it detects the presence of a load balancer.

When -restrict_hosts is set to true, mlcp will only connect to the hosts listed in the -host option, rather than using the approach described in How mlcp Uses the Host List.

Using -restrict_hosts will usually degrade the performance of an mlcp job because mlcp cannot distribute work as efficiently.

For example, if you're using mlcp with a load balancer between your client and your MarkLogic cluster, you can specify the load balancer with -host and set -restrict_hosts to true to prevent mlcp from attempting to bypass the load balancer and connect directly to the forest hosts.

You can restrict mlcp's host list when using the import, export, and copy commands. For import and export, use the -host and -restrict_hosts options. For copy, use -input_host and -restrict_input_hosts and/or -output_host and -restrict_output_hosts.

How -restrict_hosts Affects -fastload

You can use -fastload with -restrict_hosts. The performance benefit of -fastload is smaller with -restrict_hosts than without it, but the job is still faster than one that does not use -fastload. The usual cautions about -fastload apply; see Time vs. Correctness: Understanding -fastload Tradeoffs.

The -fastload and -restrict_hosts options interact as follows:

Without -restrict_hosts, mlcp figures out which host contains the destination forest for a document, and then connects directly to that host. When -restrict_hosts is true, a connection to the forest host might not be possible. In this case, mlcp connects to an allowed e-node, and includes the detailed destination information along with the document. The destination details make an insertion faster than it would otherwise be.

Failover Handling

Failover occurs when a forest or a host in a cluster becomes unavailable, due to events such as a forest restart or a host becoming unreachable. You can configure a database to use local or shared disk failover to attempt automatic recovery; for details see High Availability of Data Nodes With Failover in the Scalability, Availability, and Failover Guide.

Failover support in mlcp is only available when running mlcp against MarkLogic 9 or later. With older MarkLogic versions, the job will fail if mlcp is connected to a host that becomes unavailable.

mlcp always attempts to connect to a new host during a failover event. mlcp can potentially recover from a failover event in the following cases:

  • If mlcp receives a connection error that indicates an e-node serving the database is down, mlcp attempts to select another host. For a job that is not running in fastload mode, mlcp selects the next host in its host list. For a fastload job, mlcp attempts to determine the replica forest and host and connect to that host.
  • If mlcp receives a retryable error from MarkLogic, it will retry the operation with the same host. For example, a forest restart or a forest replica host going down can cause a retryable error.

If mlcp is able to re-establish a connection in these cases, then the job can continue. It is possible for some documents not to be imported, depending on the configuration of the job. mlcp can only retry the current batch.

  • If -transaction_size is 1, then mlcp only needs to retry the current batch. In most cases, a successful failover will not cause any insertions to fail.
  • If -transaction_size is greater than 1, then mlcp can only retry the current batch. Other batches in the same transaction cannot be retried. Some documents might not be inserted.
  • Even if -transaction_size is 1, mlcp might fail to import all documents in the face of a failover event in some cases. For example:
    • Failover does not succeed within 5 minutes. If it takes more than 5 minutes for MarkLogic to recover from the failure, then mlcp aborts the job and reports an error.

mlcp reports any documents that could not be inserted due to the failover.

The following messages are an example of mlcp output during a failover event. Timestamps have been elided.

  1. A failure of some kind occurs, such as a host going down. The exact error messages will depend on the type of failure. Notice that the example errors below include a retryable exception.
    ...INFO contentpump.LocalJobRunner: completed 41%
    ...WARNING [29] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
    ...WARN mapreduce.ContentWriter: Batch 981349710.122: Exception:Error parsing HTTP headers: Premature EOF, partial header line read: ''
    ...WARNING [29] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
    ...WARN mapreduce.ContentWriter: Batch 981349710.122: Failed rolling back transaction Error parsing HTTP headers: Premature EOF, partial header line read: ''
    ...WARNING [29] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
    ...ERROR mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:XDMP-XDQPDISC: XDQP connection disconnected, server=somehost
    ...ERROR mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:XDMP-XDQPDISC: XDQP connection disconnected, server=somehost
    ...ERROR mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:XDMP-XDQPDISC: XDQP connection disconnected, server=somehost
  2. mlcp begins retrying the failed insertion. Errors may continue to occur because MarkLogic is still failing over.
    ...INFO mapreduce.ContentWriter: Batch 981349710.122: Retrying document insert
    ...WARN  mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
    ...INFO mapreduce.ContentWriter: Batch 981349710.122: Retrying document insert
    ...INFO mapreduce.ContentWriter: Batch 981349710.122: Retrying document insert
    ...INFO mapreduce.ContentWriter: Batch 981349710.122: Retrying document insert
    ...WARN mapreduce.ContentWriter: Batch 981349710.122: Exception:Connection refused
    ...WARN mapreduce.ContentWriter: Batch 981349710.122: Exception:Connection refused
    ...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
    ...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
    ...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
    ...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
  3. Eventually, MarkLogic recovers and the job continues normally.

MLCP Retry Mechanism When Commit Fails During Ingestion

When mlcp is used to ingest content into Data Hub Service (DHS), it frequently catches exceptions when the static e-node gets overloaded, or if the dynamic e-nodes are unavailable, as they come and go.

Before 10.0-5, when an mlcp commit failed during ingestion, due to the exceptions listed above, mlcp did not retry the batch. All the documents in the current batch would fail permanently. The mlcp retry mechanism has been added in 10.0-5 to make mlcp more robust and able to recover from these exceptions.

There are three circumstances that need to be considered:

  • If -batch_size is 1 and -transaction_size is 1: mlcp uses AUTO transaction mode. Transactions automatically commit and roll back. mlcp will retry inserting the whole batch when it catches exceptions during commit.
  • If -batch_size is larger than 1 and -transaction_size is 1: mlcp uses UPDATE transaction mode, and explicitly commits and rolls back. mlcp will retry loading the whole batch if the exceptions caught during commit are retryable. mlcp retries a failed commit at most 15 times, sleeping between retries. The sleep interval starts at 0.5 seconds, doubles with each retry, and is at most 2 minutes. The total maximum sleep time sums up to ~16 minutes, which is tuned to wait for dynamic e-nodes to come up. In most cases, a successful retry will not cause any insertions to fail.
  • If -batch_size is larger than 1 and -transaction_size is larger than 1: mlcp does not retry in this situation as the client only caches the current batch. All the documents in the current transaction will fail permanently.
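The ~16 minute figure follows from the numbers given above. The sketch below reproduces the arithmetic under two assumptions that the text implies but does not state outright: the sleep interval is capped at 2 minutes once doubling would exceed it, and mlcp sleeps once before each of the 15 retries. The function name retrySleepSchedule is hypothetical, and this is not mlcp's actual implementation.

```javascript
// Hypothetical model of the retry sleep schedule described above:
// start at firstMs, double after each retry, never exceed capMs.
function retrySleepSchedule(maxRetries, firstMs, capMs) {
  const sleeps = [];
  let next = firstMs;
  for (let i = 0; i < maxRetries; i++) {
    sleeps.push(next);
    next = Math.min(next * 2, capMs);
  }
  return sleeps;
}

// 15 retries, 0.5 second first sleep, 2 minute cap.
const schedule = retrySleepSchedule(15, 500, 120000);
const totalMs = schedule.reduce((sum, ms) => sum + ms, 0);
console.log(schedule.map((ms) => ms / 1000).join(', '));
console.log((totalMs / 60000).toFixed(1) + ' minutes');
```

Under these assumptions the intervals are 0.5, 1, 2, 4, 8, 16, 32, and 64 seconds followed by seven 2-minute waits, for a total of 967.5 seconds, or roughly 16 minutes.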

mlcp only retries when the exceptions caught are retryable. Each time mlcp retries, it attempts to select another host. When the exceptions are not retryable, or the retries do not succeed within the ~16 minutes allowed for the DHS cluster to recover, all the documents in the current batch will fail permanently and mlcp will log the failure.

When the current batch fails during inserting or committing, the failures are logged at WARN level. If the exception is retryable, mlcp retries inserting the whole batch, and the retry messages are logged at DEBUG level. If the retry succeeds, the success message is logged at INFO level. If the exception is not retryable, or the maximum retry limit has been exceeded, the document or batch fails permanently and the failure is logged at ERROR level.

Each log message has a batch number in the format of xxxx.xxxx (two integers separated by a dot) attached to it. The first integer represents the current thread number and the second represents the batch count local to the current thread. Globally, xxxx.xxxx is unique. This batch number makes it easier to track down and debug batch failures.
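
Because every message carries the batch number, a small script can group log lines by batch when debugging a failed job. The following is a minimal sketch, not an mlcp utility: the helper name group_by_batch and the regular expression are illustrative assumptions, and the sample lines are copied from the log excerpts in this section.

```python
import re
from collections import defaultdict

# Matches both "Batch 1473219859.1010:" and "Batch #88895712.638:".
# Assumption: the log level appears before the batch number on the same line.
BATCH_RE = re.compile(
    r"\b(WARN|DEBUG|INFO|ERROR)\b.*?Batch #?(\d+)\.(\d+):\s*(.*)"
)


def group_by_batch(lines):
    """Map (thread_number, batch_count) to a list of (level, message)."""
    batches = defaultdict(list)
    for line in lines:
        match = BATCH_RE.search(line)
        if match:
            level, thread, count, message = match.groups()
            batches[(int(thread), int(count))].append((level, message))
    return dict(batches)


sample = [
    "...WARN contentpump.TransformWriter: Batch 1473219859.1010: Failed during inserting",
    "...DEBUG contentpump.TransformWriter: Batch 1473219859.1010: Retrying inserting batch, attempts: 1/15",
    "...INFO contentpump.TransformWriter: Batch 1473219859.1010: Retrying inserting batch is successful",
    "...WARN contentpump.TransformWriter: Batch #88895712.638: Failed committing transaction: Error parsing HTTP headers: Premature EOF, partial header line read: ''",
]
grouped = group_by_batch(sample)
for batch_id, events in grouped.items():
    print(batch_id, [level for level, _ in events])
```

Lines without a batch number, such as continuation lines of a stack trace, are skipped by this sketch.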

The following messages are examples of common exceptions caught when running mlcp against a DHS cluster on AWS/Azure. These exceptions mostly happen when e-nodes are down or the static e-node is overloaded. Timestamps have been removed from these examples.

...WARN contentpump.TransformWriter: Batch #88895712.638: Failed committing transaction: Error parsing HTTP headers: Premature EOF, partial header line read: ''
...WARN mapreduce.ContentWriter: Batch #88895712.638: QueryException:XDMP-XDQPDISC: XDQP connection disconnected, server=somehost
...WARN contentpump.TransformWriter: Batch #1520482927.642: Failed committing transaction: Server cannot accept request: Service Unavailable -- Stopping by SIGTERM from pid 3121
...WARN mapreduce.ContentWriter: Batch #1520482927.642: com.marklogic.xcc.exceptions.XQueryException: XDMP-NOTXN: No transaction with identifier 11132444146034518336
[Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=5bJZEjQ1L.z.marklogicsvc.com/52.224.204.231:8005, pool=0/64]]]
[Client: XCC/11.0-20200911, Server: XDBC/10.0-4]

mlcp gets XDMP-NOTXN when the transaction has already been committed or rolled back.

The following messages are an example of MLCP output during a retry event. Timestamps have been removed.

...WARN contentpump.TransformWriter: Batch 1473219859.1010: Exception:Server cannot accept request: Gateway Time-out
...WARN contentpump.TransformWriter: Batch 1473219859.1010: Failed during inserting
...DEBUG mapreduce.ContentWriter: Batch 1473219859.1010: Sleeping before retrying...sleepTime=500ms
...DEBUG contentpump.TransformWriter: Batch 1473219859.1010: Retrying inserting batch, attempts: 1/15
...INFO contentpump.TransformWriter: Batch 1473219859.1010: Retrying inserting batch is successful
...WARN contentpump.TransformWriter: Batch 278973739.75: Failed committing transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 918057596.3: Failed committing transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 278973739.75: Failed during committing
...WARN contentpump.TransformWriter: Batch 918057596.3: Failed during committing
...WARN contentpump.TransformWriter: Batch 1763434846.80: Failed committing transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 1763434846.80: Failed during committing
...WARN contentpump.TransformWriter: Batch 981349710.122: Failed committing transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 981349710.122: Failed during committing
...WARN mapreduce.ContentWriter: Batch 278973739.75: Failed rolling back transaction: No transaction
...DEBUG mapreduce.ContentWriter: com.marklogic.xcc.exceptions.XQueryException: XDMP-NOTXN: No transaction with identifier 11132444146034518336
[Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=5bJZEjQ1L.z.marklogicsvc.com/52.224.204.231:8005, pool=0/64]]]
[Client: XCC/11.0-20200911, Server: XDBC/10.0-4]
...DEBUG mapreduce.ContentWriter: Batch 278973739.75: Sleeping before retrying...sleepTime=500ms
...WARN contentpump.TransformWriter: Batch 1978594827.298: QueryException: JS-FATAL: xdmp:function(fn:QName(, transformInsertBatch), /MarkLogic/hadoop.sjs)($transform-module, $transform-function, $uris, $values, $insert-options, $transform-option) 
...WARN contentpump.TransformWriter: Batch 1978594827.298: Failed during inserting
...ERROR contentpump.TransformWriter: Batch 1978594827.298: Document failed permanently: /space/data/iplocations/IP2LOCATION-LITE-DB5.CSV.gz-0-2798613 in file:/space/data/iplocations/IP2LOCATION-LITE-DB5.CSV.gz at line 2798614

Limitations

There are two known limitations with the mlcp retry feature:

  • When the input type is archive, mlcp is not able to retry loading metadata/naked properties when commit fails, since by design the client does not cache these inputs.
  • Loading temporal documents may have issues. When an mlcp commit fails with an exception, mlcp tries to roll back before retrying the whole batch. However, the previous transaction may already have reached the server, in which case mlcp gets an XDMP-NOTXN exception. This can create issues for temporal documents, since they may be inserted multiple times.

MLCP Auto-scaling with Data Hub Service

Before 10.0-6, mlcp import jobs ran with a fixed number of threads until completion. Starting with 10.0-6, reactive auto-scaling is enabled for mlcp import jobs that run against Data Hub Service (DHS) hosted on AWS/Azure. The concurrency of mlcp now adjusts periodically based on the available server threads as dynamic e-nodes come and go in DHS. This feature improves mlcp performance by leveraging the scaling feature of DHS.

How MLCP Adjusts Client Concurrency

When running an import job, mlcp periodically sends polling requests to the server through the XCC layer to obtain the maximum number of server threads. When the DHS cluster adds more dynamic e-nodes, the server has more available concurrency. Based on the polling result, mlcp decides whether to scale out or scale in its own thread pool.

The following command line options can be used to tune this process:

  • -max_thread_percentage: The percentage (between 0 and 100) of maximum available server threads mlcp will use to run import jobs.
  • -polling_period: The interval (in minutes) at which mlcp sends polling requests to the server.
  • -polling_init_delay: The initial delay (in minutes) before mlcp starts sending the polling requests.

How Other Command Line Options Affect Auto-scaling

The following existing command line options also affect the auto-scaling feature:

  • -thread_count and -thread_count_per_split: When these two options are specified, mlcp will use a fixed number of threads and auto-scaling will not happen.
  • -max_threads: When -max_threads is specified, mlcp will cap the maximum thread count, and auto-scaling cannot go beyond this number. This is to prevent the client-side from running out of memory as the DHS cluster may have a huge number of nodes. By default, -max_threads is not set.
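
The interaction between the polled server thread count, -max_thread_percentage, and -max_threads can be pictured with the following sketch. It is an illustration under stated assumptions, not mlcp's actual algorithm: the function name target_thread_count is hypothetical, and rounding down while keeping at least one thread is an assumption that the text above does not specify.

```python
def target_thread_count(server_threads, max_thread_percentage=100, max_threads=None):
    """Estimate the client thread pool size after one polling cycle.

    Assumptions (not confirmed by the documentation):
    - the percentage is applied with integer (floor) division;
    - at least one thread is always kept.
    """
    target = max(1, server_threads * max_thread_percentage // 100)
    if max_threads is not None:
        # -max_threads caps the pool so the client does not run out of memory.
        target = min(target, max_threads)
    return target


print(target_thread_count(32))                   # all server threads
print(target_thread_count(32, 50))               # half of the server threads
print(target_thread_count(64, max_threads=48))   # capped by -max_threads
```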

How MLCP Assigns Threads in Auto-Scaling Process

When mlcp scales out or scales in, threads are assigned to or removed from the existing input splits in round-robin fashion, using the same logic discussed in Tuning Split Size and Thread Count for Local Mode.
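
A round-robin distribution of a thread pool over input splits can be sketched as follows. The function name assign_threads is hypothetical and the code is only an illustration of the idea, not mlcp source. The two calls use the pool sizes and split count that appear in the sample log output in MLCP Logs for Auto-Scaling.

```python
def assign_threads(total_threads, num_splits):
    """Hand out threads to splits one at a time, in round-robin order."""
    counts = [0] * num_splits
    for i in range(total_threads):
        counts[i % num_splits] += 1
    return counts


print(assign_threads(32, 3))  # [11, 11, 10]
print(assign_threads(16, 3))  # [6, 5, 5]
```

These results match the per-split thread counts reported in that sample log before and after the pool scales in from 32 to 16 threads.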

MLCP Logs for Auto-Scaling

When mlcp scales out or scales in, it logs a message at INFO level to notify the user about the scaling process. If the thread count has reached the maximum value, this is also logged at INFO level. For every periodic poll, mlcp logs the new available server threads at DEBUG level. If mlcp decides to scale out or scale in, the threads assigned to or deducted from each input split are also logged at DEBUG level.

The following messages are an example of common log messages a user may get in an auto-scaling process. Timestamps have been removed.

DEBUG contentpump.ThreadManager: Initial thread pool size: 32
DEBUG contentpump.ThreadManager: Thread pool will auto-scale based on available server threads.
DEBUG contentpump.ThreadManager: Running with MultithreadedMapper. Initial thread count for split #0: 11
DEBUG contentpump.ThreadManager: Running with MultithreadedMapper. Initial thread count for split #1: 11
DEBUG contentpump.ThreadManager: Running with MultithreadedMapper. Initial thread count for split #2: 10
INFO contentpump.LocalJobRunner:  completed 0%
DEBUG contentpump.ThreadManager: New available server threads: 32
DEBUG contentpump.ThreadManager: New available server threads: 32
DEBUG contentpump.ThreadManager: New available server threads: 16
INFO contentpump.ThreadManager: Thread pool is scaling-in. New thread pool size: 16
DEBUG contentpump.ThreadManager: Running with MultithreadedMapper. New thread count for split #0: 6
DEBUG contentpump.ThreadManager: Running with MultithreadedMapper. New thread count for split #1: 5
DEBUG contentpump.ThreadManager: Running with MultithreadedMapper. New thread count for split #2: 5
DEBUG contentpump.ThreadManager: New available server threads: 16

Import Command Line Options

This section summarizes the command line options available with the mlcp import command. The following command line options define your connection to MarkLogic:

Option Description
-host comma-list
Required. A comma separated list of hosts through which mlcp can connect to the destination MarkLogic Server. You must specify at least one host. For more details, see How mlcp Uses the Host List.
-port number
Port number of the destination MarkLogic Server. There should be an XDBC App Server on this port. Default: 8000.
-username string
MarkLogic Server user with which to import documents. Required, unless using Kerberos authentication.
-password string
Password for the MarkLogic Server user specified with -username. Required, unless using Kerberos authentication.

The following table lists command line options that define the characteristics of the import operation:

Option Description
-aggregate_record_element string
When splitting an aggregate input file into multiple documents, the name of the element to use as the output document root. Default: The first child element under the root element.
-aggregate_record_namespace string
The namespace of the element specified by -aggregate_record_element. Default: No namespace.
-aggregate_uri_id string
Deprecated. Use -uri_id instead. When splitting an aggregate input file into multiple documents, the element or attribute name within the document root to use as the document URI. Default: In local mode, hashcode-seqnum, where the hashcode is derived from the split number; in distributed mode, taskid-seqnum.
-batch_size number
The number of documents to process in a single request to MarkLogic Server. Default: 100. Maximum: 200.
-collection_filter comma-list
A comma-separated list of collection URIs. Only usable with -input_file_type forest. mlcp extracts only documents in these collections. This option can be combined with other filter options. Default: Import all documents.
-content_encoding string
The character encoding of input documents when -input_file_type is documents, aggregates, delimited_text, or rdf. The option value must be a character set name accepted by your JVM; see java.nio.charset.Charset. Default: UTF-8. Set to system to use the platform default encoding for the host on which mlcp runs.
-copy_collections boolean
When importing documents from an archive, whether to copy document collections from the source archive to the destination. Only applies when -input_file_type is archive or forest. Default: true.
-copy_metadata boolean
When importing documents from an archive, whether to copy document key-value metadata from the source archive to the destination. Only applies when -input_file_type is archive or forest. Default: true.
-copy_permissions boolean
When importing documents from an archive, whether to copy document permissions from the source archive to the destination. Only applies with -input_file_type archive. Default: true.
-copy_properties boolean
When importing documents from an archive, whether to copy document properties from the source archive to the destination. Only applies with -input_file_type archive. Default: true.
-copy_quality boolean
When importing documents from an archive, whether to copy document quality from the source archive to the destination. Only applies when -input_file_type is archive or forest. Default: true.
-data_type comma-list
When importing content with -input_file_type delimited_text and -document_type json, use this option to specify the data type (string, number, or boolean) to give to specific fields. The option value must be a comma separated list of name,datatype pairs, such as a,number,b,boolean. Default: All fields have string type. For details, see Controlling Data Type in JSON Output.
-database string
The name of the destination database. Default: The database associated with the destination App Server identified by -host and -port.
-delimiter character
When importing content with -input_file_type delimited_text, the delimiting character. Default: comma (,).
-delimited_root_name string
When importing content with -input_file_type delimited_text, the local name of the document root element. Default: root.
-delimited_uri_id string

Deprecated. Use -uri_id instead.

When importing content with -input_file_type delimited_text, the column name that contributes to the id portion of the URI for inserted documents. Default: The first column.

-directory_filter comma-list
A comma-separated list of database directory names. Only usable with -input_file_type forest. mlcp extracts only documents from these directories, plus related metadata. Directory names should usually end with /. This option can be combined with other filter options. Default: Import all documents.
-document_type string
The type of document to create when -input_file_type is documents, sequencefile or delimited_text. Accepted values: mixed (documents only), xml, json, text, binary. Default: mixed for documents, xml for sequencefile, and xml for delimited_text.
-fastload boolean
Whether or not to force optimal performance, even at the risk of creating duplicate document URIs. See Time vs. Correctness: Understanding -fastload Tradeoffs. Default: false.
-filename_as_collection boolean
Add each loaded document to a collection corresponding to the name of the input file. You cannot use this option when -input_file_type is rdf or forest. Useful when splitting an input file into multiple documents. If the filename contains characters not permitted in a URI, those characters are URI encoded. Default: false.
-generate_uri boolean
When importing content with -input_file_type delimited_text, or -input_file_type delimited_json, whether or not MarkLogic Server should automatically generate document URIs. Default: false for delimited_text, true for delimited_json. For details, see Default Document URI Construction.
-archive_metadata_optional boolean
When importing documents from a database archive, whether or not to ignore missing metadata files. If this is false and the archive contains no metadata, an error occurs. Default: false.
-input_compressed boolean
Whether or not the source data is compressed. Default: false.
-input_compression_codec string
When -input_compressed is true, the codec used for compression. Accepted values: zip, gzip.
-input_file_path string
A regular expression describing the filesystem location(s) to use for input. For details, see Regular Expression Syntax.
-input_file_pattern string
Load only input files that match this regular expression from the path(s) matched by -input_file_path. For details, see Regular Expression Syntax. Default: Load all files. This option is ignored when -input_file_type is forest.
-input_file_type type
The input file type. Accepted values: aggregates, archive, delimited_text, delimited_json, documents, forest, rdf, sequencefile. Default: documents.
-max_split_size number
When importing from files, the maximum number of bytes in one input split. Default: The maximum Long value (Long.MAX_VALUE).
-max_threads
The maximum number of threads mlcp can use. When set, auto-scaling cannot go beyond this number. Optional; not set by default.
-max_thread_percentage
The maximum percentage (integer between 0 and 100) of available server threads used by mlcp for import jobs. Default: 100.
-min_split_size number
When importing from files, the minimum number of bytes in one input split. Default: 0.
-mode string
Ingestion mode. Accepted values: local.
-modules string
Specify the name of the modules database to use when applying a server-side transformation. Accepted values: filesystem or a modules database name. Default: The modules database associated with the App Server.
-modules_root string
The modules root path to use when applying a server-side transformation. Default: The modules root configured for the App Server. If you also use -modules, then this path specifies the modules root for that modules database.
-namespace string
The default namespace for all XML documents created during loading.
-options_file string
Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_cleandir boolean
Whether or not to delete all content in the output database directory prior to loading. Default: false.
-output_collections comma-list
A comma separated list of collection URIs. Loaded documents are added to these collections.
-output_directory string
The destination database directory in which to create the loaded documents. If the directory exists, its contents are removed prior to ingesting new documents. Using this option enables -fastload by default, which can cause duplicate URIs to be created. See Time vs. Correctness: Understanding -fastload Tradeoffs.
-output_graph string
Only usable with -input_file_type rdf. For quad data, specifies the default graph for quads that do not include an explicit graph label. For other triple formats, specifies the graph into which to load all triples. For details, see Loading Triples.
-output_language string
The xml:lang to associate with loaded documents.
-output_partition string
The name of the database partition in which to create documents. For details, see How Assignment Policy Affects Optimization, and Range Partitions or Query Partitions in the Administrator's Guide.
-output_override_graph string
Only usable with -input_file_type rdf. The graph into which to load all triples. For quads, overrides any graph label in the quads. For details, see Loading Triples.
-output_permissions comma-list
A comma separated list of (role,capability) pairs to apply to loaded documents. Default: The default permissions associated with the user inserting the document. Example: -output_permissions role1,read,role2,update
-output_quality string
The quality of loaded documents. Default: 0.
-output_uri_prefix string
Specify a prefix to prepend to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion.
-output_uri_replace comma-list
A comma separated list of (regex,string) pairs that define string replacements to apply to the URIs of documents added to the database. The replacement strings must be enclosed in single quotes. For example, -output_uri_replace "regex1,'string1',regex2,'string2'"
-output_uri_suffix string
Specify a suffix to append to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion.
-polling_init_delay
The initial delay (in minutes) before mlcp starts sending polling requests to check the available server threads. Default: 1.
-polling_period
The interval (in minutes) at which mlcp sends polling requests to check the currently available server threads. Default: 1.
-restrict_hosts boolean
Restrict mlcp to connect to MarkLogic only through the hosts listed in the -host option. For more details, see Restricting the Hosts mlcp Uses to Connect to MarkLogic.
-split_input boolean
Whether or not to divide input data into logical chunks to support more concurrency. Only supported when -input_file_type is one of the following: delimited_text. Default: false for local mode. Data that contains multi-byte characters must be UTF-8-encoded to use this option. For details, see Improving Throughput with -split_input.
-ssl boolean
Enable/disable SSL secured communication with MarkLogic. Default: false. If you set this option to true, your App Server must be SSL enabled. For details, see Connecting to MarkLogic Using SSL.
-ssl_protocol string
Specify the protocol mlcp should use when creating an SSL connection to MarkLogic. You must include this option if you use the -ssl option to connect to an App Server configured to disable MarkLogic's default protocol (TLSv1.2). Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: TLSv1.2.
-keystore_path string

Path to a Java KeyStore containing the User Private Key(s) and Certificate(s). If available, mlcp selects the first certificate in the KeyStore that satisfies the TLS Certificate Request from the MarkLogic Server.

Can be passed along with the existing -ssl option.

-keystore_password string

Password to a Java KeyStore containing the User Private Key(s) and Certificate(s). If available, mlcp selects the first certificate in the KeyStore that satisfies the TLS Certificate Request from the MarkLogic Server.

Can be passed along with the existing -ssl option.

-truststore_path string

Path to a Java TrustStore containing any necessary CA Certificates needed to verify the TLS Server Authentication connection. If no TrustStore is provided the default TrustStore used by the existing -ssl parameter is used.

Can be passed along with the existing -ssl option.

-truststore_passwd string

Password to a Java TrustStore containing any necessary CA Certificates needed to verify the TLS Server Authentication connection. If no TrustStore is provided the default TrustStore used by the existing -ssl parameter is used.

Can be passed along with the existing -ssl option.

-streaming boolean
Whether or not to stream documents to MarkLogic Server. Applies only when -input_file_type is documents.
-temporal_collection string
The temporal collection into which the temporal documents are to be loaded. For details on loading temporal documents into MarkLogic, see Using MarkLogic Content Pump (MLCP) to Load Temporal Documents in the Temporal Developer's Guide.
-thread_count number

The number of threads to spawn for concurrent loading.

Prior to 10.0-4.2, the default thread count was 4. mlcp now conducts an initial poll to identify the available server threads on the port that handles mlcp requests, and uses this value as the default thread count. You can override it by specifying -thread_count on the command line.

-thread_count_per_split number

The maximum number of threads that can be assigned to each split.

If you specify -thread_count_per_split, each input split will run with the specified number.

The total thread count, however, is controlled by the newly calculated thread count, or by -thread_count if it is specified.

-tolerate_errors boolean

NOTE: This option is deprecated, ignored, and will be removed in a future release. mlcp always behaves as if -tolerate_errors is true.

Applicable only when -batch_size is greater than 1. When this option is true and batch size is greater than 1, if an error occurs for one or more documents during loading, only the erroneous documents are skipped; all other documents are inserted into the database. When this option is false or batch size is 1, errors during insertion can cause all the inserts in the current batch to be rolled back. Default: false.

-transform_function string
The local name of a custom content transformation function installed on MarkLogic Server. Ignored if -transform_module is not specified. Default: transform. For details, see Transforming Content During Ingestion.
-transform_module string
The path in the modules database or modules directory of a custom content transformation function installed on MarkLogic Server. This option is required to enable a custom transformation. For details, see Transforming Content During Ingestion.
-transform_namespace string
The namespace URI of the custom content transformation function named by -transform_function. Ignored if -transform_module is not specified. Default: no namespace. For details, see Transforming Content During Ingestion.
-transform_param string
Optional extra data to pass through to a custom transformation function. Ignored if -transform_module is not specified. Default: no namespace. For details, see Transforming Content During Ingestion.
-transaction_size number
The number of requests to MarkLogic Server per transaction. Default: 1. Maximum: 4000/actualBatchSize.
-type_filter comma-list
A comma-separated list of document types. Only usable with -input_file_type forest. mlcp imports only documents with these types. This option can be combined with other filter options. Default: Import all documents.
-uri_id string

Specify a field, XML element name, or JSON property name to use as the basis of the output document URIs when importing delimited text, aggregate XML, or line-delimited JSON data.

With -input_file_type aggregates or -input_file_type delimited_json, the element, attribute, or property name within the document to use as the document URI. Default: None; the URI is based on the file name, as described in Default Document URI Construction.

With -input_file_type delimited_text, the column name that contributes to the id portion of the URI for inserted documents. Default: The first column.

-xml_repair_level string
The degree of repair to attempt on XML documents in order to create well-formed XML. Accepted values: default, full, none. Default: default, which depends on the configured MarkLogic Server default XQuery version: In XQuery 1.0 and 1.0-ml the default is none. In XQuery 0.9-ml the default is full.
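
As an illustration of how the connection options and the batching limits in the tables above fit together, the following sketch builds an argument list for an import job and checks the documented maximums first. It rests on several assumptions: the helper name build_import_args is hypothetical, the mlcp.sh launcher script is assumed to be on the PATH, the actual batch size is assumed to equal the requested -batch_size, and integer division is assumed for the -transaction_size limit. The host, credentials, and path are placeholders.

```python
def build_import_args(host, username, password, input_file_path,
                      port=8000, input_file_type="documents",
                      batch_size=100, transaction_size=1):
    """Build an mlcp import argument list after validating batching limits."""
    if not 1 <= batch_size <= 200:
        raise ValueError("-batch_size must be between 1 and 200")
    # Documented maximum: 4000/actualBatchSize. Assumption: the actual batch
    # size equals the requested one and the quotient is rounded down.
    max_transaction_size = 4000 // batch_size
    if not 1 <= transaction_size <= max_transaction_size:
        raise ValueError(
            f"-transaction_size must be between 1 and {max_transaction_size}"
        )
    return [
        "mlcp.sh", "import",
        "-host", host,
        "-port", str(port),
        "-username", username,
        "-password", password,
        "-input_file_path", input_file_path,
        "-input_file_type", input_file_type,
        "-batch_size", str(batch_size),
        "-transaction_size", str(transaction_size),
    ]


# Placeholder values for illustration only.
args = build_import_args("localhost", "user", "password", "/data/in",
                         batch_size=200, transaction_size=20)
print(" ".join(args))
```

The returned list can be passed to subprocess.run. Keep in mind that, as described in the retry discussion earlier in this chapter, mlcp does not retry failed commits when both -batch_size and -transaction_size are larger than 1.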

We do not recommend using concurrent mlcp jobs. Regardless of the version, mlcp doesn't support concurrent jobs if mlcp is importing from/exporting to the same data file. In addition, beginning in 10.0-4.2, each mlcp job uses the maximum number of threads available on the server as the default thread count (more about this can be found in the 10.0-4.2 release notes). Therefore, using concurrent mlcp jobs will not improve performance, as one job is already using full concurrent capacity.
