
Loading Content Into MarkLogic Server — Chapter 8

Using MarkLogic Content Pump

MarkLogic Content Pump (mlcp) is a command line tool for getting data into and out of a MarkLogic Server database. This chapter covers the following topics:

MarkLogic Content Pump Overview

Using mlcp, you can import documents and metadata to a database, export documents and metadata from a database, or copy documents and metadata from one database to another. For example:

  • Import content into a MarkLogic Server database from flat files, compressed ZIP and GZIP files, or mlcp database archives.
  • Create documents from flat files, delimited text files, Hadoop sequence files, and aggregate XML files. For details, see Importing Content Into MarkLogic Server.
  • Import mixed content types from a directory, using the file suffix and MIME type mappings to determine document type. Files with unrecognized or missing suffixes are imported as binary documents. For details, see How mlcp Determines Document Type.
  • Export the contents of a MarkLogic Server database to flat files, a compressed ZIP file, or an mlcp database archive. For details, see Exporting Content from MarkLogic Server.
  • Copy content and metadata from one MarkLogic Server database to another. For details, see Copying Content Between Databases.

The mlcp tool has two modes of operation:

  • Local: mlcp drives all its work on the host where it is invoked. Resources such as import data and export destination must be reachable from that host.
  • Distributed: mlcp distributes its workloads across the nodes in a Hadoop cluster. Resources such as import data and export destination must be reachable from the cluster, which usually means via HDFS.

Distributed mode requires a Hadoop installation. For details, see Configuring Distributed Mode.

To understand the difference between the two modes, consider the following: When loading documents in local mode, all the input data must be reachable from the host on which mlcp is run, and all communication with MarkLogic Server is through that host. Throughput is limited by such things as memory and network bandwidth available to the host running mlcp. When loading documents in distributed mode, multiple nodes in a Hadoop cluster communicate with MarkLogic Server, so greater concurrency can be achieved, while placing fewer resource demands on any one host.
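
For example, the same import can run in either mode just by changing -mode; the connection details and paths below are placeholders, and distributed mode additionally requires the Hadoop setup described in Configuring Distributed Mode:

$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -input_file_path /space/bill/data -mode local
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -input_file_path hdfs:/bill/data -mode distributed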

Terms and Definitions

You should be familiar with the following terms and definitions when using mlcp:

  • aggregate: XML content that includes recurring element names and which can be split into multiple documents with the recurring element as the document root. For details, see Splitting Large XML Files Into Multiple Documents.
  • archive: A compressed MarkLogic Server database archive created using the mlcp export command. You can use an archive to restore or copy database content and metadata with the mlcp import command. For details, see Exporting to an Archive.
  • HDFS: The Hadoop Distributed File System, which can be used as an input source or an output destination in distributed mode.
  • sequence file: A flat file of binary key-value pairs in one of the Apache Hadoop SequenceFile formats. The mlcp tool only supports importing Text and BytesWritable values from a sequence file.
  • split: The unit of work for one thread in local mode or one MapReduce task in distributed mode.

Modifying the Example Commands for Windows

All the examples in this guide use Unix command line syntax. If you are using mlcp with the Windows command interpreter, Cmd.exe, use the following guidelines to construct equivalent commands:

  • Replace mlcp.sh with mlcp.bat.
  • For aesthetic reasons, long example command lines are broken into multiple lines using the Unix line continuation character '\'. Remove the line continuation characters and place the entire command on one line, or replace the line continuation characters with the Windows equivalent, '^'.
  • Replace option arguments enclosed in single quotes (') with double quotes ("). If the single-quoted string contains embedded double quotes, escape the inner quotes.
  • Escape any unescaped characters that have special meaning to the Windows command interpreter.

For example, the following Unix command line:

$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -input_file_path /space/bill/data -mode local \
    -output_uri_replace "/space,'',/bill/data/,'/will/'" \
    -output_uri_prefix /plays

Corresponds to this Windows command line:

C:\Example> mlcp.bat import -host localhost -port 8006 -username user ^
    -password passwd -input_file_path C:\space\bill -mode local ^
    -output_uri_replace "C:\space,''" ^
    -output_uri_prefix /plays

Downloading and Installing mlcp

This section covers the following topics:

Supported Platforms

In local mode, mlcp is supported on the same platforms as MarkLogic Server, including 64-bit Linux, Windows, Solaris, and Macintosh OS X.

For details, see Supported Platforms in the Installation Guide.

Distributed mode is only supported on 64-bit Linux.

Required Software

The following software is required to use mlcp:

  • MarkLogic Server version 5.0-5, or MarkLogic Server version 6.0-1 or later, with an XDBC App Server configured to use the target database. The App Server must not be SSL-enabled.
  • Java JRE 1.6. The Oracle/Sun JRE is recommended.

In distributed mode, mlcp also requires a Hadoop MapReduce installation. The following distributions are supported. Though only these distributions are supported, mlcp might work with other distributions based on equivalent versions of Apache Hadoop.

Installing mlcp

Follow these instructions to download and install mlcp.

  1. Download MarkLogic Content Pump from developer.marklogic.com.
  2. Unpack the mlcp distribution to a location of your choice. This creates a directory named mlcp-HadoopN-version, where N is the compatible Hadoop major version and version is the mlcp version. For example, assuming /space/marklogic contains the mlcp ZIP file, the following commands install mlcp under /space/marklogic/mlcp-Hadoop2-1.1/:
    $ cd /space/marklogic
    $ unzip mlcp-Hadoop2-1.1-bin.zip
  3. Optionally, put the mlcp bin directory on your path. For example:
    $ export PATH=${PATH}:/space/marklogic/mlcp-Hadoop2-1.1/bin
  4. Put the java command on your path. For example:
    $ export PATH=${PATH}:$JAVA_HOME/bin
  5. If you plan to use mlcp in distributed mode, you must have a Hadoop installation and must configure your environment so mlcp can find your Hadoop installation. For details, see Configuring Distributed Mode.
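
As a quick sanity check of your installation, you can run a small local-mode import against a test XDBC App Server; the host, port, credentials, and input path below are placeholders for your own environment:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local -input_file_path /tmp/sample.xml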

Security Considerations

When you use mlcp, you supply one or more user names with which to interact with MarkLogic Server. If a user does not have admin privileges, then the user must have at least the privileges listed below.

Additional privileges may be required. The privileges listed below only enable use of MarkLogic Server as a data source or destination. For example, they do not grant read or update permissions to the database.

  • import: Requires the hadoop-user-write privilege. Applies to the user name specified with -username. It is recommended that you also set -output_permission to set the permissions on inserted documents.
  • export: Requires the hadoop-user-read privilege. Applies to the user name specified with -username.
  • copy: Requires hadoop-user-read on the input side and hadoop-user-write on the output side. The -input_username user must have the hadoop-user-read privilege on the source MarkLogic Server instance, and the -output_username user must have the hadoop-user-write privilege on the destination MarkLogic Server instance.

You cannot use mlcp with an SSL-enabled App Server.

Using mlcp with a Multi-Host MarkLogic Cluster

To optimize performance, mlcp interacts with MarkLogic Server at the forest level. When MarkLogic Server is deployed across a multi-host cluster, the forests of a database can be distributed across multiple nodes, which means mlcp can interact with multiple hosts in your MarkLogic cluster. This interaction imposes some cluster configuration requirements:

For more information about clustering, see Clustering in MarkLogic Server in the Scalability, Availability, and Failover Guide.

All Forest Hosts Must Be Reachable

The mlcp tool uses the connection information provided on the command line to make initial contact with MarkLogic Server and 'discover' the network topology of the target database(s). Thereafter, mlcp interacts directly with the hosts that contain forests in the database.

Therefore, all hosts that contain forests of a database accessed by mlcp must be reachable by hostname. That is, the host name under which a forest host joined the MarkLogic cluster must be resolvable by mlcp and the host must be accessible on the network.

For example, if a forest host joins the MarkLogic cluster as HostA, but is only resolvable as HostA-Alias in the subnet from which you execute mlcp, then mlcp will not be able to communicate with that host.

In local mode, the forest hosts must be reachable from the host where you execute the mlcp command. In distributed mode, the forest hosts must be reachable from the worker nodes of your Hadoop cluster.
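
For example, if a forest host joined the cluster under a name your client host cannot resolve through DNS, one possible workaround (shown here as a sketch; the addresses and names are hypothetical) is to map the names locally on the host running mlcp:

# /etc/hosts on the host running mlcp
10.0.0.11   HostA
10.0.0.12   HostB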

Export and Copy Require In-Forest Evaluation of Queries

When you export or copy documents, mlcp uses in-forest query evaluation to identify the desired documents in each forest of the source database. Therefore, each host that contains a forest in the source database must be configured to act as both an e-node and a d-node.

That is, each host must be capable of providing both query evaluation and data services. A pure d-node (for example, a host with a very small configured expanded tree cache) is not usable with the mlcp export or copy commands.

Cluster-Wide XDBC Configuration Requirements

Because mlcp uses an XDBC App Server and in-forest query evaluation, your cluster might need special configuration to support MapReduce jobs.

Each host that has at least one forest attached to a database mlcp interacts with must have an XDBC App Server configured for that database. Additionally, the XDBC App Server must listen on the same port on each host. For a copy operation, the source App Server, database and port can be the same or different from the destination App Server, database and port.

Hosts within a group share the same App Server configuration, so you only need additional App Servers if hosts with forests attached to the input or output database are in multiple groups.

For example, consider a cluster in which the database has 3 forests, located on 3 hosts in 2 different groups. That cluster is properly configured for use with mlcp only if both Group 1 and Group 2 make the database accessible on the same port, such as 9001.

Configuring Distributed Mode

Using mlcp in distributed mode requires a Hadoop installation. For supported versions, see Required Software.

Hadoop does not have to be installed on the host where you run mlcp, but the Hadoop configuration files must be reachable by mlcp.

Specifying the Hadoop Configuration File Location

You must tell mlcp where to find the Hadoop configuration files on the host where you run mlcp. Hadoop does not need to be installed on this host, but the Hadoop configuration files must be reachable.

Use one of the following methods to tell mlcp where to find your Hadoop configuration files locally:

  • Set the mlcp command line option -hadoop_conf_dir. For example:
    $ mlcp.sh command -hadoop_conf_dir /etc/hadoop/conf
  • Set the environment variable HADOOP_CONF_DIR. For example:
    $ export HADOOP_CONF_DIR=/etc/hadoop/conf

If your Apache Hadoop installation is on a remote host, you can copy the configuration files locally and set HADOOP_CONF_DIR (or -hadoop_conf_dir) to that directory.
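
For example, assuming the configuration lives under /etc/hadoop/conf on a remote node named hadoop-master (a hypothetical host name), you might copy it down and point mlcp at the local copy:

$ scp -r hadoop-master:/etc/hadoop/conf /tmp/hadoop-conf
$ export HADOOP_CONF_DIR=/tmp/hadoop-conf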

Setting Custom Hadoop Options and Properties

If you want to pass through additional Hadoop properties to tune mlcp's use of Hadoop in distributed mode, use these options:

  • -conf conf_filename : Pass in a Hadoop configuration properties file.
  • -D property=value : Pass one Hadoop configuration property setting.

The property names and configuration file syntax is as dictated by Hadoop. For details, consult the documentation for your Hadoop distribution.

These Hadoop options must appear on the command line after -options_file (if present) and before any mlcp-specific options.
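
For example, the following sketch passes a single Hadoop property ahead of the mlcp-specific options; the property name is illustrative and depends on your Hadoop version:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -D mapreduce.job.queuename=default \
    -host localhost -port 8006 -username user -password passwd \
    -mode distributed -input_file_path hdfs:/bill/data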

Required Hadoop User Privileges

When you use distributed mode for import, the user your Hadoop tasks run as must have permission to access the directories or files specified by -input_file_path. Similarly, when you use distributed mode for export, the user must have permission to create directories and files in the directory specified by -output_file_path.

Importing Content Into MarkLogic Server

You can insert content into a MarkLogic Server database from flat files, compressed ZIP and GZIP files, aggregate XML files, Hadoop sequence files, and MarkLogic Server database archives. The input data can be accessed from the native filesystem or HDFS.

For a list of import related options, see Import Options.

This section covers the following topics:

Supported Input Format Summary

The following table provides a quick reference of the supported input file types, along with the allowed document types for each, and whether or not they can be passed to mlcp as compressed files.

  • documents: Creates XML, text, or binary documents, controlled with -document_type. -input_compressed is permitted.
  • archive: Document types are as in the source database: XML, text, and/or binary documents, plus metadata. The type is not under user control. -input_compressed is not permitted (archives are already in compressed format).
  • delimited_text: Creates XML documents. -input_compressed is permitted.
  • sequencefile: Creates XML, text, or binary documents, controlled with -input_sequencefile_value_class and -input_sequencefile_value_type. -input_compressed is not permitted; however, the contents can be compressed when you create the sequence file. Compression is bound up with the value class you use to generate and import the file.
  • aggregates: Creates XML documents. -input_compressed is permitted.

When the input file type is documents or sequencefile, you must consider both the input format (-input_file_type) and the output document format (-document_type). In addition, for some input formats, input can come from either compressed or uncompressed files (-input_compressed).

Use the -input_file_type option to tell mlcp the format of the data in each input file (or each entry inside a compressed file). This option controls if/how mlcp converts the content into database documents. The default input type is documents, which means each input file or ZIP file entry creates one database document.

All other input file types represent composite input formats which can yield multiple database documents per input file. The mlcp tool supports the following composite input file types: aggregate XML (aggregates), delimited text files (delimited_text), Hadoop sequence files (sequencefile), and database archives (archive).

The -document_type option controls the database document format when -input_file_type is documents or sequencefile. MarkLogic Server supports text, XML, or binary documents. If the document type is not explicitly set with these input file types, mlcp uses the input file suffix to determine the type. For details, see How mlcp Determines Document Type.

To illustrate how -input_file_type and -document_type fit together, consider a Hadoop sequence file that contains binary values. You would set the following options:

  • -input_file_type sequencefile
  • -document_type binary

If the sequence file contained text rather than binary values, then -input_file_type is unchanged, but -document_type becomes text:

  • -input_file_type sequencefile
  • -document_type text (or xml, if the values are valid XML)

Understanding Input File Path Resolution

If you do not explicitly include a URI scheme prefix such as file: or hdfs: on the input file path, mlcp uses the following rules to locate the input path:

  • In local mode, mlcp defaults to the local file system (file).
  • In distributed mode, mlcp defaults to the Hadoop default scheme, which is usually HDFS. The Hadoop default scheme is configurable through the Hadoop configuration parameter fs.default.name.

    In distributed mode, the file scheme (file:) refers to the local filesystem of the Hadoop cluster nodes to which the job is distributed. For example, if you perform an import in distributed mode with an input file path that uses the file: prefix, the input files must be reachable along that path from all nodes in your Hadoop cluster.

The following example loads files from the local filesystem directory /space/bill/data:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -input_file_path /space/bill/data -mode local

The following example loads files from the native filesystem of each host in a Hadoop cluster, assuming /space/bill/data is a shared network path on all hosts in the Hadoop cluster:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -input_file_path file:/space/bill/data \
    -mode distributed

Controlling Database URIs During Ingestion

By default, the document URIs created by mlcp during ingestion are determined by the input source. The tool supports several command line options for modifying this default behavior.

Default Document URI Construction

The default database URI assigned to ingested documents depends on the input source. Loading content from the local filesystem or HDFS can create different URIs than loading the same content from a ZIP file or archive. Command line options are available for you to modify this behavior.

The following table summarizes the default behavior with several input sources:

  • Documents in a native or HDFS directory: /path/filename. For example, /space/data/bill/dream.xml.
  • Documents in a ZIP or GZIP file: /path/inside/zip/filename. For example, if the ZIP file contains a directory entry bill/, then the document URI for dream.xml in that directory is bill/dream.xml.
  • A GZIP compressed document: /path/filename-without-gzip-suffix. For example, if the input is /space/data/big.xml.gz, the result is /space/data/big.xml.
  • Delimited text file: first_column_value. For example, for a record of the form 'first,second,third' where Column 1 is the id, the URI is first.
  • Archive: The document URI from the source database.
  • Sequence file: The key in a key-value pair.
  • Aggregate, with mlcp v1.0-2 and earlier: hash_or_taskid-seqnum, where hash_or_taskid is a hash of the split number in local mode or the Hadoop map task id in distributed mode, and seqnum begins with 1 and increments for each document created. For example, 1243498756817-1 and 1243498756817-2.
  • Aggregate, with mlcp v1.0-3 and later: /path/filename-split_start-seqnum, where /path/filename is the full path to the input file, split_start is the byte position from the beginning of the split, and seqnum begins with 1 and increments for each document created. For example, for input file /space/data/big.xml: /space/data/big.xml-0-1 and /space/data/big.xml-0-2.

For example, the following command loads all files from the filesystem directory /space/bill/data into the database attached to the App Server on port 8006. The documents inserted into the database have URIs of the form /space/bill/data/filename.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -input_file_path /space/bill/data -mode local

If the /space/bill/data directory is zipped up into bill.zip, such that bill/ is the root directory in the ZIP file, then the following command inserts documents with URIs of the form bill/data/filename:

# Windows users, see Modifying the Example Commands for Windows 
$ cd /space; zip -r bill.zip bill
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -input_file_path /space/bill.zip \
    -mode local -input_compressed true
Transforming the Default URI

Use the following options to tailor the database URI of inserted documents:

  • -output_uri_replace performs one or more string substitutions on the default URI.
  • -output_uri_prefix prepends a string to the URI after substitution.
  • -output_uri_suffix appends a string to the URI after substitution.

The -output_uri_replace option accepts a comma delimited list of regular expression and replacement string pairs. The string portion must be enclosed in single quotes:

-output_uri_replace pattern,'string',pattern,'string'

For details on the regular expression language supported by -output_uri_replace, see Regular Expression Syntax.

These options are applied after the default URI is constructed and encoded, so if the option values contain characters not allowed in a URI, you must encode them yourself. See Character Encoding of URIs.

The following example loads documents from the filesystem directory /space/bill/data. The default output URIs would be of the form /space/bill/data/filename. The example uses -output_uri_replace to replace 'bill/data' with 'will' and strip off '/space/', and then adds a '/plays' prefix using -output_uri_prefix. The end result is output URIs of the form /plays/will/filename.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -input_file_path /space/bill/data -mode local \
    -output_uri_replace "/space,'',/bill/data/,'/will/'" \
    -output_uri_prefix /plays
Character Encoding of URIs

If a URI constructed by mlcp contains special characters that are not allowed in URIs, mlcp automatically encodes them. This applies to the special characters ' ' (space), '%', '?' or '#'. For example, 'foo bar.xml' becomes 'foo%20bar.xml'.

If you supply a URI or URI component, you are responsible for ensuring the result is a legitimate URI. No automatic encoding takes place. This applies to -output_uri_replace, -output_uri_prefix, and -output_uri_suffix. The changes implied by these options are applied after mlcp encodes the default URI.
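
For example, if you want a URI prefix that contains a space, you must supply it already encoded, because mlcp will not encode option values for you:

# Hypothetical prefix containing a space, pre-encoded as %20
$ mlcp.sh import ... -mode local -output_uri_prefix /my%20plays/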

When mlcp exports documents from the database to the file system (or HDFS) such that the output directory and/or file names are derived from the document URI, the special symbols are decoded. That is, 'foo%20bar.xml' becomes 'foo bar.xml' when exported. For details, see How URI Decoding Affects Output File Names.

How mlcp Determines Document Type

The document type determines what kind of database document mlcp inserts from input content: Text, XML, or binary. Document type is determined in the following ways:

  • Document type can be inherent in the input file type. For example, delimited_text, aggregates, and rdf input files always insert XML documents. For details, see Supported Input Format Summary.
  • You can specify a document type explicitly with -document_type. For example, to load documents as XML, use -input_file_type documents -document_type xml. You cannot set an explicit type for all input file types.
  • mlcp can determine document type dynamically from the output document URI and the MarkLogic Server MIME type mappings when you use -input_file_type documents -document_type mixed.

If you set -document_type to an explicit type such as -document_type xml, then mlcp inserts all documents as that type.

If you use -document_type mixed, then mlcp determines the document type from the output URI suffix and the MIME type mapping configured into MarkLogic Server. Mixed is the default behavior for -input_file_type documents.

You can only use -document_type mixed when the input file type is documents.

If an unrecognized or unmapped file extension is encountered when loading mixed documents, mlcp creates a binary document.

The following table contains examples of applying the default MIME type mappings to output URIs with various file extensions, an unknown extension, and no extension. The default mapping includes many additional suffixes. You can examine and create MIME type mappings under the Mimetypes section of the Admin Interface. For more information, see Implicitly Setting the Format Based on the MIME Type.

  • /path/doc.xml: XML
  • /path/doc.jpg: binary
  • /path/doc.txt: text
  • /path/doc.unknown: binary
  • /path/doc-nosuffix: binary

The MIME type mapping is applied to the final output URI, that is, the URI that results from applying the URI transformation options described in Controlling Database URIs During Ingestion. The following examples show how URI transformations can affect the output document type in mixed mode, assuming the default MIME type mappings.

  • Input filename /path/doc.1 with no URI options: the output URI is /path/doc.1 and the document type is binary.
  • Input filename /path/doc.1 with -output_uri_suffix ".xml" (adds a .xml suffix): the output URI is /path/doc.1.xml and the document type is XML.
  • Input filename /path/doc.1 with -output_uri_replace "\.\d+,'.txt'" (replaces the unmapped suffix with .txt): the output URI is /path/doc.txt and the document type is text.

Loading Documents from a Directory

This section discusses importing documents stored as flat files on the native filesystem or HDFS. The following topics are covered:

Loading a Single File

Use the following procedure to load a single file from the native filesystem or HDFS. To load all the files in a directory, see Loading All the Files in a Directory.

  1. Set -input_file_path to the path to the input file.
  2. Set -input_file_type if your input files are not documents. For example, if loading from delimited text files, sequence files, aggregate XML files, or database archives.
  3. Set -document_type if -input_file_type is not documents and the content type cannot be accurately deduced from the file suffixes as described in How mlcp Determines Document Type.
  4. Set -mode:
    • If Hadoop is available and you want to distribute the workload across a Hadoop cluster, set -mode to distributed.
    • If Hadoop is not installed or you want mlcp to perform the work locally, set -mode to local. (This is the default mode).

      If you are loading from the native filesystem in distributed mode or from HDFS in local mode, you might need to qualify the input file path with a URI scheme of file: or hdfs:. See Understanding Input File Path Resolution.

By default, the imported document has a database URI based on the input file path. For details, see Controlling Database URIs During Ingestion.

The following example command loads a single XML file:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -input_file_path /space/bill/data/hamlet.xml
Loading All the Files in a Directory

Use the following procedure to load all the files in a native or HDFS directory and its sub-directories. To load selected files, see Filtering Documents Loaded From a Directory.

  1. Set -input_file_path to the input directory.
  2. Set -input_file_type if your input files are not documents. For example, if loading from delimited text files, sequence files, aggregate XML files, or database archives.
  3. Set -document_type if -input_file_type is not documents and the content type cannot be accurately deduced from the file suffixes as described in How mlcp Determines Document Type.
  4. Set -mode:
    • If Hadoop is available and you want to distribute the workload across a Hadoop cluster, set -mode to distributed.
    • If Hadoop is not installed or you want mlcp to perform the work locally, set -mode to local. (This is the default mode).

      If you are loading from the native filesystem in distributed mode or from HDFS in local mode, you might need to qualify the input file path with a URI scheme of file: or hdfs:. See Understanding Input File Path Resolution.

By default, the imported documents have database URIs based on the input file path. For details, see Controlling Database URIs During Ingestion.

The following example command loads all the files in /space/bill/data:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -input_file_path /space/bill/data
Filtering Documents Loaded From a Directory

If -input_file_path names a directory, mlcp loads all the documents in the input directory and subdirectories by default. Use the -input_file_pattern option to filter the loaded documents based on a regular expression.

For example, the following command loads only files with a '.xml' suffix from the directory /space/bill/data:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -input_file_path /space/bill/data \
    -mode local -input_file_pattern '.*\.xml'

The mlcp tool uses Java regular expression syntax. For details, see Regular Expression Syntax.

Loading Documents From Compressed Files

You can load content from one or more compressed files. Filtering of compressed file content is not supported; mlcp loads all documents in a compressed file.

Follow this procedure to load content from one or more ZIP or GZIP compressed files.

  1. Set -input_file_path:
    • To load from a single file, set -input_file_path to the path to the compressed file.
    • To load from multiple files, set -input_file_path to a directory containing the compressed files.
  2. If the content type cannot be accurately deduced from suffixes of the files inside the compressed file as described in How mlcp Determines Document Type, set -document_type appropriately.
  3. Set -input_compressed to true.
  4. If the compressed file suffix is not '.zip' or '.gzip', specify the compressed file format by setting -input_compression_codec to zip or gzip.

If you set -document_type to anything but mixed, then the contents of the compressed file must be homogeneous. For example, all XML or all binary.

The following example command loads binary documents from the compressed file /space/images.zip on the local filesystem.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local -document_type binary \
    -input_file_path /space/images.zip -input_compressed

The following example loads all the files in the compressed file /space/example.jar, using -input_compression_codec to tell mlcp the compression format because of the '.jar' suffix:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password passwd -mode local -input_file_path /space/example.jar \
    -input_compressed true -input_compression_codec zip

If -input_file_path is a directory, mlcp loads contents from all compressed files in the input directory, recursing through subdirectories. The input directory must not contain other kinds of files.

By default, the URI prefix on documents loaded from a compressed file mirrors the directory hierarchy inside the compressed file. For example, if a ZIP file contains bill/data/dream.xml then the ingested document URI is also bill/data/dream.xml. To override this behavior, see Controlling Database URIs During Ingestion.

Loading Content and Metadata From an Archive

Follow this procedure to import content and metadata from a database archive created by the mlcp export command. A database archive is stored in one or more compressed files that contain documents and metadata.

  1. Set -input_file_path:
    • To load a single archive file, set -input_file_path to that file.
    • To load multiple archive files, set -input_file_path to a directory containing the compressed archive files.
  2. Set -document_type to mixed, or leave it unset since mixed is the default setting.
  3. Set -input_compressed to true.
  4. Set -input_file_type to archive.
  5. If the input archive was created without any metadata, set -archive_metadata_optional to true. If this is not set, an exception is thrown if the archive contains no metadata.
  6. If you want to exclude some or all of the document metadata in the archive:
    • Set -copy_collections to false to exclude document collections metadata.
    • Set -copy_permissions to false to exclude document permissions metadata.
    • Set -copy_properties to false to exclude document properties.
    • Set -copy_quality to false to exclude document quality metadata.

An archive is assumed to contain metadata. However, it is possible to create archives without metadata by setting all the metadata copying options (-copy_collections, -copy_permissions, etc.) to false during export. If an archive does not contain metadata, you must set -archive_metadata_optional to tell mlcp to proceed in the absence of metadata.

The following example command loads the database archive in /space/archive_dir:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local -input_file_type archive \
    -input_file_path /space/archive_dir

Splitting Large XML Files Into Multiple Documents

Very large XML files often contain aggregate data that can be disaggregated by splitting it into multiple smaller documents rooted at a recurring element. Disaggregating large XML files consumes fewer resources during loading and improves performance when searching and retrieving content.

The following mlcp options support creating multiple documents from aggregate data:

  • -aggregate_record_element
  • -aggregate_uri_id
  • -aggregate_record_namespace

You can disaggregate XML when loading from either flat or compressed files. For more information about working with compressed files, see Loading Documents From Compressed Files.

Follow this procedure to create documents from aggregate XML input:

  1. Set -input_file_path:
    • To load from a single file, set -input_file_path to the path to the aggregate XML file.
    • To load from multiple files, set -input_file_path to a directory containing the aggregate files. The directory must not contain other kinds of files.
  2. If you are loading from a compressed file, set -input_compressed.
  3. Set -input_file_type to aggregates.
  4. Set -aggregate_record_element to the element QName of the node to use as the root for all inserted documents. See the example below. The default is the first child element under the root element.

    The element QName should appear at only one level. You cannot specify the element name using a path, so disaggregation occurs everywhere that name is found.

  5. Optionally, override the default document URI by setting -aggregate_uri_id to the name of the element from which to derive the document URI.

The default URI is hashcode-seqnum in local mode and taskid-seqnum in distributed mode. If there are multiple matching elements, the first match is used.

  6. If the aggregate record element is in a namespace, set -aggregate_record_namespace to the input namespace.

The example below uses the following input data:

$ cat > example.xml
<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person>
    <first>George</first>
    <last>Washington</last>
  </person>
  <person>
    <first>Betsy</first>
    <last>Ross</last>
  </person>
</people>

The following command breaks the input data into a document for each <person> element. The -aggregate_uri_id and other URI options give the inserted documents meaningful names. The command creates URIs of the form '/people/lastname.xml' by using the <last/> element as the aggregate URI id, along with an output prefix and suffix:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local -input_file_path example.xml \
    -input_file_type aggregates -aggregate_record_element person \
    -aggregate_uri_id last -output_uri_prefix /people/ \
    -output_uri_suffix .xml

The command creates two documents: /people/Washington.xml and /people/Ross.xml. For example, /people/Washington.xml contains:

<?xml version="1.0" encoding="UTF-8"?>
<person>
    <first>George</first>
    <last>Washington</last>
</person>

If the input data is in a namespace, set -aggregate_record_namespace to that namespace. For example, if the input data is modified to include a namespace:

$ cat > example.xml
<?xml version="1.0" encoding="UTF-8"?>
<people xmlns="http://marklogic.com/examples">...</people>

Then mlcp ingests no documents unless you set -aggregate_record_namespace. Setting the namespace creates two documents in the namespace 'http://marklogic.com/examples'. For example, after running the following command:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local -input_file_path example.xml \
    -input_file_type aggregates -aggregate_record_element person \
    -aggregate_uri_id last -output_uri_prefix /people/ \
    -output_uri_suffix .xml \
    -aggregate_record_namespace "http://marklogic.com/examples"

The document with URI '/people/Washington.xml' contains:

<?xml version="1.0" encoding="UTF-8"?>
<person xmlns="http://marklogic.com/examples">
    <first>George</first>
    <last>Washington</last>
</person>

Creating Documents from Delimited Text Files

Use the delimited_text input file type to import content from a delimited text file and create an XML document corresponding to each line.

The first line in the input file should contain column headers. For each line after the header line, mlcp creates an XML document with a root node of <root> and child elements with names corresponding to each column title.

For example, given the following data and mlcp command:

# Windows users, see Modifying the Example Commands for Windows 
$ cat example.csv
first,last
george,washington
betsy,ross
$ mlcp.sh ... -mode local -input_file_path /space/mlcp/data \
    -input_file_type delimited_text

mlcp creates two documents with the following contents:

<root>
  <first>george</first>
  <last>washington</last>
</root>
<root>
  <first>betsy</first>
  <last>ross</last>
</root>

By default, the document URIs use the value in the first column. In the example above, the two documents have URIs corresponding to the 'first' value, or 'george' and 'betsy'. Use -delimited_uri_id to choose a different column. For example, the following command creates the documents 'washington' and 'ross':

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh ... -mode local -input_file_path /space/mlcp/data \
    -input_file_type delimited_text -delimited_uri_id last

You can further tailor the URIs using -output_uri_prefix and -output_uri_suffix. For details, see Controlling Database URIs During Ingestion.

Creating Documents from Hadoop Sequence Files

A Hadoop sequence file is a flat binary file of key-value pairs. You can use mlcp to create a document from each key-value pair. The only supported value types are Text and BytesWritable. This section covers the following topics:

Basic Steps

You must implement Hadoop SequenceFile key and value classes that also implement two special mlcp interfaces. To learn more about Apache Hadoop SequenceFile, see http://wiki.apache.org/hadoop/SequenceFile/.

  1. Implement com.marklogic.contentpump.SequenceFileKey and com.marklogic.contentpump.SequenceFileValue.
  2. Generate one or more sequence files using your classes.
  3. Deploy your classes into mlcp_install_dir/lib.
  4. Use the mlcp import command to create documents from your sequence files.

The source distribution of mlcp, available from http://developer.marklogic.com, includes an example in com.marklogic.contentpump.examples.

Implementing the Key and Value Interfaces

You must read and write your sequence files using classes that implement com.marklogic.contentpump.SequenceFileKey and com.marklogic.contentpump.SequenceFileValue. These interfaces are included in the mlcp jar file:

mlcp_install_dir/lib/mlcp-HadoopN-version.jar

Where N is the Hadoop major version supported by your mlcp installation and version is your mlcp version. For example, if you install mlcp v1.1 compatible with Hadoop v2 to /opt/mlcp-Hadoop2-1.1, then the jar file is:

/opt/mlcp-Hadoop2-1.1/lib/mlcp-Hadoop2-1.1.jar

Source and an example implementation are available in the mlcp source distribution on developer.marklogic.com.

Your key class must implement the following interface:

package com.marklogic.contentpump;

import com.marklogic.mapreduce.DocumentURI;

public interface SequenceFileKey {
    DocumentURI getDocumentURI();
}

Your value class must implement the following interface:

package com.marklogic.contentpump;

public interface SequenceFileValue<T> {
    T getValue();
}

For an example, see com.marklogic.contentpump.examples.SimpleSequenceFileKey and com.marklogic.contentpump.examples.SimpleSequenceFileValue.
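
The following is a minimal, hypothetical sketch of what a key implementation along these lines might look like. It assumes DocumentURI has a no-arg constructor and a setUri method, and that the key class must also implement org.apache.hadoop.io.Writable so it can be serialized in a sequence file; consult the bundled examples for the authoritative version.

package com.example;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import com.marklogic.contentpump.SequenceFileKey;
import com.marklogic.mapreduce.DocumentURI;

// Hypothetical key class: wraps the document URI as Hadoop Text so the
// key can be written to and read from a sequence file.
public class MySequenceFileKey implements SequenceFileKey, Writable {
    private final Text uri = new Text();

    public void setURI(String value) {
        uri.set(value);
    }

    @Override
    public DocumentURI getDocumentURI() {
        DocumentURI docURI = new DocumentURI();   // assumes a no-arg constructor
        docURI.setUri(uri.toString());            // assumes a setUri method
        return docURI;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        uri.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        uri.readFields(in);
    }
}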

These interfaces depend on Hadoop and the MarkLogic Connector for Hadoop. The connector library is included in the mlcp distribution as:

mlcp_install_dir/lib/marklogic-mapreduceN-version.jar

where N is the Hadoop major version and version is the connector version. The Hadoop major version will correspond to the Hadoop major version of your mlcp distribution. For example, if you install the Hadoop v2 compatible version of mlcp, then the connector jar file name might be:

marklogic-mapreduce2-1.2.jar

For details, see the MarkLogic Connector for Hadoop Developer's Guide and the MarkLogic Hadoop MapReduce Connector API.

You must implement a sequence file creator. You can find an example in com.marklogic.contentpump.examples.SimpleSequenceFileCreator.

When compiling your classes, include the following on the Java class path:

  • mlcp_install_dir/lib/mlcp-HadoopN-version.jar
  • mlcp_install_dir/lib/marklogic-mapreduceN-version.jar
  • mlcp_install_dir/lib/hadoop-core-version.jar

For example, if you are using the Hadoop v2 compatible version of mlcp:

$ javac -cp $MLCP_DIR/lib/mlcp-Hadoop2-1.1.jar:$MLCP_DIR/lib/marklogic-mapreduce2-1.2.jar:$MLCP_DIR/lib/hadoop-core-2.0.0-mr1-cdh4.3.0.jar MyKey.java MyValue.java
$ jar -cf myseqfile.jar *.class
Deploying your Key and Value Implementation

Once you compile your SequenceFileKey and SequenceFileValue implementations into a JAR file, copy your JAR file and any dependent libraries into the mlcp lib/ directory so that mlcp can find your classes at runtime. For example:

$ cp myseqfile.jar /space/mlcp-Hadoop2-1.1/lib
Loading Documents From Your Sequence Files

Once you have created one or more sequence files using your implementation, you can create a document from each key-value pair using the following procedure:

  1. Set -input_file_path:
    • To load from a single file, set -input_file_path to the path to the file.
    • To load from multiple files, set -input_file_path to a directory containing the sequence files.
  2. Set -sequencefile_key_class to the name of your SequenceFileKey implementation.
  3. Set -sequencefile_value_class to the name of your SequenceFileValue implementation.
  4. Set -sequencefile_value_type to either Text or BytesWritable, depending on the contents of your sequence files.
  5. Set -input_file_type to sequencefile.

By default, the key in each key-value pair is used as the document URI. You can further tailor the URI using command line options, as described in Controlling Database URIs During Ingestion.

For an example, see Running the SequenceFile Example.

Running the SequenceFile Example

This section walks you through creating a sequence file and loading its contents as documents.

Create an input text file from which to create a sequence file. The file should contain pairs of lines where the first line is a URI that acts as the key, and the second line is the value. For example:

$ cat > seq_input.txt
/doc/foo.xml
<foo/>
/doc/bar.xml
<bar/>

To use the example classes provided with mlcp, put the following libraries on your Java classpath:

  • mlcp_install_dir/lib/mlcp-HadoopN-version.jar
  • mlcp_install_dir/lib/hadoop-core-version.jar
  • mlcp_install_dir/lib/commons-logging-1.1.1.jar
  • mlcp_install_dir/lib/marklogic-mapreduceN-version.jar

For example, if you install the CDH 4.3 MRv1 compatible version of mlcp v1.1, then put the following libraries on your Java classpath:

  • mlcp_install_dir/lib/mlcp-Hadoop2-1.1.jar
  • mlcp_install_dir/lib/hadoop-core-2.0.0-mr1-cdh4.3.0.jar
  • mlcp_install_dir/lib/commons-logging-1.1.1.jar
  • mlcp_install_dir/lib/marklogic-mapreduce2-1.2.jar

Generate a sequence file from your test data using com.marklogic.contentpump.examples.SimpleSequenceFileCreator. The first argument to the program is the output sequence file name. The second argument is the input data file name. The following command generates seq_output from seq_input.txt.

$ java com.marklogic.contentpump.examples.SimpleSequenceFileCreator seq_output seq_input.txt

Load the contents of the sequence file into MarkLogic Server:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -username user -password password -host localhost \
    -port 8006 -input_file_path seq_output -mode local \
    -input_file_type sequencefile -sequencefile_key_class \
    com.marklogic.contentpump.examples.SimpleSequenceFileKey \
    -sequencefile_value_class \
    com.marklogic.contentpump.examples.SimpleSequenceFileValue \
    -sequencefile_value_type Text -document_type xml

Two documents are created in the database with URIs /doc/foo.xml and /doc/bar.xml.

Performance Considerations for Loading Documents

MarkLogic Content Pump comes configured with defaults that should provide good performance under most circumstances. This section presents some performance tradeoffs to consider if you want to try to optimize throughput for your workload.

This section covers the following topics:

Time vs. Space: Configuring Batch and Transaction Size

You can tune the document insertion throughput and memory requirements of your job by configuring the batch size and transaction size of the job.

  • -batch_size controls the number of updates per request to the server.
  • -transaction_size controls the number of requests to the server per transaction.

The default batch size is 100. The default transaction size is 10. This means that the default maximum number of updates per transaction is 1000.
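
For example, the following sketch lowers the batch size and raises the transaction size, keeping the same 1000 updates per transaction while holding fewer updates in memory per request; the connection details and path are placeholders:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local -input_file_path /space/bill/data \
    -batch_size 50 -transaction_size 20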

Selecting a batch size is a speed vs. memory tradeoff. Each request to the server introduces overhead because extra work must be done. However, unless you use -streaming or -document_type mixed, all the updates in a batch stay in memory until a request is sent, so larger batches consume more memory.

Transactions introduce overhead on MarkLogic Server, so performing multiple updates per transaction can improve insertion throughput. However, an open transaction holds locks on fragments with pending updates, potentially increasing lock contention and affecting overall application performance.

It is also possible to overwhelm MarkLogic Server if you have too many concurrent sessions active.

Time vs. Correctness: Using Direct Forest Updates

The mlcp import and copy commands perform best when documents can be loaded directly into the destination forests, without going through an e-node. However, direct forest updates can create duplicate URIs under the following circumstances:

  • Content with the same URI already exists in the database, and
  • The content was inserted using user-specified forest placement, or the number of forests available for updates changed after document creation.

User-specified forest placement occurs when you specify an explicit forest id during document loading, such as the $forest-ids parameter to xdmp:document-load.

To prevent duplicate URIs, mlcp defaults to a slower protocol when loading documents. You can override this behavior using the -fastload or -output_directory options. (Using -output_directory implies -fastload). You can safely enable -fastload if the number of forests in the database will not change while mlcp is loading, and at least one of the following is true:

  • The mlcp job only creates new documents. That is, you are certain that the URIs are not in use by any document or property fragments already in the database.
  • The URIs may already be in use, but all these conditions are true:
    • The in-use URIs were not originally inserted using user-specified forest placement.
    • The number of forests available for updates in the database has not changed since the documents were created.
    • The order of forest assignments has not changed since the documents were created. Forest assignment order can change when de-assigning and then re-assigning forests to a database.

      If you import data into a database hosted by MarkLogic 7 or later, the database you import data into with -fastload or -output_directory must be using the legacy assignment policy. For details, refer to the documentation for your current version of MarkLogic Server.
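
For example, if you are certain the job only creates new documents, a hedged sketch of enabling direct forest updates looks like this (connection details and path are placeholders):

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local -input_file_path /space/new/data \
    -fastload true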

Changing the number of forests available for updates while a job is running can also cause duplicate URIs or insertion failures. The number of forests available for update can change for several reasons, including the following:

Tuning Split Size for Local Mode

This discussion applies only to importing whole documents from flat or compressed files in local mode. It does not apply to creating documents from composite files such as delimited text files, sequence files, or aggregate XML files. Split size for composite files is not tunable.

In local mode, a split defines the unit of work per thread. The ideal split size is one that keeps all your threads busy. The default split size is 32M for local mode, and the HDFS block size for distributed mode.

For example, suppose -thread_count is 10 and -max_split_size is 32M and your content consists of 120 small documents 1M in length. Then 32 documents fit into each split and there will only be 4 splits. The run uses only 4 of 10 available threads, leaving the other 6 idle. If you tune -max_split_size down to 12M, you get 10 splits and can maximize concurrency.
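
A sketch of that tuning follows; it assumes -max_split_size takes a value in bytes (12M = 12582912), so check the option reference for your mlcp version:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local -input_file_path /space/small-docs \
    -thread_count 10 -max_split_size 12582912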

Tuning Split Size for Distributed Mode

The following discussion applies only to importing whole documents from flat or compressed files in distributed mode. It does not apply to creating documents from composite files such as delimited text files, sequence files, and aggregate XML files. Split size for composite files is not tunable.

Distributed mode uses Hadoop to import documents from multiple tasks, running on the nodes of a Hadoop cluster. A split is a unit of work for one Hadoop task.

Tuning Hadoop performance, including split size, is a complex topic outside the scope of this document. However, for best performance, tune split size to maximize parallelism across your Hadoop cluster, with each task taking at least a couple of minutes. If your split size is too small and each task only runs for a very short time, the task overhead can degrade overall performance.

In distributed mode, split size is determined by the following formula:

max(min_split_size, min(max_split_size, block_size))

The default min_split_size is 0 and the default max_split_size is Long.MAX (the maximum signed long integer). The block_size depends on your Hadoop configuration, but the default is 64M. You can configure the min and max split size using the mlcp options -min_split_size and -max_split_size. You can only tune block size through Hadoop configuration settings. For example, with the default minimum and maximum split sizes and a 64M block size, the formula yields max(0, min(Long.MAX, 64M)) = 64M.

In addition to balancing the workload across MapReduce tasks, you must also consider the load on MarkLogic Server. Too many concurrent sessions can overtax CPU and memory resources.

Reducing Memory Consumption With Streaming

The streaming protocol allows you to insert a large document into the database without holding the entire document in memory. Streaming uploads documents to MarkLogic Server in 128k chunks.

Streaming content into the database usually requires less memory on the host running mlcp, but ingestion can be slower because it introduces additional network overhead. Streaming also does not take advantage of mlcp's built-in retry mechanism. If an error occurs that is normally retryable, the job will fail.

Streaming is only usable when -input_file_type is documents. You cannot use streaming with delimited text files, sequence files, or archives.

To use streaming, enable the -streaming option. For example:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -username user -password password -host localhost \
    -port 8006 -input_file_path /my/dir -streaming

Controlling What is Exported or Copied

By default, mlcp exports all documents or all documents and metadata in the database, depending on whether you are exporting in document or archive format or copying the database. Several command line options are available to enable customization. This section covers the following topics:

Filtering Document Exports

This section covers options available for filtering what is exported by the mlcp export command when -output_type is document.

By default, mlcp exports all documents in the database. That is, mlcp exports the equivalent of fn:collection(). The following options allow you to filter what is exported:

  • -directory_filter - export only the documents in the listed database directories. You cannot use this option with -collection_filter.
  • -collection_filter - export only the documents in the listed collections. You cannot use this option with -directory_filter.
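
For example, the following command exports only the documents in two collections; the collection URIs are placeholders:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh export -host localhost -port 8006 -username user \
    -password password -mode local \
    -output_file_path /space/mlcp/export/files \
    -collection_filter 'plays,sonnets'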

Filtering Archive and Copy Contents

This section covers options available for controlling what is exported by mlcp export when -output_type is archive, or what is copied by mlcp copy.

By default, all documents and metadata are exported/copied. The following options allow you to modify this behavior:

  • -directory_filter - export/copy only the documents in the listed database directories, including related metadata. You cannot use this option with -collection_filter.
  • -collection_filter - export/copy only the documents in the listed collections, including related metadata. You cannot use this options with -directory_filter.
  • -copy_collections - whether to include collection metadata
  • -copy_permissions - whether to include permissions metadata
  • -copy_properties - whether to include naked and document properties
  • -copy_quality - whether to include document quality metadata

If you set all the -copy_* options to false when exporting to an archive, the archive contains no metadata. When you import an archive with no metadata, you must set -archive_metadata_optional to true.
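
For example, the following sketch creates a content-only archive by disabling all the metadata copying options; importing it later requires -archive_metadata_optional true:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh export -host localhost -port 8006 -username user \
    -password password -mode local -output_type archive \
    -output_file_path /space/examples/archive \
    -copy_collections false -copy_permissions false \
    -copy_properties false -copy_quality false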

Exporting Content from MarkLogic Server

You can export content in a MarkLogic Server database to files or an archive. Use archives to copy content from one MarkLogic Server database to another. Output can be written to the native filesystem or to HDFS.

For a list of export related command line options, see Export Options.

This section covers the following topics:

Potential Export Output Inconsistency

If consistency between the export output and the database contents is important, ensure no updates occur in the database during the export. Consistency is not guaranteed if there are updates to the database while the export is running.

For example, if a document changes after mlcp has exported a copy of it, then the export output will not match the database state. If the document changes after export begins but before the document is copied out, the new version is exported, not the version that was present when mlcp began running.

How URI Decoding Affects Output File Names

This discussion only applies when -output_type is document.

When you export a document to a file (or to a file in a compressed file), the output file name is based on the document URI. The document URI is decoded to form the file name. For example, if the document URI is 'foo%20bar.xml', then the output file name is 'foo bar.xml'.

If the document URI does not conform to the standard URI syntax of RFC 3986, decoding may fail, resulting in unexpected file names. For example, if the document URI contains unescaped special characters then the raw URI may be used.

If the document URI contains a scheme, the scheme is removed. If the URI contains both a scheme and an authority, both are removed. For example, if the document URI is 'file:foo/bar.xml', then the output file path is output_file_path/foo/bar.xml. If the document URI is 'http://marklogic.com/examples/bar.xml' (contains a scheme and an authority), then the output file path is output_file_path/examples/bar.xml.

If the document URI includes directory steps, then corresponding output subdirectories are created. For example, if the document URI is '/foo/bar.xml', then the output file path is output_file_path/foo/bar.xml.

Exporting Documents as Files

Use the mlcp export command to export documents in their original format as files on the native filesystem or HDFS. For example, you can export an XML document as a text file containing XML, or a binary document as a JPG image.

To export documents from a database as files:

  1. Select the files to export. For details, see Filtering Document Exports.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select all documents in the database, leave -collection_filter and -directory_filter unset.
  2. Set -output_file_path to the destination file or directory on the native filesystem or HDFS.
  3. To pretty-print exported XML when using local mode, set -indented to true.

Directory names specified with -directory_filter should end with '/'.

Document URIs are URI-decoded before filesystem directories or filenames are constructed for them. For details, see How URI Decoding Affects Output File Names.

For a full list of export options, see Export Options.

The following example exports selected documents in the database to the native filesystem directory /space/mlcp/export/files. The directory filter selects only the documents in /plays.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8006 -username user \
    -password password -mode local -output_file_path \
    /space/mlcp/export/files -directory_filter /plays/

Exporting Documents to a Compressed File

Use the mlcp export command to export documents in their original format as files in a compressed ZIP file on the native filesystem or HDFS.

To export documents from a database to a compressed ZIP file:

  1. Select the files to export. For details, see Filtering Document Exports.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select all documents in the database, leave -collection_filter and -directory_filter unset.
  2. Set -output_file_path to the destination directory on the native filesystem or HDFS. This directory must not already exist.
  3. Set -compress to true.
  4. To pretty-print exported XML when using local mode, set -indented to true.

For a full list of export options, see Export Options.

The zip files created by export have filenames of the form timestamp-seqnum.zip.

The following example exports all the documents in the database to the directory /space/examples/export on the native filesystem.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8006 -username user \
    -password password -mode local \
    -output_file_path /space/examples/export -compress true
$ ls /space/examples/export
20120823135307-0700-000000-XML.zip

Exporting to an Archive

Use the mlcp export command with an output type of archive to create a database archive that includes content and metadata. You can use the mlcp import command to copy the archive to another database or restore database contents.

To export database content to an archive file with mlcp:

  1. Select the documents to export. For details, see Filtering Archive and Copy Contents.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select all documents in the database, leave -collection_filter and -directory_filter unset.
  2. Set -output_file_path to the destination directory on the native filesystem or HDFS. This directory must not already exist.
  3. Set -output_type to archive.
  4. If you want to exclude some or all document metadata from the archive:
    • Set -copy_collections to false to exclude document collections metadata.
    • Set -copy_permissions to false to exclude document permissions metadata.
    • Set -copy_properties to false to exclude document properties.
    • Set -copy_quality to false to exclude document quality metadata.

For a full list of export options, see Export Options.

The following example exports all documents and metadata to the directory /space/examples/exported. After export, the directory contains one or more compressed archive files.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8006 -username user \
    -password password -mode local \
    -output_file_path /space/examples/exported -output_type archive

The following example exports only documents in the database directory /plays/, including their collections, properties, and quality, but excluding permissions:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8006 -username user \
    -password password -mode local \
    -output_file_path /space/examples/exported -output_type archive \
    -copy_permissions false -directory_filter /plays/

You can use the mlcp import command to import an archive into a database. For details, see Loading Content and Metadata From an Archive.

Copying Content Between Databases

Use the mlcp copy command to copy content and associated metadata from one MarkLogic Server database to another when both are reachable on the network.

To copy one database to another with mlcp:

  1. Set -input_host, -input_port, -input_username, and -input_password to identify the source MarkLogic Server instance and user.
  2. Set -output_host, -output_port, -output_username, and -output_password to identify the destination MarkLogic Server instance and user.
  3. Select what documents to copy. For details, see Filtering Archive and Copy Contents.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select all documents in the database, leave -collection_filter and -directory_filter unset.
  4. If you want to exclude some or all source document metadata:
    • Set -copy_collections to false to exclude document collections metadata.
    • Set -copy_permissions to false to exclude document permissions metadata.
    • Set -copy_properties to false to exclude document properties.
    • Set -copy_quality to false to exclude document quality metadata.
  5. If you want to add or override document metadata in the destination database:
    • Set -output_collections to add destination documents to a collection.
    • Set -output_permissions to add permissions to destination documents.
    • Set -output_quality to set the quality of destination documents.
  6. If you want the destination documents to have database URIs different from the source URIs, set -output_uri_replace, -output_uri_prefix, and/or -output_uri_suffix. For details, see Controlling Database URIs During Ingestion.

For a complete list of mlcp copy command options, see Copy Options.

The following example copies all documents and their metadata from the source database to the destination database:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh copy -mode local -input_host srchost -input_port 8006 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8010 -output_username user2 \
    -output_password password2

The following example copies selected documents, excluding the source permissions and adding the documents to two new collections in the destination database:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh copy -mode local -input_host srchost -input_port 8006 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8010 -output_username user2 \
    -output_password password2 -copy_permissions false \
    -output_collections shakespeare,plays
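
The following hypothetical example also rewrites document URIs during the copy, replacing the leading /plays with /drama in the destination database; the replacement pattern is illustrative only:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh copy -mode local -input_host srchost -input_port 8006 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8010 -output_username user2 \
    -output_password password2 \
    -output_uri_replace "/plays,'/drama'"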

Command Line Reference

This section covers the following topics:

Command Line Summary

The mlcp command line has the following structure:

  • Linux, Solaris, and OS X: mlcp.sh command options
  • Windows: mlcp.bat command options

Where command is one of the commands in the table below. Each command has a set of command-specific options.

Command Description
import Import data from the file system, the Hadoop Distributed File System (HDFS), or standard input to a MarkLogic Server database. See Import Options.
export Export data from a MarkLogic Server database to the file system or HDFS. See Export Options.
copy Copy data from one MarkLogic Server database to another. See Copy Options.
help Display brief help about mlcp. To get a list of the command line options for a particular command, use mlcp.sh command -help.

In addition to the command-specific options, mlcp enables you to pass additional settings to Hadoop MapReduce when using -mode distributed. This feature is for advanced users who are familiar with MapReduce. For details, see Setting Custom Hadoop Options and Properties.

If you use Hadoop-specific options such as -conf or -D, they must appear after -options_file (if present) and before any mlcp-specific options.
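
For example, the following hypothetical distributed-mode command shows the required ordering: -options_file first, then the Hadoop options, then the mlcp-specific options. The file names, HDFS path, and Hadoop property name are illustrative only, and conn.txt is assumed to contain the connection options:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -options_file conn.txt \
    -conf /space/hadoop/extra-conf.xml \
    -D mapreduce.job.name=mlcp-import \
    -mode distributed -input_file_path hdfs://namenode/data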

Options can also be specified in an options file using -options_file. Options files and command line options can be used together. For details, see Options File Syntax.

Note the following conventions for command line options to mlcp:

  • Prefix options with a single dash (-).
  • Option names are case-sensitive.
  • If an option has a value, separate the option name and value with whitespace. For example: mlcp import -username admin
  • If an option has a predefined set of possible values, such as -mode, the option values are case-insensitive unless otherwise noted.
  • If an option appears more than once on the command line, the first occurrence is used.
  • When string option values require quoting, use single quotes. For example: -output_uri_replace "this,'that '".
  • The value of a boolean typed option can be omitted. If the value is omitted, true is implied. For example, -copy_collections is equivalent to -copy_collections true.
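
As an illustration of these conventions, the following hypothetical import command (the paths are illustrative only) uses whitespace-separated values, single quotes inside a quoted value, and an omitted boolean value; -input_compressed alone is equivalent to -input_compressed true:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -input_file_path /space/examples/data.zip \
    -output_uri_replace "/space,''" -input_compressed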

Setting Java Virtual Machine (JVM) Options

The mlcp tool is a Java application. You can pass extra parameters to the JVM during an mlcp command using the environment variable JVM_OPTS.

For example, the following command passes the setting '-Xmx100M' to the JVM to increase the JVM heap size for a single mlcp run:

$ JVM_OPTS='-Xmx100M' mlcp.sh import ...

Import Options

This section summarizes the command line options available with the mlcp import command. For examples and use cases, see Importing Content Into MarkLogic Server.

The following command line options are required when using the import command:

Option Description
-host string Hostname of the destination MarkLogic Server.
-port number Port number of the destination MarkLogic Server. There should be an XDBC App Server on this port, listening on behalf of the target database. The App Server must not be SSL-enabled.
-username string MarkLogic Server user with which to import documents.
-password string Password for the MarkLogic Server user specified with -username.

The following table lists additional command line options:

Option Description
-aggregate_record_element string When splitting an aggregate input file into multiple documents, the name of the element to use as the output document root. Default: The first child element under the root element.
-aggregate_record_namespace string The namespace of the element specified by -aggregate_record_element. Default: No namespace.
-aggregate_uri_id string When splitting an aggregate input file into multiple documents, the element or attribute name within the document root to use as the document URI. Default: In local mode, hashcode-seqnum, where the hashcode is derived from the split number; in distributed mode, taskid-seqnum.
-batch_size number The number of documents to process in a single request to MarkLogic Server. Default: 100.
-conf filename Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-content_encoding string The character encoding of input documents when -input_file_type is documents, aggregates, or delimited_text. The option value must be a character set name accepted by your JVM; see java.nio.charset.Charset. Default: The platform default encoding for the host on which mlcp runs.
-copy_collections boolean When importing documents from an archive, whether to copy document collections from the source archive to the destination. Only applies with -input_file_type archive. Default: true.
-copy_permissions boolean When importing documents from an archive, whether to copy document permissions from the source archive to the destination. Only applies with -input_file_type archive. Default: true.
-copy_properties boolean When importing documents from an archive, whether to copy document properties from the source archive to the destination. Only applies with -input_file_type archive. Default: true.
-copy_quality boolean When importing documents from an archive, whether to copy document quality from the source archive to the destination. Only applies with -input_file_type archive. Default: true.
-D property=value Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-delimiter character When importing content with -input_file_type delimited_text, the delimiting character. Default: comma (,).
-delimited_uri_id string When importing content with -input_file_type delimited_text, the column name that contributes to the id portion of the URI for inserted documents. Default: The first column.
-document_type string The type of document to create when -input_file_type is documents or sequencefile. Accepted values: mixed (documents only), xml, text, binary. Default: mixed for documents, xml for sequencefile.
-fastload boolean Whether or not to force optimal performance, even at the risk of creating duplicate document URIs. See Time vs. Correctness: Using Direct Forest Updates. Default: false.
-hadoop_conf_dir string When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode.
-archive_metadata_optional boolean When importing documents from a database archive, whether or not to ignore missing metadata files. If this is false and the archive contains no metadata, an error occurs. Default: false.
-input_compressed boolean Whether or not the source data is compressed. Default: false.
-input_compression_codec string When -input_compressed is true, the codec used for compression. Accepted values: zip, gzip.
-input_file_path string A regular expression describing the filesystem location(s) to use for input. For details, see Regular Expression Syntax.
-input_file_pattern string Load only input files that match this regular expression from the path(s) matched by -input_file_path. For details, see Regular Expression Syntax. Default: Load all files.
-input_file_type type The input file type. Accepted values: aggregates, archive, delimited_text, documents, sequencefile. Default: documents.
-sequencefile_key_class string When importing Hadoop sequence files, the name of the Java class to use as the input key. Required when using sequence files.
-sequencefile_value_class string When importing Hadoop sequence files, the name of the Java class to use as the input value. Required when using sequence files.
-sequencefile_value_type string When importing Hadoop sequence files, the type of the value data returned by the class named by -sequencefile_value_class. Accepted values: Text, BytesWritable. (Values are case-insensitive). Default: Text.
-max_split_size number When importing from files, the maximum number of bytes in one input split. Default: The maximum Long value (Long.MAX_VALUE).
-min_split_size number When importing from files, the minimum number of bytes in one input split. Default: 0.
-mode string Ingestion mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local.
-namespace string The default namespace for all XML documents created during loading.
-options_file string Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_cleandir boolean Whether or not to delete all content in the output database directory prior to loading. Default: false.
-output_collections comma-list A comma separated list of collection URIs. Loaded documents are added to these collections.
-output_directory string The destination database directory in which to create the loaded documents. Using this option enables -fastload by default, which can cause duplicate URIs to be created. See Time vs. Correctness: Using Direct Forest Updates.
-filename_as_collection boolean Add each loaded document to a collection corresponding to the name of the input file. Useful when splitting an input file into multiple documents. If the filename contains characters not permitted in a URI, those characters are URI encoded. Default: false.
-output_language string The xml:lang to associate with loaded documents.
-output_permissions comma-list A comma separated list of (role,capability) pairs to apply to loaded documents. Default: The default permissions associated with the user inserting the document. Example: -output_permissions role1,read,role2,update
-output_quality string The quality of loaded documents. Default: 0.
-output_uri_prefix string URI prefix to the id specified by -output_idname. Used to construct output document URIs.
-output_uri_replace comma-list A comma separated list of (regex,string) pairs that define string replacements to apply to the URIs of documents added to the database. The replacement strings must be enclosed in single quotes. For example, -output_uri_replace "regex1,'string1',regex2,'string2'"
-output_uri_suffix string URI suffix to the id specified by -output_idname. Used to construct output document URIs.
-streaming boolean Whether or not to stream documents to MarkLogic Server. Applies only when -input_file_type is documents.
-thread_count number The number of threads to spawn for concurrent loading. Only available in local mode. Default: 4.
-tolerate_errors boolean Applicable only when -batch_size is greater than 1. When this option is true and batch size is greater than 1, if an error occurs for one or more documents during loading, only the erroneous documents are skipped; all other documents are inserted into the database. When this option is false, an error during insertion can cause all the inserts in the current batch to be rolled back. Default: false.
-transaction_size number The number of requests to MarkLogic Server per transaction. Default: 10.
-xml_repair_level string The degree of repair to attempt on XML documents in order to create well-formed XML. Accepted values: default, full, none. Default: default, which depends on the configured MarkLogic Server default XQuery version: In XQuery 1.0 and 1.0-ml the default is none. In XQuery 0.9-ml the default is full.
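
For example, the following hypothetical command combines several of the options above to import a delimited text file, assuming a file /space/data/people.csv whose first column is named id; the id value contributes the id portion of each document URI, framed by the prefix and suffix:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local \
    -input_file_path /space/data/people.csv \
    -input_file_type delimited_text -delimited_uri_id id \
    -output_uri_prefix /people/ -output_uri_suffix .xml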

Export Options

This section summarizes the command line options available with the mlcp export command. For examples and use cases, see Exporting Content from MarkLogic Server.

The following command line options are required when using the export command:

Option Description
-host string Hostname of the source MarkLogic Server.
-port number Port number of the source MarkLogic Server. There should be an XDBC App Server on this port, listening on behalf of the target database. The App Server must not be SSL-enabled.
-username string MarkLogic Server user with which to export documents.
-password string Password for the MarkLogic Server user specified with -username.

The following table lists optional command line options:

Option Description
-collection_filter comma-list A comma-separated list of collection URIs. mlcp exports only documents in these collections, plus related metadata. This option may not be combined with -directory_filter. Default: All documents and related metadata.
-compress boolean Whether or not to compress the output document. Only applicable when -output_type is document. Default: false.
-conf filename Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-copy_collections boolean When exporting documents to an archive, whether or not to copy collections to the destination. Default: true.
-copy_permissions boolean When exporting documents to an archive, whether or not to copy document permissions to the destination. Default: true.
-copy_properties boolean When exporting documents to an archive, whether or not to copy properties to the destination. Default: true.
-copy_quality boolean When exporting documents to an archive, whether or not to copy document quality to the destination. Default: true.
-D property=value Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-directory_filter comma-list A comma-separated list of database directory names. mlcp exports only documents from these directories, plus related metadata. Directory names should usually end with '/'. This option may not be combined with -collection_filter. Default: All documents and related metadata.
-hadoop_conf_dir string When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode.
-indented boolean Whether to pretty-print XML output. Default: false.
-max_split_size number The maximum number of document fragments processed per split. Only applicable in distributed mode. Default: 50000.
-mode string Export mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local.
-options_file string Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_file_path string Destination directory where the archive or documents are saved. The directory must not already exist.
-output_type string The type of output to produce. Accepted values: document, archive. Default: document.
-thread_count number The number of threads to spawn for concurrent exporting. Only available in local mode. Default: 4.

Copy Options

This section summarizes the command line options available with the mlcp copy command. For examples and use cases, see Copying Content Between Databases.

The following command line options are required when using the copy command:

Option Description
-input_host string Hostname of the source MarkLogic Server.
-input_port number Port number of the source MarkLogic Server. There should be an XDBC App Server on this port, listening on behalf of the source database. The App Server must not be SSL-enabled.
-input_username string MarkLogic Server user with which to export documents.
-input_password string Password for the MarkLogic Server user specified with -input_username.
-output_host string Hostname of the destination MarkLogic Server.
-output_port number Port number of the destination MarkLogic Server. There should be an XDBC App Server on this port, listening on behalf of the destination database. The App Server must not be SSL-enabled.
-output_username string MarkLogic Server user with which to import documents to the destination.
-output_password string Password for the MarkLogic Server user specified with -output_username.

The following table lists optional command line options:

Option Description
-batch_size number The number of documents to load per request to MarkLogic Server. Default: 100.
-collection_filter comma-list A comma-separated list of collection URIs. mlcp copies only documents in these collections, plus related metadata. This option may not be combined with -directory_filter. Default: All documents and related metadata.
-conf filename Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-copy_collections boolean Whether to copy document collections from the source database to the destination database. Default: true.
-copy_permissions boolean Whether to copy document permissions from the source database to the destination database. Default: true.
-copy_properties boolean Whether to copy document properties from the source database to the destination database. Default: true.
-copy_quality boolean Whether to copy document quality from the source database to the destination database. Default: true.
-D property=value Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-directory_filter comma-list A comma-separated list of database directories. mlcp copies only documents from these directories, plus related metadata. Directory names should usually end with '/'. This option may not be combined with -collection_filter. Default: All documents and related metadata.
-fastload boolean Whether or not to force optimal performance, even at the risk of creating duplicate document URIs. See Time vs. Correctness: Using Direct Forest Updates. Default: false.
-hadoop_conf_dir string When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode.
-max_split_size number The maximum number of document fragments processed per split. Only applicable in distributed mode. Default: 50000.
-mode string Copy mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local.
-options_file string Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_collections comma-list A comma separated list of collection URIs. Output documents are added to these collections.
-output_permissions comma-list A comma separated list of (role,capability) pairs to apply to loaded documents. Default: The default permissions associated with the user inserting the document. Example: -output_permissions role1,read,role2,update
-output_quality string The quality to assign to output documents.
-output_uri_prefix string URI prefix to the id specified by -output_idname. Used to construct output document URIs.
-output_uri_replace comma-list A comma separated list of (regex,string) pairs that define string replacements to apply to the URIs of documents added to the database. The replacement strings must be enclosed in single quotes. For example, -output_uri_replace "regex1,'string1',regex2,'string2'"
-output_uri_suffix string URI suffix to the id specified by -output_idname. Used to construct output document URIs.
-thread_count number The number of threads to spawn for concurrent copying. Only available in local mode. Default: 4.
-transaction_size number When loading documents into the destination database, the number of requests to MarkLogic Server in one transaction. Default: 10.

Regular Expression Syntax

For -input_file_path, use the regular expression syntax outlined here:

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)

For all other options that use regular expressions, such as -input_file_pattern, use the Java regular expression language. Java's pattern language is similar to the Perl pattern language. For details on the grammar, see the documentation for the Java class java.util.regex.Pattern:

http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

For a tutorial on the expression language, see http://docs.oracle.com/javase/tutorial/essential/regex/.
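
For example, the following hypothetical command (the paths are illustrative only) imports only files ending in .xml from under /space/data, using a Java regular expression for the file name pattern:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username user \
    -password password -mode local \
    -input_file_path /space/data \
    -input_file_pattern '.*\.xml'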

Options File Syntax

You can specify mlcp options in an options file, in addition to specifying them on the command line, using -options_file. If you use an options file, it must be the first option on the command line. The mlcp command (import, export, or copy) can also go inside the options file. For example:

$ mlcp.sh -options_file my_options.txt -input_file_path /example

An options file has the following contents:

  • Each line contains either a command name, an option, or an option value, ordered as they would appear on the command line.
  • Comments begin with '#' and must be on a line by themselves.
  • Blank lines, leading whitespace, and trailing whitespace are ignored.

For example, if you frequently use the same MarkLogic Server connection information (host, port, username, and password), you can put this information into an options file:

$ cat my-conn.txt
# my connection info
-host 
localhost
-port 
8006
-username 
me
-password
my_password
# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -options_file my-conn.txt \
    -input_file_path /space/examples/all.zip

This is equivalent to the following command line without an options file:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8006 -username me \
    -password my_password -input_file_path /space/examples/all.zip

You can also include a command name (import, export, or copy) as the first non-comment line in an options file:

# my connection info for import
import
-host 
localhost
-port 
8006
-username 
me
-password
my_password
