MarkLogic 9 Product Documentation
mlcp User Guide
— Chapter 1

Introduction to MarkLogic Content Pump

MarkLogic Content Pump (mlcp) is a command line tool for getting data into and out of a MarkLogic Server database.

Feature Overview

Using mlcp, you can import documents and metadata to a database, export documents and metadata from a database, or copy documents and metadata from one database to another.

The mlcp tool has two modes of operation:

  • Local: mlcp drives all its work on the host where it is invoked. Resources such as import data and export destination must be reachable from that host.
  • Distributed: mlcp distributes its workloads across the nodes in a Hadoop cluster. Resources such as import data and export destination must be reachable from the cluster, which usually means via HDFS.

Local mode is the default unless you configure your environment or mlcp command line as described in Configuring Distributed Mode. Distributed mode requires a Hadoop installation.

To understand the difference between the two modes, consider the following: When loading documents in local mode, all the input data must be reachable from the host on which you run mlcp, and all communication with MarkLogic Server is through that host. Throughput is limited by resources such as memory and network bandwidth available to the host running mlcp. When loading documents in distributed mode, multiple nodes in a Hadoop cluster communicate with MarkLogic Server, so greater concurrency can be achieved while placing fewer resource demands on any one host.

You can use mlcp even when a load balancer sits between the client host and the MarkLogic host. The mlcp tool is compatible with AWS Elastic Load Balancer (ELB) and other load balancers.

Terms and Definitions

You should be familiar with the following terms and definitions when using mlcp:

Term | Definition
aggregate | XML content that includes recurring element names and which can be split into multiple documents with the recurring element as the document root. For details, see Splitting Large XML Files Into Multiple Documents.
line-delimited JSON | A type of aggregate input where each line in the file is a piece of standalone JSON content. For details, see Creating Documents from Line-Delimited JSON Files.
archive | A compressed MarkLogic Server database archive created using the mlcp export command. You can use an archive to restore or copy database content and metadata with the mlcp import command. For details, see Exporting to an Archive.
HDFS | The Hadoop Distributed File System, which can be used as an input source or an output destination in distributed mode.
sequence file | A flat file of binary key-value pairs in one of the Apache Hadoop SequenceFile formats. The mlcp tool only supports importing Text and BytesWritable values from a sequence file.
split | The unit of work for one thread in local mode or one MapReduce task in distributed mode.

Modifying the Example Commands for Windows

All the examples in this guide use Unix command line syntax. If you are using mlcp with the Windows command interpreter, Cmd.exe, use the following guidelines to construct equivalent commands:

  • Replace mlcp.sh with mlcp.bat. You should always use mlcp.bat on Windows; using mlcp.sh with Cygwin is not supported.
  • For aesthetic reasons, long example command lines are broken into multiple lines using the Unix line continuation character \. On Windows, remove the line continuation characters and place the entire command on one line, or replace the line continuation characters with the Windows equivalent, ^.
  • Replace option arguments enclosed in single quotes (') with double quotes ("). If the single-quoted string contains embedded double quotes, escape the inner quotes.
  • Escape any unescaped characters that have special meaning to the Windows command interpreter.

For example, the following Unix command line:

$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -input_file_path /space/bill/data -mode local \
    -output_uri_replace "/space,'',/bill/data/,'/will/'" \
    -output_uri_prefix /plays

corresponds to this Windows command line:

C:\Example> mlcp.bat import -host localhost -port 8000 -username user ^
    -password passwd -input_file_path c:\space\bill -mode local ^
    -output_uri_replace "/c:/space,'',/bill/data/,'/will/'" ^
    -output_uri_prefix /plays
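
The two purely mechanical guidelines above (the script name and the line continuation character) can be applied with a small filter. The following is only a sketch: to_windows_cmd is a hypothetical helper, not part of mlcp, and it does not attempt the quoting, escaping, or path changes, which still need to be reviewed by hand.

```shell
# Hypothetical helper, not part of mlcp: rewrite a Unix example command read
# from standard input using the two mechanical guidelines above.
to_windows_cmd() {
    # 1. Replace the script name mlcp.sh with mlcp.bat.
    # 2. Replace a trailing Unix continuation character (\) with ^.
    sed -e 's/mlcp\.sh/mlcp.bat/g' \
        -e 's/\\$/^/'
}

# Demo: the quoted heredoc delimiter keeps the backslashes literal.
to_windows_cmd <<'EOF'
mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -mode local \
    -output_uri_prefix /plays
EOF
```

Options whose values contain single quotes, file system paths, or characters that are special to Cmd.exe are deliberately left untouched by this sketch.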

Understanding the mlcp Command Line

This section covers key concepts and tasks related to the mlcp command line.

Command Line Summary

The mlcp command line has the following structure. Note that you should always use mlcp.bat on Windows; using mlcp.sh with Cygwin is not supported.

  • Linux and OS X: mlcp.sh command options
  • Windows: mlcp.bat command options

Where command is one of the commands in the table below. Each command has a set of command-specific options, which are covered in the chapter that discusses the command.

Command | Description
import | Import data from the file system, the Hadoop Distributed File System (HDFS), or standard input to a MarkLogic Server database. For a list of options usable with this command, see Import Command Line Options.
export | Export data from a MarkLogic Server database to the file system or HDFS. For a list of options usable with this command, see Export Command Line Options.
copy | Copy data from one MarkLogic Server database to another. For a list of options usable with this command, see Copy Command Line Options.
extract | Use Direct Access to extract files from a forest file to documents on the native file system or HDFS. For a list of options usable with this command, see Extract Command Line Options.
version | Report mlcp runtime environment version information, including the mlcp, JRE, and Hadoop versions, as well as the supported MarkLogic version.
help | Display brief help about mlcp.

In addition to the command-specific options, mlcp enables you to pass additional settings to Hadoop MapReduce when using -mode distributed. This feature is for advanced users who are familiar with MapReduce. For details, see Setting Custom Hadoop Options and Properties.

If you use Hadoop-specific options such as -conf or -D, they must appear after -options_file (if present) and before any mlcp-specific options.

Options can also be specified in an options file using -options_file. Options files and command line options can be used together. For details, see Options File Syntax.

Note the following conventions for command line options to mlcp:

  • Prefix options with a single dash (-).
  • Option names are case-sensitive.
  • If an option has a value, separate the option name and value with whitespace. For example: mlcp import -username admin
  • If an option has a predefined set of possible values, such as -mode, the option values are case-insensitive unless otherwise noted.
  • If an option appears more than once on the command line, the first occurrence is used.
  • When string option values require quoting, use single quotes. For example: -output_uri_replace "this,'that '".
  • The value of a boolean typed option can be omitted. If the value is omitted, true is implied. For example, -copy_collections is equivalent to -copy_collections true.
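
Three of the conventions above (case-sensitive option names, first occurrence wins, and an omitted boolean value implying true) can be illustrated with a short shell function. This is a sketch of the rules as stated in the list, not mlcp's actual parser: effective_option is a hypothetical helper, and treating any following token that starts with a dash as "value omitted" is a simplifying assumption.

```shell
# Hypothetical helper, not part of mlcp: print the effective value of the
# option named in $1, given the remaining arguments as a command line.
effective_option() {
    name=$1
    shift
    while [ $# -gt 0 ]; do
        # Exact string comparison, so option names are case-sensitive.
        if [ "$1" = "$name" ]; then
            # Assumption: if nothing follows, or the next token looks like
            # another option, the value was omitted and true is implied.
            case "${2-}" in
                ''|-*) echo true ;;
                *)     printf '%s\n' "$2" ;;
            esac
            # Stop at the first occurrence; later duplicates are ignored.
            return 0
        fi
        shift
    done
    # The option does not appear on the command line.
    return 1
}

# -mode appears twice; the first occurrence is reported.
effective_option -mode import -mode local -mode distributed

# -copy_collections has no value, so true is reported.
effective_option -copy_collections copy -copy_collections -output_uri_prefix /plays
```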

Setting Java Virtual Machine (JVM) Options

The mlcp tool is a Java application. You can pass extra parameters to the JVM during an mlcp command using the environment variable JVM_OPTS.

For example, the following command passes the setting -Xmx100M to the JVM to increase the JVM heap size for a single mlcp run:

$ JVM_OPTS='-Xmx100M' mlcp.sh import ...
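
The difference between setting JVM_OPTS for a single run and setting it for the whole shell session comes from standard shell behavior, not from mlcp. The sketch below uses sh -c as a stand-in for mlcp.sh so that it runs without an mlcp installation; the stand-in only prints the JVM_OPTS value it receives.

```shell
# Start from a clean state so the demo does not depend on the caller's environment.
unset JVM_OPTS

# Prefix assignment: JVM_OPTS is set for this one command only.
JVM_OPTS='-Xmx100M' sh -c 'echo "single run: ${JVM_OPTS:-<unset>}"'

# The calling shell is unaffected, so the next command sees no JVM_OPTS.
sh -c 'echo "next run: ${JVM_OPTS:-<unset>}"'

# export applies the setting to every later command in this shell session.
export JVM_OPTS='-Xmx100M'
sh -c 'echo "after export: ${JVM_OPTS:-<unset>}"'
```

The prefix form is usually the safer choice when only one job needs different JVM settings.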

Regular Expression Syntax

For -input_file_path, use the regular expression syntax outlined here:

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)

For all other options that use regular expressions, such as -input_file_pattern, use the Java regular expression language. Java's pattern language is similar to the Perl pattern language. For details on the grammar, see the documentation for the Java class java.util.regex.Pattern:

http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

For a tutorial on the expression language, see http://docs.oracle.com/javase/tutorial/essential/regex/.

Options File Syntax

In addition to specifying options on the command line, you can specify mlcp options in an options file by using -options_file. Using an options file is especially convenient when working with options whose values contain quotes and other special characters that are difficult to escape on the command line.

If you use an options file, it must be the first option on the command line. The mlcp command (import, export, copy) can also go inside the options file. For example:

$ mlcp.sh -options_file my_options.txt -input_file_path /example

An options file has the following contents:

  • Each line contains either a command name, an option, or an option value, ordered as they would appear on the command line.
  • Comments begin with # and must be on a line by themselves.
  • Blank lines, leading whitespace, and trailing whitespace are ignored.

For example, if you frequently use the same MarkLogic Server connection information (host, port, username, and password), you can put this information into an options file:

$ cat my-conn.txt
# my connection info
-host 
localhost
-port 
8000
-username 
me
-password
my_password
# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -options_file my-conn.txt \
    -input_file_path /space/examples/all.zip

This is equivalent to the following command line without an options file:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username me \
    -password my_password -input_file_path /space/examples/all.zip

You can also include a command name (import, export, or copy) as the first non-comment line in an options file:

# my connection info for import
import
-host 
localhost
-port 
8000
-username 
me
-password
my_password
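
The options file rules in this section (one token per line, whole-line comments, ignored blank lines and surrounding whitespace) can be approximated with a short filter, which is handy for checking what command line an options file is equivalent to. This is a sketch of the rules as described here, not mlcp's own parser; options_file_tokens is a hypothetical helper.

```shell
# Hypothetical helper, not part of mlcp: print the tokens of an options file,
# one per line, in the order they would appear on the command line.
options_file_tokens() {
    # Trim leading and trailing whitespace, then drop blank lines and
    # lines that consist only of a comment.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e '/^$/d' -e '/^#/d' "$1"
}

# Demo with a temporary options file similar to the one shown above.
opts=$(mktemp)
cat > "$opts" <<'EOF'
# my connection info for import
import

-host
localhost
   -port
8000
EOF

# Join the tokens with spaces to see the equivalent command line arguments.
options_file_tokens "$opts" | paste -s -d ' ' -
rm -f "$opts"
```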

mlcp Exit Status Codes

When mlcp exits, it returns one of the following status codes:

Exit Code | Meaning
0 | Successful completion.
-1 | The job is still running.
1 | The job failed.
2 | The job is in the preparation state.
3 | The job was terminated prematurely.
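
A wrapper script can translate the status code into the meanings from the table above. The following is a minimal sketch: describe_mlcp_status is a hypothetical helper, and sh -c 'exit 3' stands in for a real mlcp.sh invocation. Note that Unix shells report exit statuses as values from 0 to 255, so a process that exits with -1 is seen by the shell as 255; the sketch accepts both spellings.

```shell
# Hypothetical helper, not part of mlcp: describe an mlcp exit status.
describe_mlcp_status() {
    case "$1" in
        0)      echo "Successful completion." ;;
        # An exit code of -1 is reported by Unix shells as 255.
        -1|255) echo "The job is still running." ;;
        1)      echo "The job failed." ;;
        2)      echo "The job is in the preparation state." ;;
        3)      echo "The job was terminated prematurely." ;;
        *)      echo "Unknown status: $1" ;;
    esac
}

# Stand-in for an mlcp run; replace with the real mlcp.sh command line.
status=0
sh -c 'exit 3' || status=$?
describe_mlcp_status "$status"
```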

Compatibility of mlcp Across MarkLogic Versions

Unless otherwise noted, mlcp is compatible with a wide range of MarkLogic versions. That is, you can usually use a recent version of mlcp with an older version of MarkLogic and vice versa. However, not all features of mlcp or MarkLogic will work across version boundaries.

For example, MarkLogic 9 and mlcp 9.0 include support for redacting documents as you export them. However, older versions of MarkLogic do not support this feature, so it is not possible to use the -redaction option of mlcp with older versions.

Similarly, you can use mlcp to export a database archive from MarkLogic 9 or later that includes documents with the node-update security capability. However, this capability did not exist in earlier versions of MarkLogic, so it cannot be preserved if you import the MarkLogic 9 archive into an older MarkLogic, and may even cause errors.

For best results, use the version of mlcp that corresponds to your version of MarkLogic, or limit your jobs to features you know are supported in both.

Accessing the mlcp Source Code

The mlcp tool is developed and maintained as an open source project on GitHub. To access the sources or contribute to the project, navigate to the following URL in your browser:

http://github.com/marklogic/marklogic-contentpump
