MarkLogic Content Pump (mlcp) is a command line tool for getting data into and out of a MarkLogic Server database.
Using mlcp, you can import documents and metadata to a database, export documents and metadata from a database, or copy documents and metadata from one database to another.
The mlcp tool has two modes of operation: local and distributed.
Local mode is the default unless you configure your environment or mlcp command line as described in Configuring Distributed Mode. Distributed mode requires a Hadoop installation.
To understand the difference between the two modes, consider the following: When loading documents in local mode, all the input data must be reachable from the host on which you run mlcp, and all communication with MarkLogic Server is through that host. Throughput is limited by resources such as memory and network bandwidth available to the host running mlcp. When loading documents in distributed mode, multiple nodes in a Hadoop cluster communicate with MarkLogic Server, so greater concurrency can be achieved while placing fewer resource demands on any one host.
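As a minimal sketch of how the mode is chosen on the command line, the snippet below prints (without running) the same import command once with -mode local and once with -mode distributed. The helper name mlcp_import_cmd is hypothetical, and the host, port, credentials, and input path are placeholders. Distributed mode additionally requires a Hadoop installation, and where the input actually lives depends on your setup.

```shell
# Print (without running) an mlcp import command for the given mode.
# mlcp_import_cmd is a hypothetical helper; all connection values are placeholders.
mlcp_import_cmd() {
  mode="$1"   # expected: local or distributed
  echo "mlcp.sh import -host localhost -port 8000 -username user -password passwd -input_file_path /space/bill/data -mode $mode"
}

mlcp_import_cmd local
mlcp_import_cmd distributed
```

Only the -mode value differs between the two commands; the rest of the command line is unchanged.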
You can use mlcp even when a load balancer sits between the client host and the MarkLogic host. The mlcp tool is compatible with AWS Elastic Load Balancer (ELB) and other load balancers.
You should be familiar with the following terms and definitions when using mlcp:
Term | Definition |
---|---|
aggregate | XML content that includes recurring element names and which can be split into multiple documents with the recurring element as the document root. For details, see Splitting Large XML Files Into Multiple Documents. |
line-delimited JSON | A type of aggregate input where each line in the file is a piece of standalone JSON content. For details, see Creating Documents from Line-Delimited JSON Files. |
archive | A compressed MarkLogic Server database archive created using the mlcp export command. You can use an archive to restore or copy database content and metadata with the mlcp import command. For details, see Exporting to an Archive. |
HDFS | The Hadoop Distributed File System, which can be used as an input source or an output destination in distributed mode. |
sequence file | A flat file of binary key-value pairs in one of the Apache Hadoop SequenceFile formats. The mlcp tool only supports importing Text and BytesWritable values from a sequence file. |
split | The unit of work for one thread in local mode or one MapReduce task in distributed mode. |
All the examples in this guide use Unix command line syntax. If you are using mlcp with the Windows command interpreter, Cmd.exe, use the following guidelines to construct equivalent commands:
- Replace mlcp.sh with mlcp.bat. You should always use mlcp.bat on Windows; using mlcp.sh with Cygwin is not supported.
- Replace the Unix line continuation character (\) with the Windows line continuation character (^).

For example, the following Unix command line:
    $ mlcp.sh import -host localhost -port 8000 -username user \
        -password passwd -input_file_path /space/bill/data -mode local \
        -output_uri_replace "/space,'',/bill/data/,'/will/'" \
        -output_uri_prefix /plays
Corresponds to this Windows command line:
    C:\Example> mlcp.bat import -host localhost -port 8000 -username user ^
        -password passwd -input_file_path c:\space\bill -mode local ^
        -output_uri_replace "/c:/space,'',/bill/data/,'/will/'" ^
        -output_uri_prefix /plays
This section covers the following key concepts and tasks related to the mlcp command line:
The mlcp command line has the following structure. Note that you should always use mlcp.bat on Windows; using mlcp.sh with Cygwin is not supported.

    mlcp.sh command options

Where command is one of the commands in the table below. Each command has a set of command-specific options, which are covered in the chapter that discusses the command.
Command | Description |
---|---|
import | Import data from the file system, the Hadoop Distributed File System (HDFS), or standard input to a MarkLogic Server database. For a list of options usable with this command, see Import Command Line Options. |
export | Export data from a MarkLogic Server database to the file system or HDFS. For a list of options usable with this command, see Export Command Line Options. |
copy | Copy data from one MarkLogic Server database to another. For a list of options usable with this command, see Copy Command Line Options. |
extract | Use Direct Access to extract files from a forest file to documents on the native file system or HDFS. For a list of options usable with this command, see Extract Command Line Options. |
version | Report mlcp runtime environment version information, including the mlcp, JRE, and Hadoop versions, as well as the supported MarkLogic version. |
help | Display brief help about mlcp. |
In addition to the command-specific options, mlcp enables you to pass additional settings to Hadoop MapReduce when using -mode distributed. This feature is for advanced users who are familiar with MapReduce. For details, see Setting Custom Hadoop Options and Properties.
If you use Hadoop-specific options such as -conf or -D, they must appear after -options_file (if present) and before any mlcp-specific options.
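The ordering rule can be sketched as follows. This snippet only assembles and prints a command line; it does not run mlcp. The options file name, the Hadoop configuration file name, and the property passed with -D are hypothetical placeholders, not values taken from this guide.

```shell
# Assemble an mlcp command line in the required order:
#   command, -options_file (if any), Hadoop options, then mlcp options.
# Nothing is executed; the file names and the -D property are hypothetical.
options_file="my_options.txt"
hadoop_opts="-conf my_hadoop_conf.xml -D example.property=value"
mlcp_opts="-mode distributed -input_file_path /space/examples"

cmd="mlcp.sh import"
if [ -n "$options_file" ]; then
  # The options file, when present, comes before everything else.
  cmd="$cmd -options_file $options_file"
fi
# Hadoop-specific options precede the mlcp-specific options.
cmd="$cmd $hadoop_opts $mlcp_opts"
echo "$cmd"
```

Setting options_file to an empty string drops the -options_file argument while keeping the Hadoop options ahead of the mlcp options.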
Options can also be specified in an options file using -options_file. Options files and command line options can be used together. For details, see Options File Syntax.
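Note the following conventions for command line options to mlcp:
- Separate an option name from its value with whitespace. For example: mlcp import -username admin
- If an option has a predefined set of possible values, such as -mode, the option values are case-insensitive unless otherwise noted.
- When a string inside an option value requires quoting, use single quotes. For example: -output_uri_replace "this,'that '".
- The value of a boolean option can be omitted, in which case true is implied. For example, -copy_collections is equivalent to -copy_collections true.

The mlcp tool is a Java application. You can pass extra parameters to the JVM during an mlcp command using the environment variable JVM_OPTS.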
For example, the following command passes the setting -Xmx100M to the JVM to set the maximum JVM heap size for a single mlcp run:
$ JVM_OPTS='-Xmx100M' mlcp.sh import ...
For -input_file_path, use the regular expression syntax outlined here:
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
For all other options that use regular expressions, such as -input_file_pattern, use the Java regular expression language. Java's pattern language is similar to the Perl pattern language. For details on the grammar, see the documentation for the Java class java.util.regex.Pattern:
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
For a tutorial on the expression language, see http://docs.oracle.com/javase/tutorial/essential/regex/.
In addition to passing options on the command line, you can specify mlcp options in an options file, which you name with -options_file. Using an options file is especially convenient when working with options whose values contain quotes and other special characters that are difficult to escape on the command line.
If you use an options file, it must be the first option on the command line. The mlcp command (import, export, copy) can also go inside the options file. For example:
$ mlcp.sh -options_file my_options.txt -input_file_path /example
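An options file has the following contents:
- Each line contains either a command name, an option, or an option value, ordered as they would appear on the command line.
- Comments begin with # and must be on a line by themselves.
- Blank lines, leading white space, and trailing white space are ignored.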
For example, if you frequently use the same MarkLogic Server connection information (host, port, username, and password), you can put this information into an options file:
    $ cat my-conn.txt
    # my connection info
    -host
    localhost
    -port
    8000
    -username
    me
    -password
    my_password

    # Windows users, see Modifying the Example Commands for Windows
    $ mlcp.sh import -options_file my-conn.txt \
        -input_file_path /space/examples/all.zip
This is equivalent to the following command line without an options file:
    # Windows users, see Modifying the Example Commands for Windows
    $ mlcp.sh import -host localhost -port 8000 -username me \
        -password my_password -input_file_path /space/examples/all.zip
You can also include a command name (import, export, or copy) as the first non-comment line in an options file:
    # my connection info for import
    import
    -host
    localhost
    -port
    8000
    -username
    me
    -password
    my_password
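The following sketch puts these pieces together: it writes an options file that starts with the command name, then combines it with one option given on the command line. It assumes that each command name, option, and option value sits on its own line in the file. The file name my-import.txt is hypothetical, the connection values are the placeholders used above, and the chmod step is an added precaution rather than something mlcp requires. The import runs only if mlcp.sh is found on the PATH; otherwise the command is just printed.

```shell
# Write an options file with the command name first, one token per line.
# The file name is hypothetical; connection values are placeholders.
opts_file="my-import.txt"
cat > "$opts_file" <<'EOF'
# my connection info for import
import
-host
localhost
-port
8000
-username
me
-password
my_password
EOF

# Added precaution (not an mlcp requirement): the file holds a password.
chmod 600 "$opts_file"

# -options_file comes first; remaining options follow on the command line.
if command -v mlcp.sh >/dev/null 2>&1; then
  mlcp.sh -options_file "$opts_file" -input_file_path /space/examples/all.zip
else
  echo "mlcp.sh not found; would run: mlcp.sh -options_file $opts_file -input_file_path /space/examples/all.zip"
fi
```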
When mlcp exits, it returns one of the following status codes:
Exit Code | Meaning |
---|---|
0 | Successful completion. |
-1 | The job is still running. |
1 | The job failed. |
2 | The job is in the preparation state. |
3 | The job was terminated prematurely. |
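A wrapper script can branch on these codes. The sketch below maps a status value to the meaning listed in the table; the function name describe_mlcp_exit is hypothetical. One assumption to note: Unix-like shells report exit statuses as values from 0 to 255, so an exit code of -1 is seen as 255 in $?, and the function accepts both spellings.

```shell
# Translate an mlcp exit status into the meaning from the exit code table.
# describe_mlcp_exit is a hypothetical helper name.
# Assumption: on Unix-like shells an exit code of -1 appears as 255 in $?.
describe_mlcp_exit() {
  case "$1" in
    0)      echo "Successful completion." ;;
    255|-1) echo "The job is still running." ;;
    1)      echo "The job failed." ;;
    2)      echo "The job is in the preparation state." ;;
    3)      echo "The job was terminated prematurely." ;;
    *)      echo "Unrecognized exit code: $1" ;;
  esac
}

# Usage sketch (the import arguments are placeholders):
#   mlcp.sh import -options_file my-conn.txt -input_file_path /space/examples
#   describe_mlcp_exit "$?"
describe_mlcp_exit 3
```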
Unless otherwise noted, mlcp is compatible with a wide range of MarkLogic versions. That is, you can usually use a recent version of mlcp with an older version of MarkLogic, and vice versa. However, not all features of mlcp or MarkLogic will work across version boundaries.
For example, MarkLogic 9 and mlcp 9.0 include support for redacting documents as you export them. However, older versions of MarkLogic do not support this feature, so it is not possible to use the -redaction option of mlcp with older versions.
Similarly, you can use mlcp to export a database archive from MarkLogic 9 or later that includes documents with the node-update security capability. However, this capability did not exist in earlier versions of MarkLogic, so it cannot be preserved if you import the MarkLogic 9 archive into an older MarkLogic, and may even cause errors.
For best results, use the version of mlcp that corresponds to your version of MarkLogic, or limit your jobs to features you know are supported in both.
The mlcp tool is developed and maintained as an open source project on GitHub. To access the sources or contribute to the project, navigate to the following URL in your browser:
http://github.com/marklogic/marklogic-contentpump