MarkLogic 9 Product Documentation
mlcp User Guide
— Chapter 6

Copying Content Between Databases

Use the mlcp copy command to copy content and associated metadata from one MarkLogic Server database to another when both are reachable on the network. You can also copy data from offline forests to a MarkLogic Server database; for details, see Using Direct Access to Extract or Copy Documents.

Basic Steps

To copy one database to another with mlcp:

  1. Set -input_host, -input_port, -input_username, and -input_password to identify the source MarkLogic Server instance and user.
  2. Set -output_host, -output_port, -output_username, and -output_password to identify the destination MarkLogic Server instance and user.
  3. Select what documents to copy. For details, see Filtering Archive and Copy Contents.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select documents matching an XPath expression, use -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
    • To select documents matching a query, use -query_filter. You can use this option alone or in combination with a directory, collection, or document selector filter. False positives are possible; for details, see Understanding When Filters Are Accurate.
    • To select all documents in the database, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
  4. If you want to exclude some or all source document metadata:
    • Set -copy_collections to false to exclude document collections metadata.
    • Set -copy_permissions to false to exclude document permissions metadata.
    • Set -copy_properties to false to exclude document properties.
    • Set -copy_quality to false to exclude document quality metadata.
    • Set -copy_metadata to false to exclude document key-value metadata.
  5. If you want to add or override document metadata in the destination database:
    • Set -output_collections to add destination documents to a collection.
    • Set -output_permissions to add permissions to destination documents.
    • Set -output_quality to set the quality of destination documents.
  6. If you want the destination documents to have database URIs different from the source URIs, set -output_uri_replace, -output_uri_prefix, and/or -output_uri_suffix. For details, see Controlling Database URIs During Ingestion.

For a complete list of mlcp copy command options, see Copy Command Line Options.
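
The steps above can be combined into a single command. The following is a minimal sketch rather than a definitive recipe: the host names, credentials, collection URIs, and URI prefix are placeholders, and the script only prints the assembled command (a dry run) so you can review it before running mlcp yourself.

```shell
#!/usr/bin/env bash
# Sketch: assemble an mlcp copy command from the basic steps.
# All hosts, users, passwords, and URIs below are placeholders.
set -euo pipefail

args=(
  copy -mode local
  # Step 1: source instance and user
  -input_host srchost -input_port 8000
  -input_username user1 -input_password password1
  # Step 2: destination instance and user
  -output_host desthost -output_port 8000
  -output_username user2 -output_password password2
  # Step 3: copy only documents in these collections
  -collection_filter "reports,invoices"
  # Step 4: do not carry over source permissions
  -copy_permissions false
  # Step 5: add destination documents to a collection
  -output_collections copied
  # Step 6: prepend a prefix to each destination URI
  -output_uri_prefix /copied
)

# Dry run: print the command for review instead of executing it.
printf '%s ' mlcp.sh "${args[@]}"; echo
```

Keeping the options in an array makes it easy to comment each step and to add or drop a filter without reflowing line continuations. To run the copy for real, replace the final line with `mlcp.sh "${args[@]}"`.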

Examples

The following example copies all documents and their metadata from the source database to the destination database:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh copy -mode local -input_host srchost -input_port 8000 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8010 -output_username user2 \
    -output_password password2

The following example copies documents without their source permissions and adds them to two collections in the destination database:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh copy -mode local -input_host srchost -input_port 8000 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8000 -output_username user2 \
    -output_password password2 -copy_permissions false \
    -output_collections shakespeare,plays

For an example of using -query_filter, see Example: Exporting Documents Matching a Query.
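
As a copy-specific illustration, the sketch below combines -document_selector, -path_namespace, and -query_filter in one command. It is only a sketch under stated assumptions: the namespace prefix and URI, the XPath expression, and the search term are hypothetical, and the query is assumed to be the XML serialization of a simple cts:word-query. The script prints the arguments rather than running mlcp.

```shell
#!/usr/bin/env bash
# Sketch: combine a document selector with a query filter in a copy.
# The namespace URI, prefix, XPath, and search term are hypothetical.
set -euo pipefail

# Assumed XML serialization of a cts:word-query for the word "hamlet".
query='<cts:word-query xmlns:cts="http://marklogic.com/cts"><cts:text>hamlet</cts:text></cts:word-query>'

args=(
  copy -mode local
  -input_host srchost -input_port 8000
  -input_username user1 -input_password password1
  -output_host desthost -output_port 8000
  -output_username user2 -output_password password2
  # Bind the prefix "pl" so the XPath below can use it.
  -path_namespace 'pl,http://example.com/plays'
  # Select documents whose root element is pl:play.
  -document_selector '/pl:play'
  # Narrow the selection to documents matching the query.
  -query_filter "$query"
)

# Dry run: print each argument on its own line for review.
printf '%s\n' mlcp.sh "${args[@]}"
```

Single quotes keep the shell from interpreting the angle brackets and double quotes inside the serialized query. Because -query_filter can produce false positives, check the copied documents in the destination database if exact matching matters.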

Redacting Content During a Copy

Redaction is the process of eliminating or obscuring portions of a document when retrieving the document from MarkLogic. For example, you can eliminate or mask sensitive personal information such as credit card numbers, phone numbers, or email addresses from documents. You can only redact document content, not document properties.

Redaction is performed as documents are read from the source database. For example, if you copy documents between databases in two different MarkLogic installations, the unredacted content never leaves the source installation.

Redaction support in MarkLogic is covered in detail in Redacting Content During Export or Copy Operations and Redacting Document Content in the Application Developer's Guide.

Use the -redaction option to apply redaction rules during a copy. For example, the following command copies documents in the my_docs collection from one database to another, and applies the redaction rules in the rule collections hipaa-rules and biz-rules to the source documents before copying them to the destination database.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh copy -mode local -input_host srchost -input_port 8000 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8000 -output_username user2 \
    -output_password password2 -collection_filter my_docs \
    -redaction "hipaa-rules,biz-rules"

For more details, see Redacting Content During Export or Copy Operations.

Advanced Document Selection for Copy

The mlcp tool uses the MarkLogic Connector for Hadoop to distribute work across your MarkLogic cluster, even when run in local mode. When you use the mlcp copy command, the source MarkLogic Server instance acts as an input source for a Hadoop MapReduce job. Similarly, the destination MarkLogic Server instance acts as the output sink for the job. You can take low level control of the job by setting connector and Hadoop configuration properties.

This is an advanced technique. You should understand how to use the MarkLogic Connector for Hadoop before attempting this.

The following list describes some use cases in which you might choose to set low level configuration properties:

Similar use cases and techniques apply to export operations. For details, see Advanced Document Selection and Transformation.

The following table lists some connector and Hadoop configuration properties relevant to advanced configuration for copy.

Configuration Property Description
mapreduce.marklogic.input.mode
Controls whether the connector runs in basic or advanced mode. Set to advanced.
mapreduce.marklogic.input.splitquery
A query that generates input splits. This distributes the work required to extract documents from the source database. The query can be either XQuery or Server-Side JavaScript.
mapreduce.marklogic.input.query
A query that selects the input fragments to extract from the source database. You can use the input query to apply server-side transformations to each output item. The query can be either XQuery or Server-Side JavaScript.
mapreduce.inputformat.class
This property identifies a subclass of the connector InputFormat class, describing the type of the values produced by your input query. You can create your own InputFormat subclass, but most applications will use one of the classes defined by the connector, such as DocumentInputFormat, which is the default used by mlcp.
mapreduce.outputformat.class
This property identifies a subclass of the connector OutputFormat class, describing the type of input for the destination database. In most cases, you should use ContentOutputFormat.
mapreduce.map.class
Optional. This property identifies a subclass of org.apache.hadoop.mapreduce.Mapper. Defaults to com.marklogic.contentpump.DocumentMapper, but you can override it for more advanced use cases.

When you take low-level control of a copy operation, you can no longer use options such as -copy_collections, -copy_permissions, and -copy_properties to copy the various categories of metadata from the source database to the destination database. If you include the -copy_* options on the mlcp command line, they will be ignored.

You can pass a connector configuration file through mlcp with the -conf option. The -conf option must appear after -options_file (if present) and before any other mlcp options. The following example command demonstrates using the -conf option in a copy operation.

$ mlcp.sh copy -conf conf.xml -input_host srchost -input_port 8000 \
    -input_username user -input_password password \
    -output_host desthost -output_port 8000 \
    -output_username user -output_password password \
    -mode local

The following example connector configuration file includes an XQuery split query that selects documents from a specific collection (similar to what the -collection_filter option does), and an XQuery input query that selects specific elements of each document.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.job.inputformat.class</name>
    <value>com.marklogic.mapreduce.DocumentInputFormat</value>
  </property>
  <property>
    <name>mapreduce.job.outputformat.class</name>
    <value>com.marklogic.mapreduce.ContentOutputFormat</value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.mode</name>
    <value>advanced</value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.splitquery</name>
    <value><![CDATA[
xquery version "1.0-ml";
declare namespace wp="http://www.mediawiki.org/xml/export-0.4/";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";
(: Emit one (forest ID, estimated match count, host name) triple per forest. :)
let $conf := admin:get-configuration()
for $forest in xdmp:database-forests(xdmp:database())
let $host_id := admin:forest-get-host($conf, $forest)
let $host_name := admin:host-get-name($conf, $host_id)
let $cnt := xdmp:estimate(
  cts:search(fn:collection("mycoll"),
             cts:and-query(()), (), 0.0, $forest))
return
  ($forest, $cnt, $host_name)
    ]]></value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.query</name>
    <value><![CDATA[
xquery version "1.0-ml";
declare default element namespace "http://HadoopTest";
fn:collection("mycoll")//*:bar/*:foo
    ]]></value>
  </property>
</configuration>

For more details and examples, see the MarkLogic Connector for Hadoop Developer's Guide.

Copy Command Line Options

This section summarizes the command line options available with the mlcp copy command. The following command line options define your connection to MarkLogic:

Option Description
-input_host comma-list
Required. A comma separated list of hosts through which mlcp can connect to the source database. You must specify at least one host. For more details, see How mlcp Uses the Host List.
-input_port number
Port number of the source MarkLogic Server. There should be an XDBC App Server on this port. Default: 8000.
-input_username string
MarkLogic Server user with which to export documents. Required, unless using Kerberos authentication.
-input_password string
Password for the MarkLogic Server user specified with -input_username. Required, unless using Kerberos authentication.
-output_host comma-list
Required. A comma separated list of hosts through which mlcp can connect to the destination database. You must specify at least one host. For more details, see How mlcp Uses the Host List.
-output_port number
Port number of the destination MarkLogic Server. There should be an XDBC App Server on this port. Default: 8000.
-output_username string
MarkLogic Server user with which to import documents to the destination. Required, unless using Kerberos authentication.
-output_password string
Password for the MarkLogic Server user specified with -output_username. Required, unless using Kerberos authentication.

The following table lists command line options that define the characteristics of the copy operation:

Option Description
-batch_size number
The number of documents to load per request to MarkLogic Server. Default: 100. Maximum: 200.
-collection_filter comma-list
A comma-separated list of collection URIs. mlcp copies only documents in these collections, plus related metadata. This option may not be combined with -directory_filter. Default: All documents and related metadata.
-conf filename
Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-copy_collections boolean
Whether to copy document collections from the source database to the destination database. Default: true.
-copy_metadata boolean
Whether to copy document key-value metadata from the source database to the destination database. Default: true.
-copy_permissions boolean
Whether to copy document permissions from the source database to the destination database. Default: true.
-copy_properties boolean
Whether to copy document properties from the source database to the destination database. Default: true.
-copy_quality boolean
Whether to copy document quality from the source database to the destination database. Default: true.
-D property=value
Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-directory_filter comma-list
A comma-separated list of database directories. mlcp copies only documents from these directories, plus related metadata. Directory names should usually end with /. This option may not be combined with -collection_filter. Default: All documents and related metadata.
-document_selector string
Specifies an XPath expression used to select which documents are extracted from the source database. The XPath expression should select fragment roots. This option may not be combined with -directory_filter or -collection_filter. Default: All documents and related metadata.
-fastload boolean
Whether or not to force optimal performance, even at the risk of creating duplicate document URIs. See Time vs. Correctness: Understanding -fastload Tradeoffs. Default: false.
-hadoop_conf_dir string
When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode.
-input_database string
The name of the source database. Default: The database associated with the source App Server identified by -input_host and -input_port.
-input_ssl boolean
Enable/disable SSL secured communication with the input App Server. Default: false. If you set this option to true, your App Server must be SSL enabled. For details, see Connecting to MarkLogic Using SSL.
-input_ssl_protocol string
Specify the protocol mlcp should use when creating an SSL connection to the input App Server. You must include this option if you use the -input_ssl option to connect to an App Server configured to disable MarkLogic's default protocol (TLS). Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: tls. This option may be ignored if you use a Hadoop Connector conf file for SSL configuration; for details, see Advanced SSL Configuration.
-max_split_size number
The maximum number of document fragments processed per split. Default: 50000.
-mode string
Copy mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local, unless you set the HADOOP_CONF_DIR variable; for details, see Configuring Distributed Mode.
-path_namespace comma-list
Specifies one or more namespace prefix bindings for namespace prefixes usable in path expressions passed to -document_selector. The list items should be alternating pairs of prefix names and namespace URIs, such as 'pfx1,http://my/ns1,pfx2,http://my/ns2'.
-options_file string
Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_collections comma-list
A comma separated list of collection URIs. Output documents are added to these collections.
-output_database string
The name of the destination database. Default: The database associated with the destination App Server identified by -output_host and -output_port.
-output_permissions comma-list
A comma separated list of (role,capability) pairs to apply to loaded documents. Default: The default permissions associated with the user inserting the document. Example: -output_permissions role1,read,role2,update
-output_quality string
The quality to assign to output documents.
-output_partition string
The name of the database partition in which to create documents. Required when using range assignment policy. For details, see How Assignment Policy Affects Optimization and Range Partitions in the Administrator's Guide.
-output_ssl boolean
Enable/disable SSL secured communication with the output App Server. Default: false. If you set this option to true, your App Server must be SSL enabled. For details, see Connecting to MarkLogic Using SSL.
-output_ssl_protocol string
Specify the protocol mlcp should use when creating an SSL connection to the output App Server. You must include this option if you use the -output_ssl option to connect to an App Server configured to disable MarkLogic's default protocol (TLS). Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: tls. This option may be ignored if you use a Hadoop Connector conf file for SSL configuration; for details, see Advanced SSL Configuration.
-output_uri_prefix string
Specify a prefix to prepend to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion.
-output_uri_replace comma-list
A comma separated list of (regex,string) pairs that define string replacements to apply to the URIs of documents added to the database. The replacement strings must be enclosed in single quotes. For example, -output_uri_replace "regex1,'string1',regex2,'string2'"
-output_uri_suffix string
Specify a suffix to append to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion.
-query_filter string
Specifies a query to apply when selecting documents to be copied. The argument must be the XML serialization of a cts:query or JSON serialization of a cts.query. Only documents in the source database that match the query are considered for copying. For details, see Controlling What is Exported, Copied, or Extracted. False positives are possible; for details, see Understanding When Filters Are Accurate.
-redaction comma-list
Apply one or more redaction rule collections. The argument must be a comma-separated list of rule collection URIs. The rule collections must be installed in the schemas database on the source MarkLogic installation. For details and examples, see Redacting Content During Export or Copy Operations and Redacting Document Content in the Application Developer's Guide.
-restrict_input_hosts boolean
Restrict mlcp to connect to the source database only through the hosts listed in the -input_host option. Default: false (no restriction). For more details, see Restricting the Hosts mlcp Uses to Connect to MarkLogic.
-restrict_output_hosts boolean
Restrict mlcp to connect to the destination database only through the hosts listed in the -output_host option. Default: false (no restriction). For more details, see Restricting the Hosts mlcp Uses to Connect to MarkLogic.
-snapshot boolean
Whether or not to use a consistent point-in-time snapshot of the source database contents. Default: false. When true, the job submission time is used as the database read timestamp for selecting documents to export. For details, see Extracting a Consistent Database Snapshot.
-temporal_collection string
A temporal collection into which the documents are to be loaded in the destination database. For details on loading temporal documents into MarkLogic, see Using MarkLogic Content Pump (MLCP) to Load Temporal Documents in the Temporal Developer's Guide.
-thread_count number
The number of threads to spawn for concurrent copying. The total number of threads spawned by the process can be larger than this number, but this option caps the number of concurrent sessions with MarkLogic Server. Only available in local mode. Default: 4.
-transform_function string
The local name of a custom content transformation function installed on MarkLogic Server. Ignored if -transform_module is not specified. Default: transform. For details, see Transforming Content During Ingestion.
-transform_module string
The path in the modules database or modules directory of a custom content transformation function installed on MarkLogic Server. This option is required to enable a custom transformation. For details, see Transforming Content During Ingestion.
-transform_namespace string
The namespace URI of the custom content transformation function named by -transform_function. Ignored if -transform_module is not specified. Default: no namespace. For details, see Transforming Content During Ingestion.
-transform_param string
Optional extra data to pass through to a custom transformation function. Ignored if -transform_module is not specified. Default: no value. For details, see Transforming Content During Ingestion.
-transaction_size number
When loading documents into the destination database, the number of requests to MarkLogic Server in one transaction. Default: 10. Maximum: 4000/actualBatchSize.
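
The -batch_size and -transaction_size limits interact, so it can help to work the arithmetic through before tuning either option. The following sketch assumes the actual batch size equals the requested -batch_size, which may not hold in every configuration; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: work out the documented -transaction_size ceiling.
# Assumes the actual batch size equals the requested -batch_size.
set -euo pipefail

batch_size=100                                  # -batch_size default
max_transaction_size=$(( 4000 / batch_size ))   # documented maximum
docs_per_txn_default=$(( 10 * batch_size ))     # -transaction_size default is 10

echo "max -transaction_size: $max_transaction_size"
echo "documents per transaction at defaults: $docs_per_txn_default"
```

With the default batch size of 100, the ceiling is 40 requests per transaction, and the default settings commit up to 1000 documents per transaction. Raising -batch_size to its maximum of 200 halves the ceiling to 20.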
