Use the mlcp copy command to copy content and associated metadata from one MarkLogic Server database to another when both are reachable on the network. You can also copy data from offline forests to a MarkLogic Server database; for details, see Using Direct Access to Extract or Copy Documents.
To copy one database to another with mlcp:

1. Set -input_host, -input_port, -input_username, and -input_password to identify the source MarkLogic Server instance and user.
2. Set -output_host, -output_port, -output_username, and -output_password to identify the destination MarkLogic Server instance and user.
3. Select the documents to copy:
   - To select documents in specific collections, set -collection_filter to a comma separated list of collection URIs.
   - To select documents in specific database directories, set -directory_filter to a comma separated list of directory URIs.
   - To select documents with an XPath expression, set -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
   - To select documents that match a query, set -query_filter. You can use this option alone or in combination with a directory, collection or document selector filter. False positives are possible; for details, see Understanding When Filters Are Accurate.
   - To copy all documents, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
4. To exclude some or all of the source document metadata:
   - Set -copy_collections to false to exclude document collections metadata.
   - Set -copy_permissions to false to exclude document permissions metadata.
   - Set -copy_properties to false to exclude document properties.
   - Set -copy_quality to false to exclude document quality metadata.
   - Set -copy_metadata to false to exclude document key-value metadata.
5. To change the URIs of the copied documents, set -output_uri_replace, -output_uri_prefix, and/or -output_uri_suffix. For details, see Controlling Database URIs During Ingestion.

For a complete list of mlcp copy command options, see Copy Command Line Options.
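As a rough illustration of how the steps above combine, the following sketch assembles a copy command that selects one collection, leaves out properties and quality, and prepends a URI prefix. The host names, credentials, the collection URI plays, and the prefix /copied are placeholders, not values from any real deployment. The script only prints the command so you can review it; it does not invoke mlcp.

```shell
#!/bin/sh
# Dry-run sketch: build the argument list for an mlcp copy and print it.
# All host names, credentials, and URIs below are placeholders (assumptions).
set -- copy -mode local \
  -input_host srchost -input_port 8000 \
  -input_username user1 -input_password password1 \
  -output_host desthost -output_port 8000 \
  -output_username user2 -output_password password2 \
  -collection_filter plays \
  -copy_properties false -copy_quality false \
  -output_uri_prefix /copied

# Print one token per line for review. To run the copy for real,
# replace the next line with:  mlcp.sh "$@"
printf '%s\n' mlcp.sh "$@"
```

Keeping the arguments in the positional parameters avoids quoting problems when an option value contains spaces or special characters.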
The following example copies all documents and their metadata from the source database to the destination database:
    # Windows users, see Modifying the Example Commands for Windows
    $ mlcp.sh copy -mode local -input_host srchost -input_port 8000 \
        -input_username user1 -input_password password1 \
        -output_host desthost -output_port 8010 -output_username user2 \
        -output_password password2
The following example copies documents without their source permissions and adds them to two collections, shakespeare and plays, in the destination database:
    # Windows users, see Modifying the Example Commands for Windows
    $ mlcp.sh copy -mode local -input_host srchost -input_port 8000 \
        -input_username user1 -input_password password1 \
        -output_host desthost -output_port 8000 -output_username user2 \
        -output_password password2 -copy_permissions false \
        -output_collections shakespeare,plays
For an example of using -query_filter, see Example: Exporting Documents Matching a Query.
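For orientation, a query filter can be combined with a directory filter in a copy along the following lines. This is a sketch under stated assumptions: the directory /plays/, the search term hamlet, and the connection values are placeholders, and the query string is intended to be the XML serialization of a simple cts:word-query, which you should confirm by serializing the query in your own environment. The script prints the command rather than running it.

```shell
#!/bin/sh
# Assumed XML serialization of cts:word-query("hamlet"); verify it against
# your MarkLogic version before relying on it.
query='<cts:word-query xmlns:cts="http://marklogic.com/cts"><cts:text>hamlet</cts:text></cts:word-query>'

# Placeholder connection values; -directory_filter narrows the candidates
# and -query_filter is applied in combination with it.
set -- copy -mode local \
  -input_host srchost -input_port 8000 \
  -input_username user1 -input_password password1 \
  -output_host desthost -output_port 8000 \
  -output_username user2 -output_password password2 \
  -directory_filter /plays/ \
  -query_filter "$query"

# Print one token per line; the query stays a single argument.
# To run the copy for real, replace the next line with:  mlcp.sh "$@"
printf '%s\n' mlcp.sh "$@"
```

Single quotes around the serialized query keep the shell from interpreting the double quotes and angle brackets inside it.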
Redaction is the process of eliminating or obscuring portions of a document when retrieving the document from MarkLogic. For example, you can eliminate or mask sensitive personal information such as credit card numbers, phone numbers, or email addresses from documents. You can only redact document content, not document properties.
Redaction is performed as documents are read from the source database. For example, if you copy documents between databases in two different MarkLogic installations, the unredacted content never leaves the source installation.
Redaction support in MarkLogic is covered in detail in Redacting Content During Export or Copy Operations and Redacting Document Content in the Application Developer's Guide.
Use the -redaction option to apply redaction rules during a copy. For example, the following command copies documents in the my_docs collection from one database to another, and applies the redaction rules in the rule collections hipaa-rules and biz-rules to the source documents before copying them to the destination database.
    # Windows users, see Modifying the Example Commands for Windows
    $ mlcp.sh copy -mode local -input_host srchost -input_port 8000 \
        -input_username user1 -input_password password1 \
        -output_host desthost -output_port 8000 -output_username user2 \
        -output_password password2 -collection_filter my_docs \
        -redaction "hipaa-rules,biz-rules"
For more details, see Redacting Content During Export or Copy Operations.
The mlcp tool uses the MarkLogic Connector for Hadoop to distribute work across your MarkLogic cluster, even when run in local mode. When you use the mlcp copy command, the source MarkLogic Server instance acts as an input source for a Hadoop MapReduce job. Similarly, the destination MarkLogic Server instance acts as the output sink for the job. You can take low level control of the job by setting connector and Hadoop configuration properties.
This is an advanced technique. You should understand how to use the MarkLogic Connector for Hadoop before attempting this.
The following list describes some use cases in which you might choose to set low level configuration properties:
Similar use cases and techniques apply to export operations. For details, see Advanced Document Selection and Transformation.
The following table lists some connector and Hadoop configuration properties relevant to advanced configuration for copy.
When you take low-level control of a copy operation, you can no longer use options such as -copy_collections, -copy_permissions, and -copy_properties to copy the various categories of metadata from the source database to the destination database. If you include the -copy_* options on the mlcp command line, they will be ignored.
You can pass a connector configuration file through mlcp with the -conf option. The -conf option must appear after -options_file (if present) and before any other mlcp options. The following example command demonstrates using the -conf option in a copy operation.
    $ mlcp.sh copy -conf conf.xml -input_host srchost -input_port 8000 \
        -input_username user -input_password password \
        -output_host desthost -output_port 8000 \
        -output_username user -output_password password \
        -mode local
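The ordering rule for -options_file and -conf can be sketched as follows. This is an illustration only: the file name copy-options.txt is hypothetical, the connection values are placeholders, and the options file is written with one option or value per line, which is assumed here to be acceptable options file syntax (see Options File Syntax for the authoritative rules). The script writes the options file and prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical options file name; connection values are placeholders.
opts_file=copy-options.txt

# One option or value per line (assumed options file layout).
cat > "$opts_file" <<'EOF'
-input_host
srchost
-input_port
8000
-input_username
user
-input_password
password
-output_host
desthost
-output_port
8000
-output_username
user
-output_password
password
EOF

# Ordering: -options_file first, then -conf, then any other mlcp options.
set -- copy -options_file "$opts_file" -conf conf.xml -mode local

# Print the command for review. To run it, use:  mlcp.sh "$@"
printf '%s ' mlcp.sh "$@"; echo
```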
The following example connector configuration file includes an XQuery split query that selects documents from a specific collection (similar to what the -collection_filter option does), and an XQuery input query that selects specific elements of each document.
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapreduce.job.inputformat.class</name>
        <value>com.marklogic.mapreduce.DocumentInputFormat</value>
      </property>
      <property>
        <name>mapreduce.job.outputformat.class</name>
        <value>com.marklogic.mapreduce.ContentOutputFormat</value>
      </property>
      <property>
        <name>mapreduce.marklogic.input.mode</name>
        <value>advanced</value>
      </property>
      <property>
        <name>mapreduce.marklogic.input.splitquery</name>
        <value><![CDATA[
          xquery version "1.0-ml";
          declare namespace wp="http://www.mediawiki.org/xml/export-0.4/";
          import module namespace admin = "http://marklogic.com/xdmp/admin"
            at "/MarkLogic/admin.xqy";
          let $conf := admin:get-configuration()
          for $forest in xdmp:database-forests(xdmp:database())
          let $host_id := admin:forest-get-host($conf,$forest)
          let $host_name := admin:host-get-name($conf,$host_id)
          let $cnt := xdmp:estimate(
            cts:search(fn:collection("mycoll"),
                       cts:and-query(()),(),0.0,$forest))
          return ($forest,$cnt,$host_name)
        ]]></value>
      </property>
      <property>
        <name>mapreduce.marklogic.input.query</name>
        <value><![CDATA[
          xquery version "1.0-ml";
          declare default element namespace "http://HadoopTest";
          fn:collection("mycoll")//*:bar/*:foo
        ]]></value>
      </property>
    </configuration>
For more details and examples, see the MarkLogic Connector for Hadoop Developer's Guide.
This section summarizes the command line options available with the mlcp copy command.

The following command line options define your connection to MarkLogic:
Option | Description |
---|---|
-input_host comma-list | Required. A comma separated list of hosts through which mlcp can connect to the source database. You must specify at least one host. For more details, see How mlcp Uses the Host List. |
-input_port number | Port number of the source MarkLogic Server. There should be an XDBC App Server on this port. Default: 8000. |
-input_username string | MarkLogic Server user with which to export documents. Required, unless using Kerberos authentication. |
-input_password string | Password for the MarkLogic Server user specified with -input_username. Required, unless using Kerberos authentication. |
-output_host comma-list | Required. A comma separated list of hosts through which mlcp can connect to the destination database. You must specify at least one host. For more details, see How mlcp Uses the Host List. |
-output_port number | Port number of the destination MarkLogic Server. There should be an XDBC App Server on this port. Default: 8000. |
-output_username string | MarkLogic Server user with which to import documents to the destination. Required, unless using Kerberos authentication. |
-output_password string | Password for the MarkLogic Server user specified with -output_username. Required, unless using Kerberos authentication. |
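Because -input_host and -output_host take comma separated lists, a script that keeps its hosts in a space separated variable needs to join them first. The sketch below shows one way to do that in plain sh; the example.com host names and the credentials are made-up placeholders, and the script prints the resulting command rather than running it.

```shell
#!/bin/sh
# Placeholder host names (assumptions), kept space separated for readability.
src_hosts="src1.example.com src2.example.com src3.example.com"

# Join the hosts with commas, the form -input_host expects.
input_host_list=$(printf '%s' "$src_hosts" | tr ' ' ',')

set -- copy -mode local \
  -input_host "$input_host_list" -input_port 8000 \
  -input_username user1 -input_password password1 \
  -output_host desthost -output_port 8010 \
  -output_username user2 -output_password password2

# Print the command for review. To run it, use:  mlcp.sh "$@"
printf '%s ' mlcp.sh "$@"; echo
```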
The following table lists command line options that define the characteristics of the copy operation:
Option | Description |
---|---|
-batch_size number | The number of documents to load per request to MarkLogic Server. Default: 100. Maximum: 200. |
-collection_filter comma-list | A comma-separated list of collection URIs. mlcp copies only documents in these collections, plus related metadata. This option may not be combined with -directory_filter. Default: All documents and related metadata. |
-conf filename | Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options. |
-copy_collections boolean | Whether to copy document collections from the source database to the destination database. Default: true. |
-copy_metadata boolean | Whether to copy document key-value metadata from the source database to the destination database. Default: true. |
-copy_permissions boolean | Whether to copy document permissions from the source database to the destination database. Default: true. |
-copy_properties boolean | Whether to copy document properties from the source database to the destination database. Default: true. |
-copy_quality boolean | Whether to copy document quality from the source database to the destination database. Default: true. |
-D property=value | Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options. |
-directory_filter comma-list | A comma-separated list of database directories. mlcp copies only documents from these directories, plus related metadata. Directory names should usually end with /. This option may not be combined with -collection_filter. Default: All documents and related metadata. |
-document_selector string | Specifies an XPath expression used to select which documents are extracted from the source database. The XPath expression should select fragment roots. This option may not be combined with -directory_filter or -collection_filter. Default: All documents and related metadata. |
-fastload boolean | Whether or not to force optimal performance, even at the risk of creating duplicate document URIs. See Time vs. Correctness: Understanding -fastload Tradeoffs. Default: false. |
-hadoop_conf_dir string | When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode. |
-input_database string | The name of the source database. Default: The database associated with the source App Server identified by -input_host and -input_port. |
-input_ssl boolean | Enable/disable SSL secured communication with the input App Server. Default: false. If you set this option to true, your App Server must be SSL enabled. For details, see Connecting to MarkLogic Using SSL. |
-input_ssl_protocol string | Specify the protocol mlcp should use when creating an SSL connection to the input App Server. You must include this option if you use the -input_ssl option to connect to an App Server configured to disable MarkLogic's default protocol (TLS). Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: tls. This option may be ignored if you use a Hadoop Connector conf file for SSL configuration; for details, see Advanced SSL Configuration. |
-max_split_size number | The maximum number of document fragments processed per split. Default: 50000. |
-mode string | Copy mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local, unless you set the HADOOP_CONF_DIR variable; for details, see Configuring Distributed Mode. |
-path_namespace comma-list | Specifies one or more namespace prefix bindings for namespace prefixes usable in path expressions passed to -document_selector. The list items should be alternating pairs of prefix names and namespace URIs, such as 'pfx1,http://my/ns1,pfx2,http://my/ns2'. |
-options_file string | Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax. |
-output_collections comma-list | A comma separated list of collection URIs. Output documents are added to these collections. |
-output_database string | The name of the destination database. Default: The database associated with the destination App Server identified by -output_host and -output_port. |
-output_permissions comma-list | A comma separated list of (role,capability) pairs to apply to loaded documents. Default: The default permissions associated with the user inserting the document. Example: -output_permissions role1,read,role2,update |
-output_quality string | The quality to assign to output documents. |
-output_partition string | The name of the database partition in which to create documents. Required when using range assignment policy. For details, see How Assignment Policy Affects Optimization and Range Partitions in the Administrator's Guide. |
-output_ssl boolean | Enable/disable SSL secured communication with the output App Server. Default: false. If you set this option to true, your App Server must be SSL enabled. For details, see Connecting to MarkLogic Using SSL. |
-output_ssl_protocol string | Specify the protocol mlcp should use when creating an SSL connection to the output App Server. You must include this option if you use the -output_ssl option to connect to an App Server configured to disable MarkLogic's default protocol (TLS). Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: tls. This option may be ignored if you use a Hadoop Connector conf file for SSL configuration; for details, see Advanced SSL Configuration. |
-output_uri_prefix string | Specify a prefix to prepend to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion. |
-output_uri_replace comma-list | A comma separated list of (regex,string) pairs that define string replacements to apply to the URIs of documents added to the database. The replacement strings must be enclosed in single quotes. For example, -output_uri_replace "regex1,'string1',regex2,'string2'" |
-output_uri_suffix string | Specify a suffix to append to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion. |
-query_filter string | Specifies a query to apply when selecting documents to be copied. The argument must be the XML serialization of a cts:query or JSON serialization of a cts.query. Only documents in the source database that match the query are considered for copying. For details, see Controlling What is Exported, Copied, or Extracted. False positives are possible; for details, see Understanding When Filters Are Accurate. |
-redaction comma-list | Apply one or more redaction rule collections. The argument must be a comma-separated list of rule collection URIs. The rule collections must be installed in the schemas database on the source MarkLogic installation. For details and examples, see Redacting Content During Export or Copy Operations and Redacting Document Content in the Application Developer's Guide. |
-restrict_input_hosts boolean | Restrict mlcp to connect to the source database only through the hosts listed in the -input_host option. Default: false (no restriction). For more details, see Restricting the Hosts mlcp Uses to Connect to MarkLogic. |
-restrict_output_hosts boolean | Restrict mlcp to connect to the destination database only through the hosts listed in the -output_host option. Default: false (no restriction). For more details, see Restricting the Hosts mlcp Uses to Connect to MarkLogic. |
-snapshot boolean | Whether or not to use a consistent point-in-time snapshot of the source database contents. Default: false. When true, the job submission time is used as the database read timestamp for selecting documents to copy. For details, see Extracting a Consistent Database Snapshot. |
-temporal_collection string | A temporal collection into which the documents are to be loaded in the destination database. For details on loading temporal documents into MarkLogic, see Using MarkLogic Content Pump (MLCP) to Load Temporal Documents in the Temporal Developer's Guide. |
-thread_count number | The number of threads to spawn for concurrent copying. The total number of threads spawned by the process can be larger than this number, but this option caps the number of concurrent sessions with MarkLogic Server. Only available in local mode. Default: 4. |
-transform_function string | The local name of a custom content transformation function installed on MarkLogic Server. Ignored if -transform_module is not specified. Default: transform. For details, see Transforming Content During Ingestion. |
-transform_module string | The path in the modules database or modules directory of a custom content transformation function installed on MarkLogic Server. This option is required to enable a custom transformation. For details, see Transforming Content During Ingestion. |
-transform_namespace string | The namespace URI of the custom content transformation function named by -transform_function. Ignored if -transform_module is not specified. Default: no namespace. For details, see Transforming Content During Ingestion. |
-transform_param string | Optional extra data to pass through to a custom transformation function. Ignored if -transform_module is not specified. Default: no extra data. For details, see Transforming Content During Ingestion. |
-transaction_size number | When loading documents into the destination database, the number of requests to MarkLogic Server in one transaction. Default: 10. Maximum: 4000/actualBatchSize. |
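The -transaction_size ceiling of 4000/actualBatchSize is easier to see with numbers. The sketch below works through the arithmetic for the default batch size and clamps a requested transaction size to the ceiling. It assumes the actual batch size equals the value you pass to -batch_size, which may not hold in every configuration, and the requested value of 50 is arbitrary.

```shell
#!/bin/sh
# Assumption: the actual batch size equals the -batch_size value (default 100).
batch_size=100

# Ceiling from the table above: 4000 / actualBatchSize (integer division).
max_txn=$((4000 / batch_size))

# Arbitrary requested value, clamped to the ceiling.
requested_txn=50
if [ "$requested_txn" -gt "$max_txn" ]; then
  txn=$max_txn
else
  txn=$requested_txn
fi

echo "batch_size=$batch_size permits -transaction_size up to $max_txn"
echo "using -batch_size $batch_size -transaction_size $txn"
```

With the maximum batch size of 200, the same calculation gives a ceiling of 20 requests per transaction.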