
mlcp User Guide — Chapter 6

Copying Content Between Databases

Use the mlcp copy command to copy content and associated metadata from one MarkLogic Server database to another when both are reachable on the network. You can also copy data from offline forests to a MarkLogic Server database; for details, see Using Direct Access to Extract or Copy Documents.


Basic Steps

To copy one database to another with mlcp:

  1. Set -input_host, -input_port, -input_username, and -input_password to identify the source MarkLogic Server instance and user.
  2. Set -output_host, -output_port, -output_username, and -output_password to identify the destination MarkLogic Server instance and user.
  3. Select what documents to copy. For details, see Filtering Archive and Copy Contents.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select documents matching an XPath expression, use -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
    • To select documents matching a query, use -query_filter. You can use this option alone or in combination with a directory, collection, or document selector filter. False positives are possible; for details, see Understanding When Filters Are Accurate.
    • To select all documents in the database, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
  4. If you want to exclude some or all source document metadata:
    • Set -copy_collections to false to exclude document collections metadata.
    • Set -copy_permissions to false to exclude document permissions metadata.
    • Set -copy_properties to false to exclude document properties.
    • Set -copy_quality to false to exclude document quality metadata.
  5. If you want to add or override document metadata in the destination database:
    • Set -output_collections to add destination documents to a collection.
    • Set -output_permissions to add permissions to destination documents.
    • Set -output_quality to set the quality of destination documents.
  6. If you want the destination documents to have database URIs different from the source URIs, set -output_uri_replace, -output_uri_prefix, and/or -output_uri_suffix. For details, see Controlling Database URIs During Ingestion.
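Putting several of the steps above together, a copy that selects documents by database directory and rewrites the destination URIs might look like the following sketch. The hostnames, credentials, and directory paths are placeholder values, not real servers:

```shell
# Hypothetical values throughout; substitute your own hosts, ports,
# credentials, and directory paths.
# Copies documents under /plays/ and relocates them under /drama/
# in the destination database.
mlcp.sh copy -mode local \
    -input_host srchost -input_port 8000 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8000 \
    -output_username user2 -output_password password2 \
    -directory_filter /plays/ \
    -output_uri_replace "/plays/,'/drama/'"
```

Note that the replacement string in -output_uri_replace must be enclosed in single quotes, as described in Copy Command Line Options.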

For a complete list of mlcp copy command options, see Copy Command Line Options.

Examples

The following example copies all documents and their metadata from the source database to the destination database:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh copy -mode local -input_host srchost -input_port 8000 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8010 -output_username user2 \
    -output_password password2

The following example copies selected documents, excluding the source permissions and adding the documents to two new collections in the destination database:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh copy -mode local -input_host srchost -input_port 8000 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8000 -output_username user2 \
    -output_password password2 -copy_permissions false \
    -output_collections shakespeare,plays

For an example of using -query_filter, see Example: Exporting Documents Matching a Query.
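As a sketch of the serialized query form that -query_filter expects, the following hypothetical command copies only documents matching a cts:word-query. The hosts and credentials are placeholders, and the XML argument is the serialization of cts:word-query("hamlet"):

```shell
# Hypothetical example: copy only documents containing the word "hamlet".
# The -query_filter argument is the XML serialization of a cts:word-query.
mlcp.sh copy -mode local \
    -input_host srchost -input_port 8000 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8000 \
    -output_username user2 -output_password password2 \
    -query_filter '<cts:word-query xmlns:cts="http://marklogic.com/cts"><cts:text xml:lang="en">hamlet</cts:text></cts:word-query>'
```

Remember that -query_filter is an unfiltered search, so false positives are possible; see Understanding When Filters Are Accurate.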

Advanced Document Selection for Copy

The mlcp tool uses the MarkLogic Connector for Hadoop to distribute work across your MarkLogic cluster, even when run in local mode. When you use the mlcp copy command, the source MarkLogic Server instance acts as an input source for a Hadoop MapReduce job. Similarly, the destination MarkLogic Server instance acts as the output sink for the job. You can take low level control of the job by setting connector and Hadoop configuration properties.

This is an advanced technique. You should understand how to use the MarkLogic Connector for Hadoop before attempting this. For details, see Advanced Input Mode in the MarkLogic Connector for Hadoop Developer's Guide.

You might choose to set low level configuration properties when, for example, you need a custom split query to control how the extraction work is partitioned across forests, or a custom input query to transform documents server-side as they are read from the source database.

Similar use cases and techniques apply to export operations. For details, see Advanced Document Selection and Transformation.

The following table lists some connector and Hadoop configuration properties relevant to advanced configuration for copy.

Configuration Property / Description
mapreduce.marklogic.input.mode
Controls whether the connector runs in basic or advanced mode. Set to 'advanced'.
mapreduce.marklogic.input.splitquery
A query that generates input splits. This distributes the work required to extract documents from the source database. The query can be either XQuery or Server-Side JavaScript. For details, see Creating Input Splits in the MarkLogic Connector for Hadoop Developer's Guide.
mapreduce.marklogic.input.query
A query that selects the input fragments to extract from the source database. You can use the input query to apply server-side transformations to each output item. The query can be either XQuery or Server-Side JavaScript. For details, see Creating Input Key-Value Pairs in the MarkLogic Connector for Hadoop Developer's Guide.
mapreduce.inputformat.class
This property identifies a subclass of the connector InputFormat class, describing the 'type' of the values produced by your input query. You can create your own InputFormat subclass, but most applications will use one of the classes defined by the connector, such as DocumentInputFormat, which is the default used by mlcp. For details, see InputFormat Subclasses in the MarkLogic Connector for Hadoop Developer's Guide.
mapreduce.outputformat.class
This property identifies a subclass of the connector OutputFormat class, describing the 'type' of input for the destination database. In most cases, you should use ContentOutputFormat. For details, see OutputFormat Subclasses in the MarkLogic Connector for Hadoop Developer's Guide.
mapreduce.map.class
Optional. This property identifies a subclass of org.apache.hadoop.mapreduce.Mapper. Defaults to com.marklogic.contentpump.DocumentMapper, but you can override it for more advanced use cases. For details, see Defining the Map Function in the MarkLogic Connector for Hadoop Developer's Guide.

When you take low-level control of a copy operation, you can no longer use options such as -copy_collections, -copy_permissions, and -copy_properties to copy the various categories of metadata from the source database to the destination database. If you include the -copy_* options on the mlcp command line, they will be ignored.

You can pass a connector configuration file through mlcp with the -conf option. The -conf option must appear after -options_file (if present) and before any other mlcp options. The following example command demonstrates using the -conf option in a copy operation.

$ mlcp.sh copy -conf conf.xml -input_host srchost -input_port 8000 \
    -input_username user -input_password password \
    -output_host desthost -output_port 8000 \
    -output_username user -output_password password \
    -mode local

The following example connector configuration file includes an XQuery split query that selects documents from a specific collection (similar to what the -collection_filter option does), and an XQuery input query that selects specific elements of each document.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.job.inputformat.class</name>
    <value>com.marklogic.mapreduce.DocumentInputFormat</value>
  </property>
  <property>
    <name>mapreduce.job.outputformat.class</name>
    <value>com.marklogic.mapreduce.ContentOutputFormat</value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.mode</name>
    <value>advanced</value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.splitquery</name>
    <value><![CDATA[
xquery version "1.0-ml";
declare namespace wp="http://www.mediawiki.org/xml/export-0.4/";
import module namespace admin ="http://marklogic.com/xdmp/admin" 
  at "/MarkLogic/admin.xqy";
let $conf := admin:get-configuration()
for $forest in xdmp:database-forests(xdmp:database())
let $host_id :=admin:forest-get-host($conf,$forest)
let $host_name := admin:host-get-name($conf,$host_id)
let $cnt := xdmp:estimate(
  cts:search(fn:collection("mycoll"),
             cts:and-query(()),(),0.0,$forest))
return
($forest,$cnt,$host_name)
    ]]></value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.query</name>
    <value><![CDATA[
xquery version "1.0-ml";
declare default element namespace "http://HadoopTest";
fn:collection("mycoll")//*:bar/*:foo
    ]]></value>
  </property>
</configuration>

For more details and examples, see the MarkLogic Connector for Hadoop Developer's Guide.

Copy Command Line Options

This section summarizes the command line options available with the mlcp copy command. The following command line options define your connection to MarkLogic:

Option / Description
-input_host string
Hostname of the source MarkLogic Server. Required.
-input_port number
Port number of the source MarkLogic Server. There should be an XDBC App Server on this port. The App Server must not be SSL-enabled. Default: 8000.
-input_username string
MarkLogic Server user with which to export documents. Required, unless using Kerberos authentication.
-input_password string
Password for the MarkLogic Server user specified with -input_username. Required, unless using Kerberos authentication.
-output_host string
Hostname of the destination MarkLogic Server. Required.
-output_port number
Port number of the destination MarkLogic Server. There should be an XDBC App Server on this port. The App Server must not be SSL-enabled. Default: 8000.
-output_username string
MarkLogic Server user with which to import documents to the destination. Required, unless using Kerberos authentication.
-output_password string
Password for the MarkLogic Server user specified with -output_username. Required, unless using Kerberos authentication.

The following table lists command line options that define the characteristics of the copy operation:

Option / Description
-batch_size number
The number of documents to load per request to MarkLogic Server. This option is ignored when you use -transform_module; a transform always sets the batch size to 1. Default: 100. Maximum: 200.
-collection_filter comma-list
A comma-separated list of collection URIs. mlcp exports only documents in these collections, plus related metadata. This option may not be combined with -directory_filter. Default: All documents and related metadata.
-conf filename
Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-copy_collections boolean
Whether to copy document collections from the source database to the destination database. Default: true.
-copy_permissions boolean
Whether to copy document permissions from the source database to the destination database. Default: true.
-copy_properties boolean
Whether to copy document properties from the source database to the destination database. Default: true.
-copy_quality boolean
Whether to copy document quality from the source database to the destination database. Default: true.
-D property=value
Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-directory_filter comma-list
A comma-separated list of database directories. mlcp exports only documents from these directories, plus related metadata. Directory names should usually end with '/'. This option may not be combined with -collection_filter. Default: All documents and related metadata.
-document_selector string
Specifies an XPath expression used to select which documents are extracted from the source database. The XPath expression should select fragment roots. This option may not be combined with -directory_filter or -collection_filter. Default: All documents and related metadata.
-fastload boolean
Whether or not to force optimal performance, even at the risk of creating duplicate document URIs. See Time vs. Correctness: Understanding -fastload Tradeoffs. Default: false.
-hadoop_conf_dir string
When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode.
-input_database string
The name of the source database. Default: The database associated with the source App Server identified by -input_host and -input_port.
-max_split_size number
The maximum number of document fragments processed per split. Default: 50000.
-mode string
Copy mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local, unless you set the HADOOP_CONF_DIR variable; for details, see Configuring Distributed Mode.
-path_namespace comma-list
Specifies one or more namespace prefix bindings for namespace prefixes usable in path expressions passed to -document_selector. The list items should be alternating pairs of prefix names and namespace URIs, such as 'pfx1,http://my/ns1,pfx2,http://my/ns2'.
-options_file string
Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_collections comma-list
A comma separated list of collection URIs. Output documents are added to these collections.
-output_database string
The name of the destination database. Default: The database associated with the destination App Server identified by -output_host and -output_port.
-output_permissions comma-list
A comma separated list of (role,capability) pairs to apply to loaded documents. Default: The default permissions associated with the user inserting the document. Example: -output_permissions role1,read,role2,update
-output_quality string
The quality to assign to output documents.
-output_partition string
The name of the database partition in which to create documents. Required when using range assignment policy. For details, see How Assignment Policy Affects Optimization and Partitions, Partition Keys, and Partition Ranges in the Administrator's Guide.
-output_uri_prefix string
Specify a prefix to prepend to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion.
-output_uri_replace comma-list
A comma separated list of (regex,string) pairs that define string replacements to apply to the URIs of documents added to the database. The replacement strings must be enclosed in single quotes. For example, -output_uri_replace "regex1,'string1',regext2,'string2'"
-output_uri_suffix string
Specify a suffix to append to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion.
-query_filter string
Specifies a query to apply when selecting documents to be copied. The argument must be the XML serialization of a cts:query or JSON serialization of a cts.query. Only documents in the source database that match the query are considered for copying. For details, see Controlling What is Exported, Copied, or Extracted. False positives are possible; for details, see Understanding When Filters Are Accurate.
-snapshot boolean
Whether or not to use a consistent point-in-time snapshot of the source database contents. Default: false. When true, the job submission time is used as the database read timestamp for selecting documents to export. For details, see Extracting a Consistent Database Snapshot.
-temporal_collection string
A temporal collection into which the documents are to be loaded in the destination database. For details on loading temporal documents into MarkLogic, see Using MarkLogic Content Pump (MLCP) to Load Temporal Documents in the Temporal Developer's Guide.
-thread_count number
The number of threads to spawn for concurrent copying. The total number of threads spawned by the process can be larger than this number, but this option caps the number of concurrent sessions with MarkLogic Server. Only available in local mode. Default: 4.
-transform_function string
The localname of a custom content transformation function installed on MarkLogic Server. Ignored if -transform_module is not specified. Default: transform. For details, see Transforming Content During Ingestion.
-transform_module string
The path in the modules database or modules directory of a custom content transformation function installed on MarkLogic Server. This option is required to enable a custom transformation. For details, see Transforming Content During Ingestion.
-transform_namespace string
The namespace URI of the custom content transformation function named by -transform_function. Ignored if -transform_module is not specified. Default: no namespace. For details, see Transforming Content During Ingestion.
-transform_param string
Optional extra data to pass through to a custom transformation function. Ignored if -transform_module is not specified. Default: none. For details, see Transforming Content During Ingestion.
-transaction_size number
When loading documents into the destination database, the number of requests to MarkLogic Server in one transaction. Default: 10. Maximum: 4000/actualBatchSize.
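As an illustration of combining -document_selector with -path_namespace from the table above, the following hypothetical command copies only documents whose root element is PLAY in an assumed namespace. The hosts, credentials, and namespace URI are placeholders:

```shell
# Hypothetical example: copy only documents with a pfx:PLAY root element.
# The pfx prefix is bound to an assumed namespace URI via -path_namespace.
mlcp.sh copy -mode local \
    -input_host srchost -input_port 8000 \
    -input_username user1 -input_password password1 \
    -output_host desthost -output_port 8000 \
    -output_username user2 -output_password password2 \
    -document_selector /pfx:PLAY \
    -path_namespace pfx,http://my/play/ns
```

Because the XPath expression must select fragment roots, this selector cannot be combined with -directory_filter or -collection_filter.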
