You can export content in a MarkLogic Server database to files or an archive. Use archives to copy content from one MarkLogic Server database to another. Output can be written to the native filesystem or to HDFS.
For a list of export related command line options, see Export Command Line Options.
You can also use mlcp to extract documents directly from offline forests. For details, see Using Direct Access to Extract or Copy Documents.
This section covers the following topics:
Use the mlcp export command to export documents in their original format as files on the native filesystem or HDFS. For example, you can export an XML document as a text file containing XML, or a binary document as a JPG image.
To export documents from a database as files:
1. Select the documents to export. Do one of the following:
   - Set -collection_filter to a comma separated list of collection URIs.
   - Set -directory_filter to a comma separated list of directory URIs.
   - Select documents with an XPath expression using -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
   - Select documents with a query using -query_filter, alone or in combination with one of the other filter options. False positives are possible; for details, see Understanding When Filters Are Accurate.
   - To select all documents in the database, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
2. Set -output_file_path to the destination file or directory on the native filesystem or HDFS.
3. To prettyprint exported XML, set -indented to true.
Directory names specified with -directory_filter should end with /.
When using -document_selector to filter by XPath expression, you can define namespace prefixes using the -path_namespace option. For example:
-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2' -document_selector '/ex1:elem[ex2:attr > 10]'
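As the example shows, the -path_namespace value is a flat comma-separated list that alternates prefix and namespace URI. The following Python sketch only illustrates how the items pair up; it is not part of mlcp, and the function name parse_path_namespace is hypothetical.

```python
def parse_path_namespace(value: str) -> dict[str, str]:
    """Pair up a flat 'prefix,uri,prefix,uri' list into {prefix: uri}."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) % 2 != 0:
        raise ValueError("expected prefix,uri pairs (an even number of items)")
    # Even positions are prefixes, odd positions are namespace URIs.
    return dict(zip(parts[0::2], parts[1::2]))


bindings = parse_path_namespace(
    "ex1,http://marklogic.com/example,ex2,http://my/ex2"
)
print(bindings)
```

A check like this can catch a missing URI or prefix before the value is passed on the command line.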
Document URIs are URI-decoded before filesystem directories or filenames are constructed for them. For details, see How URI Decoding Affects Output File Names.
For a full list of export options, see Export Command Line Options.
The following example exports selected documents in the database to the native filesystem directory /space/mlcp/export/files. The directory filter selects only the documents in /plays.
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local -output_file_path \
    /space/mlcp/export/files -directory_filter /plays/
Use the mlcp export command to export documents in their original format as files in a compressed ZIP file on the native filesystem or HDFS.
To export documents from a database as files:
1. Select the documents to export. Do one of the following:
   - Set -collection_filter to a comma separated list of collection URIs.
   - Set -directory_filter to a comma separated list of directory URIs.
   - Select documents with an XPath expression using -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
   - Select documents with a query using -query_filter, alone or in combination with one of the other filter options. False positives are possible; for details, see Understanding When Filters Are Accurate.
   - To select all documents in the database, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
2. Set -output_file_path to the destination directory on the native filesystem or HDFS. This directory must not already exist.
3. Set -compress to true.
4. To prettyprint exported XML, set -indented to true.
For a full list of export options, see Export Command Line Options.
The zip files created by export have filenames of the form timestamp-seqnum.zip.
The following example exports all the documents in the database to the directory /space/examples/export on the native filesystem.
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local \
    -output_file_path /space/examples/export -compress true
$ ls /space/examples/export
20120823135307-0700-000000-XML.zip
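To check what a compressed export produced, you can list the entries of each ZIP file in the output directory. The following Python sketch is not part of mlcp; it uses only the standard zipfile module, assumes the output directory from the example above, and the function name list_exported_entries is hypothetical.

```python
import zipfile
from pathlib import Path


def list_exported_entries(export_dir: str) -> dict[str, list[str]]:
    """Map each .zip file name in export_dir to the entry names it contains."""
    entries = {}
    for zip_path in sorted(Path(export_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            entries[zip_path.name] = archive.namelist()
    return entries


if __name__ == "__main__":
    # Assumed location: the -output_file_path used in the example command.
    for name, members in list_exported_entries("/space/examples/export").items():
        print(name, len(members), "entries")
```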
Use the mlcp export command with an output type of archive to create a database archive that includes content and metadata. You can use the mlcp import command to copy the archive to another database or restore database contents.
To export database content to an archive file with mlcp:
1. Select the documents to export. Do one of the following:
   - Set -collection_filter to a comma separated list of collection URIs.
   - Set -directory_filter to a comma separated list of directory URIs.
   - Select documents with an XPath expression using -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
   - Select documents with a query using -query_filter, alone or in combination with one of the other filter options. False positives are possible; for details, see Understanding When Filters Are Accurate.
   - To select all documents in the database, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
2. Set -output_file_path to the destination directory on the native filesystem or HDFS. This directory must not already exist.
3. Set -output_type to archive.
4. To exclude selected metadata from the archive, set one or more of the following options:
   - Set -copy_collections to false to exclude document collections metadata.
   - Set -copy_permissions to false to exclude document permissions metadata.
   - Set -copy_properties to false to exclude document properties.
   - Set -copy_quality to false to exclude document quality metadata.
   - Set -copy_metadata to false to exclude document key-value metadata.
For a full list of export options, see Export Command Line Options.
The following example exports all documents and metadata to the directory /space/examples/exported. After export, the directory contains one or more compressed archive files.
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local \
    -output_file_path /space/examples/exported -output_type archive
The following example exports only documents in the database directory /plays/, including their collections, properties, and quality, but excluding permissions:
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local \
    -output_file_path /space/examples/exported -output_type archive \
    -copy_permissions false -directory_filter /plays/
You can use the mlcp import command to import an archive into a database. For details, see Loading Content and Metadata From an Archive.
This discussion only applies when -output_type is document.
When you export a document to a file (or to a file in a compressed file), the output file name is based on the document URI. The document URI is decoded to form the file name. For example, if the document URI is foo%20bar.xml, then the output file name is foo bar.xml.
If the document URI does not conform to the standard URI syntax of RFC 3986, decoding may fail, resulting in unexpected file names. For example, if the document URI contains unescaped special characters then the raw URI may be used.
If the document URI contains a scheme, the scheme is removed. If the URI contains both a scheme and an authority, both are removed. For example, if the document URI is file:foo/bar.xml, then the output file path is output_file_path/foo/bar.xml. If the document URI is http://marklogic.com/examples/bar.xml (contains a scheme and an authority), then the output file path is output_file_path/examples/bar.xml.
If the document URI includes directory steps, then corresponding output subdirectories are created. For example, if the document URI is /foo/bar.xml, then the output file path is output_file_path/foo/bar.xml.
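The rules above can be approximated in a few lines of Python. This is only a sketch of the described behavior, not mlcp's actual implementation: the function name export_path is hypothetical, and edge cases such as URIs that fail to decode are not modeled.

```python
from posixpath import join
from urllib.parse import unquote, urlsplit


def export_path(output_file_path: str, document_uri: str) -> str:
    """Approximate the output file path for a document URI."""
    parts = urlsplit(document_uri)
    # urlsplit separates out the scheme and authority; only the path is kept.
    decoded = unquote(parts.path)
    # Directory steps in the URI become subdirectories of the output path.
    return join(output_file_path, decoded.lstrip("/"))


for uri in (
    "foo%20bar.xml",
    "file:foo/bar.xml",
    "http://marklogic.com/examples/bar.xml",
    "/foo/bar.xml",
):
    print(uri, "->", export_path("output_file_path", uri))
```

The four URIs are the ones used in the examples above, so the printed paths can be compared directly with the text.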
By default, mlcp exports all documents or all documents and metadata in the database, depending on whether you are exporting in document or archive format or copying the database. Several command line options are available to enable customization. This section covers the following topics:
This section covers options available for filtering what is exported by the mlcp export command when -output_type is document.
By default, mlcp exports all documents in the database. That is, mlcp exports the equivalent of fn:collection(). The following options allow you to filter what is exported. The -directory_filter, -collection_filter, and -document_selector options are mutually exclusive; -query_filter can be used alone or combined with one of them.
-directory_filter - export only the documents in the listed database directories. You cannot use this option with -collection_filter or -document_selector.
-collection_filter - export only the documents in the listed collections. You cannot use this option with -directory_filter or -document_selector.
-document_selector - export only documents selected by the specified XPath expression. You cannot use this option with -directory_filter or -collection_filter. Use -path_namespace to define namespace prefixes.
-query_filter - export only documents matched by the specified cts query. You can use this option alone or in combination with a directory, collection or document selector filter. You can only use this filter with the export and copy commands. Results may not be accurate; for details, see Understanding When Filters Are Accurate.
When filtering with a document selector, the XPath filtering expression should select fragment roots only. An XPath expression that selects nodes below the root is very inefficient.
When using -document_selector to filter by XPath expression, you can define namespace prefixes using the -path_namespace option. For example:
-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2' -document_selector '/ex1:elem[ex2:attr > 10]'
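The combination rules for the filtering options can be summarized as a small check. The following Python sketch is illustrative only and not part of mlcp; the function name check_export_filters is hypothetical, and the option names are taken from the list above.

```python
EXCLUSIVE_FILTERS = ("-directory_filter", "-collection_filter", "-document_selector")


def check_export_filters(options: dict[str, str]) -> None:
    """Raise ValueError if more than one mutually exclusive filter is set."""
    used = [name for name in EXCLUSIVE_FILTERS if name in options]
    if len(used) > 1:
        raise ValueError("cannot combine " + " and ".join(used))
    # -query_filter is not checked: it may be used alone or together
    # with at most one of the filters above.


# A collection filter combined with a query filter is an allowed combination.
check_export_filters(
    {"-collection_filter": "classics", "-query_filter": "serialized-query"}
)
```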
This section covers options available for controlling what is exported by mlcp export when -output_type is archive, or what is copied by the mlcp copy command.
By default, all documents and metadata are exported/copied. The following options allow you to modify this behavior:
-directory_filter - export/copy only the documents in the listed database directories, including related metadata. You cannot use this option with -collection_filter or -document_selector.
-collection_filter - export/copy only the documents in the listed collections, including related metadata. You cannot use this option with -directory_filter or -document_selector.
-document_selector - export/copy only documents selected by the specified XPath expression. You cannot use this option with -directory_filter or -collection_filter. Use -path_namespace to define namespace prefixes.
-query_filter - export/copy only documents matched by the specified cts query. You can use this option alone or in combination with a directory, collection or document selector filter. Results may not be accurate; for details, see Understanding When Filters Are Accurate.
-copy_collections - whether to include collection metadata
-copy_permissions - whether to include permissions metadata
-copy_properties - whether to include naked and document properties
-copy_quality - whether to include document quality metadata
-copy_metadata - whether to include document key-value metadata
If you set all the -copy_* options to false when exporting to an archive, the archive contains no metadata. When you import an archive with no metadata, you must set -archive_metadata_optional to true.
When filtering with a document selector, the XPath filtering expression should select fragment roots only. An XPath expression that selects nodes below the root is very inefficient.
When using -document_selector to filter by XPath expression, you can define namespace prefixes using the -path_namespace option. For example:
-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2' -document_selector '/ex1:elem[ex2:attr > 10]'
When you use -directory_filter, -collection_filter, or -document_selector without -query_filter, the set of documents selected by mlcp exactly matches your filtering criteria.
The query you supply with -query_filter is used in an unfiltered search, which means there can be false positives among the selected documents. When you combine -query_filter with -directory_filter, -collection_filter, or -document_selector, mlcp might select documents that do not meet your directory, collection, or path filter criteria.
The interaction between -query_filter and the other filtering options is similar to the following. In this example, the search can match documents that are not in the parts collection.
-collection_filter parts -query_filter yourSerializedQuery
==> selects the documents to export similar to the following:
cts:search(
  fn:collection("parts"),
  yourQuery,
  ("unfiltered"))
For a complete example using -query_filter, see Example: Exporting Documents Matching a Query.
To learn more about the implications of unfiltered searches, see Fast Pagination and Unfiltered Searches in the Query Performance and Tuning Guide.
This example demonstrates how to use -query_filter to select documents for export. You can apply the same technique to filtering the source documents when copying documents from one database to another.
The -query_filter option accepts a serialized XML cts:query or JSON cts.query as its value. For example, the following table shows the serialization of a cts word query, prettyprinted for readability:
Format | Example |
---|---|
XML | <cts:word-query xmlns:cts="http://marklogic.com/cts"> <cts:text xml:lang="en">mark</cts:text> </cts:word-query> |
JSON | {"wordQuery":{ "text":["huck"], "options":["lang=en"] }} |
For details on how to obtain the serialized representation of a cts query, see Serializations of cts:query Constructors in the Search Developer's Guide.
Using an options file is recommended when using -query_filter because both XML and JSON serialized queries contain quotes and other characters that have special meaning to the Unix and Windows command shells, making it challenging to properly escape the query. If you use -query_filter on the command line, you must quote the serialized query and may need to do additional special character escaping.
For example, you can create an options file similar to the following. It should contain at least 2 lines: One for the option name and one for the serialized query. You can include other options in the file. For details, see Options File Syntax.
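As a sketch of the two-line layout just described, the following Python snippet writes the option name on one line and a serialized JSON word query for the word huck on the next, so that no shell escaping is involved. The snippet is an illustration, not part of mlcp: the helper name write_query_filter_options is hypothetical, and the query mirrors the JSON serialization shown in the table above.

```python
import json


def write_query_filter_options(path: str, query: dict) -> None:
    """Write a two-line options file: the option name, then the serialized query."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("-query_filter\n")
        # Compact, single-line JSON keeps the whole option value on one line.
        f.write(json.dumps(query, separators=(",", ":")) + "\n")


word_query = {"wordQuery": {"text": ["huck"], "options": ["lang=en"]}}
write_query_filter_options("query_filter.txt", word_query)
```

Generating the file from a data structure avoids hand-escaping quotes; the resulting file can then be passed with -options_file.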
If you save the above option in a file named query_filter.txt, then the following mlcp command exports files from the database that contain the word huck:
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local -output_file_path \
    /space/mlcp/export/files -options_file query_filter.txt
You can combine -query_filter with another filtering option. For example, the following command combines the query with a collection filter. The command exports only documents containing the word huck in the collection named classics:
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local -output_file_path \
    /space/mlcp/export/files -options_file query_filter.txt \
    -collection_filter classics
The documents selected by -query_filter can include false positives, including documents that do not match other filter criteria. For details, see Understanding When Filters Are Accurate.
The following example demonstrates generating a serialized XML cts:and-query or JSON cts.andQuery using the wrapper technique. Copy either example into Query Console, select the appropriate query type, and run it to see the output.
Notice that in the XML example, the xdmp:quote indent option is used to disable XML prettyprinting, making the output better suited for inclusion on the mlcp command line:
xdmp:quote( <query>{$query}</query>/*, <options xmlns="xdmp:quote"><indent>no</indent></options> )
Notice that in the JavaScript example, it is necessary to call toObject on the wrapped query to get the proper JSON serialization. Using toObject converts the value to a JavaScript object which xdmp.quote will serialize as JSON.
xdmp.quote(wrapper.query.toObject())
If you want to test your serialized query before using it with mlcp, you can round-trip your XML query with cts:search in XQuery or your JSON query with cts.search or the JSearch API in Server-Side JavaScript, as shown in the following examples.
Note that xdmp:unquote returns a document node in XQuery, so you need to use XPath to address the underlying query element root node when reconstructing the query:
cts:query(xdmp:unquote($q)/*[1])
Similarly, xdmp.unquote in JavaScript returns a Sequence of document nodes, so you must dereference both the iterator and the document node when reconstructing the query:
cts.query(fn.head(xdmp.unquote(serializedQ)).root)
This section covers options available for filtering what is extracted from a forest when you use Direct Access. That is, when you use the mlcp import command with -input_file_type forest or the mlcp extract command.
By default, mlcp extracts all documents in the input forests. That is, mlcp extracts the equivalent of fn:collection(). The following options allow you to filter what is extracted from a forest with Direct Access. These options can be combined.
-type_filter: Extract only documents with the listed content type (text, XML, or binary).
-directory_filter: Extract only the documents in the listed database directories.
-collection_filter: Extract only the documents in the listed collections.
For example, the following combination of options extracts only XML documents in the collections named 2004 or 2005.
mlcp.sh extract -type_filter xml -collection_filter "2004,2005" ...
Similarly, the following options import only binary documents in the source database directory /images/:
mlcp.sh import -input_file_type forest \
    -type_filter binary -directory_filter /images/
When you use Direct Access, filtering is performed in the process that reads the forest files rather than being performed by MarkLogic Server. For example, in local mode, filters are applied by mlcp on the host where you run it; in distributed mode, filters are applied by each Hadoop task that reads in forest data.
In addition, filtering cannot be applied until after a document is read from the forest. When you import or extract files from a forest file, mlcp must touch every document in the forest.
For details, see Using Direct Access to Extract or Copy Documents.
By default, when you export or copy database contents, content is extracted from the source database at multiple points in time. You get whatever is in the database when mlcp accesses a given document. If the database contents are changing while the job runs, the results are not deterministic relative to the starting time of the job. For example, if a new document is inserted into the database while an export job is running, it might or might not be included in the export.
If you require a consistent snapshot of the database contents during an export or copy, use the -snapshot option to force all documents to be read from the database at a consistent point in time. The submission time of the job is used as the timestamp. Any changes to the database occurring after this time are not reflected in the output.
If a merge occurs while exporting or copying a consistent snapshot, and the merge eliminates a fragment that is subsequently accessed by the mlcp job, you may get an XDMP-OLDSTAMP error. If this occurs, the documents included in the same batch or task may not be included in the export/copy result. If the source database is on MarkLogic Server 7 or later, you may be able to work around this problem by setting the merge timestamp to retain fragments for a time period longer than the expected running time of the job; for details, see Understanding and Controlling Database Merges in the Administrator's Guide.
Redaction is the process of eliminating or obscuring portions of a document when retrieving the document from MarkLogic. For example, you can eliminate or mask sensitive personal information such as credit card numbers, phone numbers, or email addresses from documents. You can only redact document content, not document properties.
Using redaction requires the Advanced Security License option.
Redaction support in MarkLogic is covered in detail in Redacting Document Content in the Application Developer's Guide. This section describes how to use mlcp as the redaction driver. This section includes the following topics:
Use the -redaction option of mlcp to apply redaction rules to an export or copy operation. This option accepts a comma-separated list of redaction rule collection URIs. For example:
-redaction "pii-rules,sec-rules"
Before you can use redaction, you must install one or more redaction rule sets in the Schemas database. For details on defining and installing redaction rules, see Redacting Document Content in the Application Developer's Guide.
Preparing to redact documents with mlcp requires the following steps. For a complete example, see Example: Using mlcp for Redaction.
Add the -redaction option to your mlcp command line. For example, the following command applies the rules in the collections pii-rules and sec-rules to all exported documents.
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local -output_file_path \
    /space/mlcp/export/files -directory_filter /people/ \
    -redaction "pii-rules,sec-rules"
The -redaction option works similarly for copy operations. For details, see Redacting Content During a Copy.
The user who extracts redacted documents must have read permissions on the source documents and the rules, but need not be able to modify the rule collection or rule definitions. For details, see Security Considerations in the Application Developer's Guide.
The following behaviors apply when exceptional conditions occur. You should be aware of these behaviors so you understand when content might not be redacted as expected:
This example walks you through using mlcp to install and apply redaction rules based on the built-in redaction functions. For a similar example using XQuery and Query Console, see Example: Getting Started With Redaction in the Application Developer's Guide.
The example has the following parts:
This example uses rules based on built-in redaction functions. For an example of using user-defined redaction functions, see User-Defined Redaction Functions in the Application Developer's Guide.
This example assumes the following directory hierarchy:
redact-gs/
  data/
  rules/
The data/ directory will hold the source documents. The rules/ directory will hold redaction rules. The example walks you through populating these directories and uploading the contents to MarkLogic using mlcp in preparation for exporting a set of redacted documents with mlcp.
Create the required directories on Linux by running the following command in a location of your choosing:
$ mkdir -p redact-gs/data redact-gs/rules
Create the required directories on Windows by running the following command in a location of your choice:
>mkdir redact-gs\data redact-gs\rules
When you complete this exercise, the Documents database should contain the following documents. The documents are inserted into a collection named gs-samples for easy reference.
Follow the steps in this procedure to install two sample documents in the Documents database.
1. Go to your redact-gs/data directory.
2. Create a file named sample1.xml with the following contents:
<personal>
  <name>Little Bopeep</name>
  <summary>Seeking lost sheep. Please call 123-456-7890.</summary>
  <id>12-3456789</id>
</personal>
3. Create a file named sample2.json with the following contents:
{"personal": {
  "name": "Jack Sprat",
  "summary": "Free nutrition advice! Call (234)567-8901 now!",
  "id": "45-6789123"
}}
4. Run an mlcp command similar to the following to load the sample documents into the Documents database. Modify the connection details as needed to match your environment.
$ mlcp.sh import -host localhost -port 8000 \
    -username user -password password -mode local \
    -input_file_path . \
    -output_uri_replace ".*/redact-gs/data/,'/redact-gs/'" \
    -output_collections "gs-samples"
You can use Query Console to explore the Documents database and confirm the upload.
The use of -output_uri_replace on the import command line replaces the portion of the default URI that is based on the filesystem location with the fixed directory prefix /redact-gs/. For more details, see Controlling Database URIs During Ingestion.
Rules must be installed in the schemas database associated with your content database. Rules must also be part of a collection before you can use them. This section installs rules in the Schemas database, which is the default schemas database associated with the Documents database.
When you complete this exercise, the Schemas database should contain the following documents. The documents are inserted into a rule collection named gs-rules. Rules must be in a rule collection before you can apply them.
The rules installed in this step use the redact-us-phone and conceal built-in redaction functions. For details on these and other built-in redaction functions, see Built-in Redaction Function Reference in the Application Developer's Guide.
Follow the steps in this procedure to install two sample rules in the Schemas database. For an explanation of what the rules do, see Understanding the Example Rules.
1. Go to the rules directory you created in Creating a Work Area. You should be in your redact-gs/rules directory.
2. Create an XML file in this directory with the following contents:
<rule xml:lang="zxx" xmlns="http://marklogic.com/xdmp/redaction">
  <description>Obscure phone numbers.</description>
  <path>//summary</path>
  <method>
    <function>redact-us-phone</function>
  </method>
  <options>
    <level>partial</level>
  </options>
</rule>
3. Create a JSON file in this directory with the following contents:
{ "rule": {
  "description": "Remove customer ids.",
  "path": "//id",
  "method": { "function": "conceal" }
}}
4. Run an mlcp command similar to the following to load the rules into the Schemas database and add them to the gs-rules collection. Modify the connection details as needed to match your environment.
$ mlcp.sh import -host localhost -port 8000 \
    -username user -password password -mode local \
    -database Schemas -input_file_path . \
    -output_uri_replace ".*/redact-gs/rules/,'/rules/gs/'" \
    -output_collections "gs-rules"
You can use Query Console to explore the Schemas database and confirm the upload.
The use of -output_uri_replace on the import command line replaces the portion of the default URI that is based on the filesystem location with the fixed directory prefix /rules/gs/. For more details, see Controlling Database URIs During Ingestion.
The XML rule installed in Installing the Redaction Rules has the following form:
<rule xml:lang="zxx" xmlns="http://marklogic.com/xdmp/redaction">
  <description>Obscure phone numbers.</description>
  <path>//summary</path>
  <method>
    <function>redact-us-phone</function>
  </method>
  <options>
    <level>partial</level>
  </options>
</rule>
The rule elements have the following effect:
description - Optional metadata for informational purposes.
path - Apply the redaction function specified by the rule to nodes selected by the path expression //summary.
method - Use the built-in redaction function redact-us-phone to redact the value in a summary XML element or JSON property. By default, this function replaces all digits in a phone number by the character #. You can tell this is a built-in function because method has no module child.
options - Pass a level parameter value of partial to redact-us-phone, causing the function to leave the last 4 digits of the value unchanged.
The expected result of applying this rule is that any text in the value of a node named summary that matches the pattern of a US phone number will be replaced. The replacement value uses the # character to replace all but the last 4 digits. For example, a value such as 123-456-7890 is redacted to ###-###-7890. For more details, see redact-us-phone in the Application Developer's Guide.
The JSON rule installed in Installing the Redaction Rules has the following form:
{ "rule": {
  "description": "Remove customer ids.",
  "path": "//id",
  "method": { "function": "conceal" }
}}
The rule properties have the following effect:
description - Optional metadata for informational purposes.
path - Apply the redaction function specified by the rule to nodes selected by the path expression //id.
method - Use the built-in redaction function conceal to redact the id XML element or JSON property. This function will hide the nodes selected by path. You can tell this is a built-in function because method has no module child.
The expected result of applying this rule is to remove nodes named id. For example, if //id selects an XML element or JSON property, the element or property does not appear in the redacted output. Note that, if //id selects array items in JSON, the items are eliminated, but the id property might remain, depending on the structure of the document. For more details, see conceal in the Application Developer's Guide.
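To make the combined effect of the two example rules concrete, the following Python sketch imitates them on a dictionary shaped like the sample documents. It only illustrates the behavior described above and is not MarkLogic's redaction implementation: the function names are hypothetical, and the phone pattern is a simplifying assumption that handles only the NNN-NNN-NNNN form used in the XML sample.

```python
import re

# Assumed, simplified pattern: matches only the NNN-NNN-NNNN form.
PHONE = re.compile(r"\b(\d{3})-(\d{3})-(\d{4})\b")


def mask_phone_partial(text: str) -> str:
    """Replace all but the last 4 digits of each matched phone number with #."""
    return PHONE.sub(lambda m: "###-###-" + m.group(3), text)


def apply_example_rules(doc: dict) -> dict:
    """Mimic the two example rules: mask phones in summary, drop id."""
    redacted = {key: value for key, value in doc.items() if key != "id"}
    if "summary" in redacted:
        redacted["summary"] = mask_phone_partial(redacted["summary"])
    return redacted


sample = {
    "name": "Little Bopeep",
    "summary": "Seeking lost sheep. Please call 123-456-7890.",
    "id": "12-3456789",
}
print(apply_example_rules(sample))
```

In this sketch the name value is untouched, the phone number in summary keeps only its last 4 digits, and the id entry is absent from the result.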
Run the following command from your redact-gs/ directory to export redacted versions of the sample documents. Modify the connection details as needed to match your environment. A collection filter (-collection_filter "gs-samples") is used to select the documents for redaction and export.
$ mlcp.sh export -host localhost -port 8000 \
    -username user -password password -mode local \
    -collection_filter "gs-samples" \
    -output_file_path ./output/ \
    -redaction "gs-rules"
Running the export command saves the redacted documents to an output/ sub-directory. You should have the following filesystem hierarchy. The extra redact-gs sub-directory is created by mlcp because the document URIs are of the form /redact-gs/filename.
redact-gs/
  output/
    redact-gs/
      sample1.xml
      sample2.json
The following table shows the result of redacting the XML sample document. Notice that the telephone number in the summary element has been partially redacted by the redact-us-phone function. Also, the id element has been completely hidden by the conceal function. The affected parts of the content are highlighted in the table.
The following table shows the result of redacting the JSON sample document. Notice that the telephone number in the summary property has been partially redacted by the redact-us-phone function. Also, the id property has been completely hidden by the conceal function. The affected parts of the content are highlighted in the table.
To redact documents when copying them between databases rather than exporting them, add the -redaction option to the mlcp copy command line.
The mlcp tool uses the MarkLogic Connector for Hadoop to distribute work across your MarkLogic cluster, even when run in local mode. When you use the mlcp export command, MarkLogic Server acts as an input source for a Hadoop MapReduce job. The exported documents are the output of the job. You can take low level control of the job by setting connector and Hadoop configuration properties.
Setting low level configuration properties is an advanced technique. You should understand how to use the MarkLogic Connector for Hadoop before attempting this.
The following list describes some use cases in which you might choose to set low level configuration properties:
Similar use cases and techniques apply to copy operations. For details, see Advanced Document Selection for Copy.
The following table lists some connector and Hadoop configuration properties relevant to advanced configuration for export.
You can pass a connector configuration file through mlcp with the -conf option. The -conf option must appear after -options_file (if present) and before any other mlcp options. The following example command demonstrates the -conf option.
$ mlcp.sh export -conf conf.xml -host localhost -port 8000 \
    -username user -password password -mode local \
    -output_file_path /space/examples/exported \
    -directory_filter /binaries/
The following example connector configuration file uses an XQuery split query (mapreduce.marklogic.input.splitquery) to distribute the documents across export tasks, and an XQuery transformation query (mapreduce.marklogic.input.query) that returns just the first 1000 bytes of each selected binary document.
<property>
  <name>mapreduce.marklogic.input.query</name>
  <value><![CDATA[
    xquery version "1.0-ml";
    declare namespace mlmr="http://marklogic.com/hadoop";
    declare variable $mlmr:splitstart as xs:integer external;
    declare variable $mlmr:splitend as xs:integer external;
    for $doc in fn:doc()[$mlmr:splitstart to $mlmr:splitend]
    return xdmp:subbinary($doc/binary(), 1, 1000)
  ]]></value>
</property>
<property>
  <name>mapreduce.marklogic.input.splitquery</name>
  <value><![CDATA[
    xquery version "1.0-ml";
    import module namespace hadoop = "http://marklogic.com/xdmp/hadoop"
      at "/MarkLogic/hadoop.xqy";
    hadoop:get-splits('', 'fn:doc()', '()')
  ]]></value>
</property>
<property>
  <name>mapreduce.marklogic.input.mode</name>
  <value>advanced</value>
</property>
For more details and examples, see the MarkLogic Connector for Hadoop Developer's Guide.
This section summarizes the command line options available with the mlcp export command. The following command line options define your connection to MarkLogic:
Option | Description |
---|---|
-host comma-list | Required. A comma-separated list of hosts through which mlcp can connect to the source MarkLogic Server. You must specify at least one host. For more details, see How mlcp Uses the Host List. |
-port number | Port number of the source MarkLogic Server. There should be an XDBC App Server on this port. Default: 8000. |
-username string | MarkLogic Server user with which to export documents. Required, unless using Kerberos authentication. |
-password string | Password for the MarkLogic Server user specified with -username. Required, unless using Kerberos authentication. |
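The requirements in this table can be checked before mlcp is launched. The following Python sketch builds the connection options from a list of hosts and a set of credentials. The function connection_args is hypothetical, and its kerberos parameter is only a local flag that relaxes the credential check; it does not correspond to any mlcp option.

```python
def connection_args(hosts, username=None, password=None, port=8000, kerberos=False):
    """Build the connection options for an mlcp command (illustrative only)."""
    if not hosts:
        raise ValueError("-host requires at least one host")
    if not kerberos and (username is None or password is None):
        raise ValueError("-username and -password are required unless using Kerberos")
    # -host takes a comma-separated list; -port defaults to 8000.
    args = ["-host", ",".join(hosts), "-port", str(port)]
    if username is not None:
        args += ["-username", username]
    if password is not None:
        args += ["-password", password]
    return args


print(connection_args(["host1", "host2"], username="user", password="password"))
```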
The following table lists command line options that define the characteristics of the export operation:
Option | Description |
---|---|
-collection_filter comma-list | A comma-separated list of collection URIs. mlcp exports only documents in these collections, plus related metadata. This option may not be combined with -directory_filter or -document_selector. Default: All documents and related metadata. |
-compress boolean | Whether or not to compress the output document. Only applicable when -output_type is document. Default: false. |
-conf filename | Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options. |
-content_encoding string | The character encoding of output documents when -output_type is document. The option value must be a character set name accepted by your JVM; see java.nio.charset.Charset. Default: UTF-8. Set to system to use the platform default encoding for the host on which mlcp runs. |
-copy_collections boolean | When exporting documents to an archive, whether or not to copy collections to the destination. Default: true. |
-copy_metadata boolean | When exporting documents to an archive, whether or not to copy key-value metadata to the destination. Default: true. |
-copy_permissions boolean | When exporting documents to an archive, whether or not to copy document permissions to the destination. Default: true. |
-copy_properties boolean | When exporting documents to an archive, whether or not to copy properties to the destination. Default: true. |
-copy_quality boolean | When exporting documents to an archive, whether or not to copy document quality to the destination. Default: true. |
-D property=value | Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options. |
-database string | The name of the source database. Default: The database associated with the source App Server identified by -host and -port. |
-directory_filter comma-list | A comma-separated list of database directory names. mlcp exports only documents from these directories, plus related metadata. Directory names should usually end with /. This option may not be combined with -collection_filter or -document_selector. Default: All documents and related metadata. |
-document_selector string | Specifies an XPath expression used to select which documents are exported from the database. The XPath expression should select fragment roots. This option may not be combined with -directory_filter or -collection_filter. Default: All documents and related metadata. |
-hadoop_conf_dir string | When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode. |
-indented boolean | Whether to pretty-print XML output. Default: false. |
-max_split_size number | The maximum number of document fragments processed per split. Default: 20000 in local mode, 50000 in distributed mode. |
-mode string | Export mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local, unless you set the HADOOP_CONF_DIR variable; for details, see Configuring Distributed Mode. |
-options_file string | Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax. |
-output_file_path string | Destination directory where the archive or documents are saved. The directory must not already exist. |
-output_type string | The type of output to produce. Accepted values: document, archive. Default: document. |
-path_namespace comma-list | Specifies one or more namespace prefix bindings for namespace prefixes usable in path expressions passed to -document_selector. The list items should be alternating pairs of prefix names and namespace URIs, such as 'pfx1,http://my/ns1,pfx2,http://my/ns2'. |
-query_filter string | Specifies a query to apply when selecting documents for export. The argument must be the XML serialization of a cts:query or JSON serialization of a cts.query. Only documents matching the query are considered for export; false positives are possible. For details, see Controlling What is Exported, Copied, or Extracted. |
-redaction comma-list | Apply one or more redaction rule collections. The argument must be a comma-separated list of rule collection URIs. The rule collections must be installed in the schemas database. For details and examples, see Redacting Content During Export or Copy Operations and Redacting Document Content in the Application Developer's Guide. |
-restrict_hosts boolean | Restrict mlcp to connect to MarkLogic only through the hosts listed in the -host option. Default: false (no restriction). For more details, see Restricting the Hosts mlcp Uses to Connect to MarkLogic. |
-snapshot boolean | Whether or not to export a consistent point-in-time snapshot of the database contents. Default: false. When true, the job submission time is used as the database read timestamp for selecting documents to export. For details, see Extracting a Consistent Database Snapshot. |
-ssl boolean | Enable/disable SSL secured communication with MarkLogic. Default: false. If you set this option to true, your App Server must be SSL enabled. For details, see Connecting to MarkLogic Using SSL. |
-ssl_protocol string | Specify the protocol mlcp should use when creating an SSL connection to MarkLogic. You must include this option if you use the -ssl option to connect to an App Server configured to disable MarkLogic's default protocol (TLS). Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: tls. This option is ignored if you use a Hadoop Connector conf file for SSL configuration; for details, see Advanced SSL Configuration. |
-thread_count number | The number of threads to spawn for concurrent exporting. The total number of threads spawned by the process can be larger than this number, but this option caps the number of concurrent sessions with MarkLogic Server. Only available in local mode. Default: 4. |
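Several rows in this table describe constraints between options. The following Python sketch collects a few of them into a pre-flight check that a wrapper script might run before invoking mlcp. The function check_export_options is hypothetical and covers only the constraints shown; it assumes the mode is local when -mode is not given, which may not hold if HADOOP_CONF_DIR is set, and it assumes namespace URIs contain no commas.

```python
SELECTORS = ("collection_filter", "directory_filter", "document_selector")


def check_export_options(options):
    """Return a list of problems found in a dict of export options.

    Keys are option names without the leading dash. Illustrative only;
    this function is not part of mlcp.
    """
    problems = []
    # The three selection options may not be combined with each other.
    used = [name for name in SELECTORS if name in options]
    if len(used) > 1:
        problems.append("cannot combine: " + ", ".join("-" + n for n in used))
    # Directory names should usually end with "/".
    for directory in options.get("directory_filter", "").split(","):
        if directory and not directory.endswith("/"):
            problems.append("directory should end with '/': " + directory)
    # -path_namespace takes alternating prefix and namespace URI items.
    bindings = options.get("path_namespace", "")
    if bindings and len(bindings.split(",")) % 2 != 0:
        problems.append("-path_namespace needs prefix,URI pairs")
    # -thread_count is only available in local mode (assumed default).
    if "thread_count" in options and options.get("mode", "local") != "local":
        problems.append("-thread_count is only available in local mode")
    return problems


for problem in check_export_options({
    "collection_filter": "reports",
    "directory_filter": "/docs",
    "mode": "distributed",
    "thread_count": 8,
}):
    print(problem)
```

Returning a list rather than raising on the first problem lets the caller report every issue at once. The directory check is advisory, because the table says only that names should usually end with /.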