MarkLogic 9 Product Documentation
mlcp User Guide
— Chapter 5

Exporting Content from MarkLogic Server

You can export content in a MarkLogic Server database to files or an archive. Use archives to copy content from one MarkLogic Server database to another. Output can be written to the native filesystem or to HDFS.

For a list of export related command line options, see Export Command Line Options.

You can also use mlcp to extract documents directly from offline forests. For details, see Using Direct Access to Extract or Copy Documents.

Exporting Documents as Files

Use the mlcp export command to export documents in their original format as files on the native filesystem or HDFS. For example, you can export an XML document as a text file containing XML, or a binary document as a JPG image.

To export documents from a database as files:

  1. Select the files to export. For details, see Filtering Document Exports.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select documents matching an XPath expression, use -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
    • To select documents matching a query, use -query_filter, alone or in combination with one of the other filter options. False positives are possible; for details, see Understanding When Filters Are Accurate.
    • To select all documents in the database, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
  2. Set -output_file_path to the destination file or directory on the native filesystem or HDFS.
  3. To prettyprint exported XML when using local mode, set -indented to true.

Directory names specified with -directory_filter should end with /.

When using -document_selector to filter by XPath expression, you can define namespace prefixes using the -path_namespace option. For example:

-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2'
-document_selector '/ex1:elem[ex2:attr > 10]'

Document URIs are URI-decoded before filesystem directories or filenames are constructed for them. For details, see How URI Decoding Affects Output File Names.

For a full list of export options, see Export Command Line Options.

The following example exports selected documents in the database to the native filesystem directory /space/mlcp/export/files. The directory filter selects only the documents in /plays.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local -output_file_path \
    /space/mlcp/export/files -directory_filter /plays/

Exporting Documents to a Compressed File

Use the mlcp export command to export documents in their original format as files in a compressed ZIP file on the native filesystem or HDFS.

To export documents from a database to a compressed file:

  1. Select the files to export. For details, see Filtering Document Exports.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select documents matching an XPath expression, use -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
    • To select documents matching a query, use -query_filter, alone or in combination with one of the other filter options. False positives are possible; for details, see Understanding When Filters Are Accurate.
    • To select all documents in the database, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
  2. Set -output_file_path to the destination directory on the native filesystem or HDFS. This directory must not already exist.
  3. Set -compress to true.
  4. To prettyprint exported XML when using local mode, set -indented to true.

For a full list of export options, see Export Command Line Options.

The zip files created by export have filenames of the form timestamp-seqnum.zip.

The following example exports all the documents in the database to the directory /space/examples/export on the native filesystem.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local \
    -output_file_path /space/examples/export -compress true
$ ls /space/examples/export
20120823135307-0700-000000-XML.zip

Exporting to an Archive

Use the mlcp export command with an output type of archive to create a database archive that includes content and metadata. You can use the mlcp import command to copy the archive to another database or restore database contents.

To export database content to an archive file with mlcp:

  1. Select the documents to export. For details, see Filtering Archive and Copy Contents.
    • To select documents in one or more collections, set -collection_filter to a comma separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma separated list of directory URIs.
    • To select documents matching an XPath expression, use -document_selector. To use namespace prefixes in the XPath expression, define the prefix binding using -path_namespace.
    • To select documents matching a query, use -query_filter, alone or in combination with one of the other filter options. False positives are possible; for details, see Understanding When Filters Are Accurate.
    • To select all documents in the database, leave -collection_filter, -directory_filter, -document_selector, and -query_filter unset.
  2. Set -output_file_path to the destination directory on the native filesystem or HDFS. This directory must not already exist.
  3. Set -output_type to archive.
  4. If you want to exclude some or all document metadata from the archive:
    • Set -copy_collections to false to exclude document collections metadata.
    • Set -copy_permissions to false to exclude document permissions metadata.
    • Set -copy_properties to false to exclude document properties.
    • Set -copy_quality to false to exclude document quality metadata.
    • Set -copy_metadata to false to exclude document key-value metadata.

For a full list of export options, see Export Command Line Options.

The following example exports all documents and metadata to the directory /space/examples/exported. After export, the directory contains one or more compressed archive files.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local \
    -output_file_path /space/examples/exported -output_type archive

The following example exports only documents in the database directory /plays/, including their collections, properties, and quality, but excluding permissions:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local \
    -output_file_path /space/examples/exported -output_type archive \
    -copy_permissions false -directory_filter /plays/

You can use the mlcp import command to import an archive into a database. For details, see Loading Content and Metadata From an Archive.

How URI Decoding Affects Output File Names

This discussion only applies when -output_type is document.

When you export a document to a file (or to a file in a compressed file), the output file name is based on the document URI. The document URI is decoded to form the file name. For example, if the document URI is foo%20bar.xml, then the output file name is foo bar.xml.

If the document URI does not conform to the standard URI syntax of RFC 3986, decoding may fail, resulting in unexpected file names. For example, if the document URI contains unescaped special characters then the raw URI may be used.

If the document URI contains a scheme, the scheme is removed. If the URI contains both a scheme and an authority, both are removed. For example, if the document URI is file:foo/bar.xml, then the output file path is output_file_path/foo/bar.xml. If the document URI is http://marklogic.com/examples/bar.xml (contains a scheme and an authority), then the output file path is output_file_path/examples/bar.xml.

If the document URI includes directory steps, then corresponding output subdirectories are created. For example, if the document URI is /foo/bar.xml, then the output file path is output_file_path/foo/bar.xml.
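
The mapping described above can be approximated with a short shell sketch. The helper names url_decode and uri_to_path are hypothetical, and the logic is only a reading of the rules in this section, not mlcp's implementation. In particular, it does not reproduce what mlcp does when a URI fails to decode.

```shell
# Hypothetical helpers that approximate the documented URI-to-file-name rules.

# Decode %XX escapes (for example, %20 becomes a space).
url_decode() {
  rest=$1
  decoded=
  while [ -n "$rest" ]; do
    case $rest in
      %[0-9A-Fa-f][0-9A-Fa-f]*)
        hex=${rest#%}
        hex=${hex%"${hex#??}"}              # keep only the two hex digits
        decoded=$decoded$(printf '%b' "\\0$(printf '%o' "0x$hex")")
        rest=${rest#%??}
        ;;
      *)
        decoded=$decoded${rest%"${rest#?}"} # copy one literal character
        rest=${rest#?}
        ;;
    esac
  done
  printf '%s\n' "$decoded"
}

# Print the output path for a document URI under a given output directory.
uri_to_path() {
  out_dir=$1
  uri=$2
  scheme=${uri%%:*}
  if [ "$scheme" != "$uri" ]; then
    case $scheme in
      ''|*[!A-Za-z0-9+.-]*) ;;              # the colon is not a scheme separator
      *)
        uri=${uri#*:}                       # drop "scheme:"
        case $uri in
          //*) uri=${uri#//}; uri=/${uri#*/} ;;  # drop "//authority"
        esac
        ;;
    esac
  fi
  printf '%s/%s\n' "${out_dir%/}" "$(url_decode "${uri#/}")"
}

uri_to_path output_file_path 'foo%20bar.xml'
uri_to_path output_file_path 'file:foo/bar.xml'
uri_to_path output_file_path 'http://marklogic.com/examples/bar.xml'
uri_to_path output_file_path '/foo/bar.xml'
```

The four calls correspond to the four example URIs in this section, so you can compare the printed paths with the paths given in the text.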

Controlling What is Exported, Copied, or Extracted

By default, mlcp exports every document in the database, or every document plus its metadata, depending on whether you export in document format, export in archive format, or copy the database. Several command line options let you customize this behavior, as described in the sections that follow.

Filtering Document Exports

This section covers options available for filtering what is exported by the mlcp export command when -output_type is document.

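By default, mlcp exports all documents in the database. That is, mlcp exports the equivalent of fn:collection(). The following options allow you to filter what is exported. The -directory_filter, -collection_filter, and -document_selector options are mutually exclusive; -query_filter can be used alone or combined with any one of them.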

  • -directory_filter - export only the documents in the listed database directories. You cannot use this option with -collection_filter or -document_selector.
  • -collection_filter - export only the documents in the listed collections. You cannot use this option with -directory_filter or -document_selector.
  • -document_selector - export only documents selected by the specified XPath expression. You cannot use this option with -directory_filter or -collection_filter. Use -path_namespace to define namespace prefixes.
  • -query_filter - export only documents matched by the specified cts query. You can use this option alone or in combination with a directory, collection or document selector filter. You can only use this filter with the export and copy commands. Results may not be accurate; for details, see Understanding When Filters Are Accurate.

    When filtering with a document selector, the XPath filtering expression should select fragment roots only. An XPath expression that selects nodes below the root is very inefficient.

When using -document_selector to filter by XPath expression, you can define namespace prefixes using the -path_namespace option. For example:

-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2'
-document_selector '/ex1:elem[ex2:attr > 10]'
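
Because the XPath expression contains characters that are special to the shell, one way to avoid quoting problems is to put both options in an options file. The following is a sketch only: selector_options.txt is a hypothetical file name, the connection details and output path are the placeholders used elsewhere in this chapter, and it assumes an options file accepts the unquoted XPath on its own line in the same way it accepts a serialized query. The export runs only if mlcp.sh is on your PATH.

```shell
# Write one option name or value per line; the quoted 'EOF' stops the shell
# from expanding anything inside the here-document.
cat > selector_options.txt <<'EOF'
-path_namespace
ex1,http://marklogic.com/example,ex2,http://my/ex2
-document_selector
/ex1:elem[ex2:attr > 10]
EOF

# Run the export only when mlcp.sh is installed.
if command -v mlcp.sh >/dev/null 2>&1; then
  mlcp.sh export -host localhost -port 8000 -username user \
      -password password -mode local \
      -output_file_path /space/mlcp/export/files \
      -options_file selector_options.txt
fi
```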

Filtering Archive and Copy Contents

This section covers options available for controlling what is exported by mlcp export when -output_type is archive, or what is copied by the mlcp copy command.

By default, all documents and metadata are exported/copied. The following options allow you to modify this behavior:

  • -directory_filter - export/copy only the documents in the listed database directories, including related metadata. You cannot use this option with -collection_filter or -document_selector.
  • -collection_filter - export/copy only the documents in the listed collections, including related metadata. You cannot use this option with -directory_filter or -document_selector.
  • -document_selector - export/copy only documents selected by the specified XPath expression. You cannot use this option with -directory_filter or -collection_filter. Use -path_namespace to define namespace prefixes.
  • -query_filter - export/copy only documents matched by the specified cts query. You can use this option alone or in combination with a directory, collection or document selector filter. Results may not be accurate; for details, see Understanding When Filters Are Accurate.
  • -copy_collections - whether to include collection metadata
  • -copy_permissions - whether to include permissions metadata
  • -copy_properties - whether to include naked and document properties
  • -copy_quality - whether to include document quality metadata
  • -copy_metadata - whether to include document key-value metadata

If you set all the -copy_* options to false when exporting to an archive, the archive contains no metadata. When you import an archive with no metadata, you must set -archive_metadata_optional to true.
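
As a sketch of that round trip, the following exports an archive with every -copy_* option set to false and then imports it with -archive_metadata_optional set to true. The directory /space/examples/exported-nometa is a hypothetical path, the connection details are placeholders, and the import assumes the archive is read with -input_file_type archive. The commands run only if mlcp.sh is on your PATH.

```shell
# Build "-copy_collections false -copy_permissions false ..." for all five options.
copy_flags=
for kind in collections permissions properties quality metadata; do
  copy_flags="$copy_flags -copy_$kind false"
done

if command -v mlcp.sh >/dev/null 2>&1; then
  # Export documents without metadata. $copy_flags is deliberately unquoted
  # so that the shell splits it into separate arguments.
  mlcp.sh export -host localhost -port 8000 -username user \
      -password password -mode local \
      -output_file_path /space/examples/exported-nometa \
      -output_type archive $copy_flags

  # Import the metadata-free archive.
  mlcp.sh import -host localhost -port 8000 -username user \
      -password password -mode local \
      -input_file_path /space/examples/exported-nometa \
      -input_file_type archive -archive_metadata_optional true
fi
```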

When filtering with a document selector, the XPath filtering expression should select fragment roots only. An XPath expression that selects nodes below the root is very inefficient.

When using -document_selector to filter by XPath expression, you can define namespace prefixes using the -path_namespace option. For example:

-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2'
-document_selector '/ex1:elem[ex2:attr > 10]'

Understanding When Filters Are Accurate

When you use -directory_filter, -collection_filter, or -document_selector without -query_filter, the set of documents selected by mlcp exactly matches your filtering criteria.

The query you supply with -query_filter is used in an unfiltered search, which means there can be false positives among the selected documents. When you combine -query_filter with -directory_filter, -collection_filter, or -document_selector, mlcp might select documents that do not meet your directory, collection, or path filter criteria.

The interaction between -query_filter and the other filtering options is similar to the following. In this example, the search can match documents that are not in the parts collection.

-collection_filter parts 
-query_filter yourSerializedQuery
==> selects the documents to export similar to the following:

cts:search(
  fn:collection("parts"), 
  yourQuery, 
  ("unfiltered"))

For a complete example using -query_filter, see Example: Exporting Documents Matching a Query.

To learn more about the implications of unfiltered searches, see Fast Pagination and Unfiltered Searches in the Query Performance and Tuning Guide.

Example: Exporting Documents Matching a Query

This example demonstrates how to use -query_filter to select documents for export. You can apply the same technique to filtering the source documents when copying documents from one database to another.

The -query_filter option accepts a serialized XML cts:query or JSON cts.query as its value. For example, the following shows the serialization of a cts word query in each format, prettyprinted for readability:

XML:

<cts:word-query xmlns:cts="http://marklogic.com/cts">
  <cts:text xml:lang="en">mark</cts:text>
</cts:word-query>

JSON:

{"wordQuery":{
  "text":["huck"], 
  "options":["lang=en"]
}}

For details on how to obtain the serialized representation of a cts query, see Serializations of cts:query Constructors in the Search Developer's Guide.

Using an options file is recommended when using -query_filter because both XML and JSON serialized queries contain quotes and other characters that have special meaning to the Unix and Windows command shells, making it challenging to properly escape the query. If you use -query_filter on the command line, you must quote the serialized query and may need to do additional special character escaping.

For example, you can create an options file similar to the following. It should contain at least 2 lines: One for the option name and one for the serialized query. You can include other options in the file. For details, see Options File Syntax.

XML:

-query_filter
<cts:word-query xmlns:cts="http://marklogic.com/cts"><cts:text xml:lang="en">mark</cts:text></cts:word-query>

JSON:

-query_filter
{"wordQuery":{"text":["huck"], "options":["lang=en"]}}
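
One way to create such a file from a Unix shell is a quoted here-document, which writes the query exactly as typed with no shell escaping. This is a sketch that uses the JSON form of the query; the file is written to the current directory.

```shell
# The quoted 'EOF' delimiter prevents the shell from interpreting the
# quotes, brackets, and braces in the serialized query.
cat > query_filter.txt <<'EOF'
-query_filter
{"wordQuery":{"text":["huck"], "options":["lang=en"]}}
EOF

# Show the result: the option name on line 1, the serialized query on line 2.
cat query_filter.txt
```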

If you save one of the above options files as query_filter.txt, then the following mlcp command exports the documents that match the query (those containing the word huck for the JSON form, or mark for the XML form):

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local -output_file_path \
    /space/mlcp/export/files -options_file query_filter.txt

You can combine -query_filter with another filtering option. For example, the following command combines the query with a collection filter. The command exports only documents containing the word huck in the collection named classics:

$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -mode local -output_file_path \
    /space/mlcp/export/files -options_file query_filter.txt \
    -collection_filter classics

The documents selected by -query_filter can include false positives, including documents that do not match other filter criteria. For details, see Understanding When Filters Are Accurate.

The following example demonstrates generating a serialized XML cts:and-query or JSON cts.andQuery using the wrapper technique. Copy either example into Query Console, select the appropriate query type, and run it to see the output.

XQuery:

xquery version "1.0-ml";
let $query := cts:and-query((
  cts:word-query("mark"), 
  cts:word-query("twain")
))
let $q := xdmp:quote(
  <query>{$query}</query>/*, 
  <options xmlns="xdmp:quote"><indent>no</indent></options>
)
return $q

(: Output: (whitespace added for readability)
<cts:and-query xmlns:cts="http://marklogic.com/cts">
  <cts:word-query>
    <cts:text xml:lang="en">mark</cts:text>
  </cts:word-query>
  <cts:word-query>
    <cts:text xml:lang="en">twain</cts:text>
  </cts:word-query>
</cts:and-query>
:)

Server-Side JavaScript:

var wrapper = 
  { query:
      cts.andQuery([
        cts.wordQuery("huck"),
        cts.wordQuery("tom")])
  };
xdmp.quote(wrapper.query.toObject())

/* Output: (whitespace added for readability)
{"andQuery":{
  "queries":[
    {"wordQuery":{"text":["huck"], "options":["lang=en"]}},
    {"wordQuery":{"text":["tom"], "options":["lang=en"]}}
  ]
}}
*/

Notice that in the XML example, the xdmp:quote indent option is used to disable XML prettyprinting, making the output better suited for inclusion on the mlcp command line:

xdmp:quote(
  <query>{$query}</query>/*, 
  <options xmlns="xdmp:quote"><indent>no</indent></options>
)

Notice that in the JavaScript example, it is necessary to call toObject on the wrapped query to get the proper JSON serialization. Using toObject converts the value to a JavaScript object which xdmp.quote will serialize as JSON.

xdmp.quote(wrapper.query.toObject())

If you want to test your serialized query before using it with mlcp, you can round-trip your XML query with cts:search in XQuery or your JSON query with cts.search or the JSearch API in Server-Side JavaScript, as shown in the following examples.

XQuery:

xquery version "1.0-ml";
let $wrapper := 
  <query>{
    cts:and-query((
      cts:word-query("tom"),
      cts:word-query("huck")))
  }</query>
let $q := xdmp:quote(
  $wrapper/*, 
  <options xmlns="xdmp:quote"><indent>no</indent></options>)
return cts:search(
  fn:doc(), 
  cts:query(xdmp:unquote($q)/*[1])
)

Server-Side JavaScript:

var wrapper = 
  { query:
      cts.andQuery([
        cts.wordQuery("huck"),
        cts.wordQuery("tom")])
  };
var serializedQ = xdmp.quote(wrapper.query.toObject())
cts.search(
  cts.query(fn.head(xdmp.unquote(serializedQ)).root))

Note that xdmp:unquote returns a document node in XQuery, so you need to use XPath to address the underlying query element root node when reconstructing the query:

cts:query(xdmp:unquote($q)/*[1])

Similarly, xdmp.unquote in JavaScript returns a Sequence of document nodes, so you must dereference both the iterator and the document node when reconstructing the query:

cts.query(fn.head(xdmp.unquote(serializedQ)).root)

Filtering Forest Contents

This section covers options available for filtering what is extracted from a forest when you use Direct Access; that is, when you use the mlcp import command with -input_file_type forest, or the mlcp extract command.

By default, mlcp extracts all documents in the input forests. That is, mlcp extracts the equivalent of fn:collection(). The following options allow you to filter what is extracted from a forest with Direct Access. These options can be combined.

  • -type_filter: Extract only documents with the listed content type (text, XML, or binary).
  • -directory_filter: Extract only the documents in the listed database directories.
  • -collection_filter: Extract only the documents in the listed collections.

For example, the following combination of options extracts only XML documents in the collections named 2004 or 2005.

mlcp.sh extract -type_filter xml -collection_filter "2004,2005" ...

Similarly, the following options import only binary documents in the source database directory /images/:

mlcp.sh import -input_file_type forest \
    -type_filter binary -directory_filter /images/

When you use Direct Access, filtering is performed in the process that reads the forest files rather than being performed by MarkLogic Server. For example, in local mode, filters are applied by mlcp on the host where you run it; in distributed mode, filters are applied by each Hadoop task that reads in forest data.

In addition, filtering cannot be applied until after a document is read from the forest. When you import or extract files from a forest file, mlcp must touch every document in the forest.

For details, see Using Direct Access to Extract or Copy Documents.

Extracting a Consistent Database Snapshot

By default, when you export or copy database contents, content is extracted from the source database at multiple points in time. You get whatever is in the database when mlcp accesses a given document. If the database contents are changing while the job runs, the results are not deterministic relative to the starting time of the job. For example, if a new document is inserted into the database while an export job is running, it might or might not be included in the export.

If you require a consistent snapshot of the database contents during an export or copy, use the -snapshot option to force all documents to be read from the database at a consistent point in time. The submission time of the job is used as the timestamp. Any changes to the database occurring after this time are not reflected in the output.

If a merge occurs while exporting or copying a consistent snapshot, and the merge eliminates a fragment that is subsequently accessed by the mlcp job, you may get an XDMP-OLDSTAMP error. If this occurs, the documents included in the same batch or task may not be included in the export/copy result. If the source database is on MarkLogic Server 7 or later, you may be able to work around this problem by setting the merge timestamp to retain fragments for a time period longer than the expected running time of the job; for details, see Understanding and Controlling Database Merges in the Administrator's Guide.
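
A snapshot export might look like the following sketch. It assumes -snapshot takes a boolean value like the other true/false options in this chapter. The destination is a hypothetical timestamped directory under /space/examples, chosen so that the archive directory does not already exist. The command runs only if mlcp.sh is on your PATH.

```shell
# Hypothetical destination; the timestamp keeps each run in a new directory.
out_dir=/space/examples/snapshot-$(date +%Y%m%d%H%M%S)

if command -v mlcp.sh >/dev/null 2>&1; then
  # Read every document as of the job submission time.
  mlcp.sh export -host localhost -port 8000 -username user \
      -password password -mode local \
      -output_file_path "$out_dir" -output_type archive \
      -snapshot true
fi
```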

Redacting Content During Export or Copy Operations

Redaction is the process of eliminating or obscuring portions of a document when retrieving the document from MarkLogic. For example, you can eliminate or mask sensitive personal information such as credit card numbers, phone numbers, or email addresses from documents. You can only redact document content, not document properties.

Using redaction requires the Advanced Security License option.

Redaction support in MarkLogic is covered in detail in Redacting Document Content in the Application Developer's Guide. This section describes how to use mlcp as the redaction driver.

Basic Steps for Redacting Documents

Use the -redaction option of mlcp to apply redaction rules to an export or copy operation. This option accepts a comma-separated list of redaction rule collection URIs. For example:

-redaction "pii-rules,sec-rules"

Before you can use redaction, you must install one or more redaction rule sets in the Schemas database. For details on defining and installing redaction rules, see Redacting Document Content in the Application Developer's Guide.

Preparing to redact documents with mlcp requires the following steps. For a complete example, see Example: Using mlcp for Redaction.

  1. Install one or more redaction rules in the Schemas database. Each rule must be part of at least one collection. For details, see Defining Redaction Rules and Installing Redaction Rules in the Application Developer's Guide.
  2. If you create a rule that uses a user-defined redaction function, install the implementation of your redaction function in the modules database associated with the App Server you will connect to using mlcp. For details, see User-Defined Redaction Functions in the Application Developer's Guide.
  3. Add the -redaction option to your mlcp command line. For example, the following command applies the rules in the collections pii-rules and sec-rules to all exported documents.
    # Windows users, see Modifying the Example Commands for Windows 
    $ mlcp.sh export -host localhost -port 8000 -username user \
        -password password -mode local -output_file_path \
        /space/mlcp/export/files -directory_filter /people/ \
        -redaction "pii-rules,sec-rules"

The -redaction option works similarly for copy operations. For details, see Redacting Content During a Copy.

The user who extracts redacted documents must have read permissions on the source documents and the rules, but need not be able to modify the rule collection or rule definitions. For details, see Security Considerations in the Application Developer's Guide.

The following behaviors apply when exceptional conditions occur. You should be aware of these behaviors so you understand when content might not be redacted as expected:

  • If a rule collection is empty, mlcp issues a warning and continues with the job.
  • If any of the rules contain errors, an error is reported and mlcp aborts the export or copy operation.
  • If a rule is valid, but an error occurs when applying the rule, the rule is skipped for the current document and a warning is logged. The job continues.

Example: Using mlcp for Redaction

This example walks you through using mlcp to install and apply redaction rules based on the built-in redaction functions. For a similar example using XQuery and Query Console, see Example: Getting Started With Redaction in the Application Developer's Guide.

This example uses rules based on built-in redaction functions. For an example of using user-defined redaction functions, see User-Defined Redaction Functions in the Application Developer's Guide.

Creating a Work Area

This example assumes the following directory hierarchy:

redact-gs/
  data/
  rules/

The data/ directory will hold the source documents. The rules/ directory will hold redaction rules. The example walks you through populating these directories and uploading the contents to MarkLogic using mlcp in preparation for exporting a set of redacted documents with mlcp.

Create the required directories on Linux by running the following command in a location of your choosing:

$ mkdir -p redact-gs/data redact-gs/rules

Create the required directories on Windows by running the following command in a location of your choice:

>mkdir redact-gs\data redact-gs\rules

Installing the Source Documents

When you complete this exercise, the Documents database should contain the following documents. The documents are inserted into a collection named gs-samples for easy reference.

  • /redact-gs/sample1.xml
  • /redact-gs/sample2.json

Follow the steps in this procedure to install two sample documents in the Documents database.

  1. Change directory to the data directory you created in Creating a Work Area. You should be in your redact-gs/data directory.
  2. Copy the following text into a file named sample1.xml:
    <personal>
      <name>Little Bopeep</name>
      <summary>Seeking lost sheep. Please call 123-456-7890.</summary>
      <id>12-3456789</id>
    </personal>
  3. Copy the following text into a file named sample2.json:
    {"personal": {
      "name": "Jack Sprat", 
      "summary": "Free nutrition advice! Call (234)567-8901 now!",
      "id": "45-6789123"
    }}
  4. Run the following mlcp command to insert the sample documents into the Documents database. Modify the connection details as needed to match your environment.
    $ mlcp.sh import -host localhost -port 8000 \
        -username user -password password -mode local \
        -input_file_path . \
        -output_uri_replace ".*/redact-gs/data/,'/redact-gs/'" \
        -output_collections "gs-samples"

You can use Query Console to explore the Documents database and confirm the upload.

The use of -output_uri_replace on the import command line replaces the portion of the default URI that is based on the filesystem location with the fixed directory prefix /redact-gs/. For more details, see Controlling Database URIs During Ingestion.
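
To preview the effect of the pattern and replacement used in the import command, you can apply the same substitution with sed. This is only an approximation of what -output_uri_replace does, and /home/me is a hypothetical location for the redact-gs work area.

```shell
# Default URI mlcp would derive from the file's location (hypothetical path).
default_uri=/home/me/redact-gs/data/sample1.xml

# Same pattern and replacement as -output_uri_replace ".*/redact-gs/data/,'/redact-gs/'".
final_uri=$(printf '%s\n' "$default_uri" | sed 's#.*/redact-gs/data/#/redact-gs/#')

echo "$final_uri"   # prints /redact-gs/sample1.xml
```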

Installing the Redaction Rules

Rules must be installed in the schemas database associated with your content database. Rules must also be part of a collection before you can use them. This section installs rules in the Schemas database, which is the default schemas database associated with the Documents database.

When you complete this exercise, the Schemas database should contain the following documents. The documents are inserted into a rule collection named gs-rules.

  • /rules/gs/redact-phone.xml
  • /rules/gs/conceal-id.json

The rules installed in this step use the redact-us-phone and conceal built-in redaction functions. For details on these and other built-in redaction functions, see Built-in Redaction Function Reference in the Application Developer's Guide.

Follow the steps in this procedure to install two sample rules in the Schemas database. For an explanation of what the rules do, see Understanding the Example Rules.

  1. Change directory to the rules directory you created in Creating a Work Area. You should be in your redact-gs/rules directory.
  2. Copy the following text into a file named redact-phone.xml.
    <rule xml:lang="zxx" xmlns="http://marklogic.com/xdmp/redaction">
      <description>Obscure phone numbers.</description>
      <path>//summary</path>
      <method>
        <function>redact-us-phone</function>
      </method>
      <options>
        <level>partial</level>
      </options>
    </rule>
  3. Copy the following text into a file named conceal-id.json:
    { "rule": {
        "description": "Remove customer ids.",
        "path": "//id",
        "method": { "function": "conceal" }
    }}
  4. Run the following mlcp command to insert the rules into the Schemas database. Modify the connection details as needed to match your environment.
    $ mlcp.sh import -host localhost -port 8000 \
        -username user -password password -mode local \
        -database Schemas -input_file_path . \
        -output_uri_replace ".*/redact-gs/rules/,'/rules/gs/'" \
        -output_collections "gs-rules"

You can use Query Console to explore the Schemas database and confirm the upload.

The use of -output_uri_replace on the import command line replaces the portion of the default URI that is based on the filesystem location with the fixed directory prefix /rules/gs. For more details, see Controlling Database URIs During Ingestion.

Understanding the Example Rules

The XML rule installed in Installing the Redaction Rules has the following form:

<rule xml:lang="zxx" xmlns="http://marklogic.com/xdmp/redaction">
  <description>Obscure phone numbers.</description>
  <path>//summary</path>
  <method>
    <function>redact-us-phone</function>
  </method>
  <options>
    <level>partial</level>
  </options>
</rule>

The rule elements have the following effect:

  • description - Optional metadata for informational purposes.
  • path - Apply the redaction function specified by the rule to nodes selected by the path expression //summary.
  • method - Use the built-in redaction function redact-us-phone to redact the value in a summary XML element or JSON property. By default, this function replaces all digits in a phone number by the character #. You can tell this is a built-in function because method has no module child.
  • options - Pass a level parameter value of partial to redact-us-phone, causing the function to leave the last 4 digits of the value unchanged.

The expected result of applying this rule is that any text in the value of a node named summary that matches the pattern of a US phone number will be replaced. The replacement value uses the # character to replace all but the last 4 digits. For example, a value such as 123-456-7890 is redacted to ###-###-7890. For more details, see redact-us-phone in the Application Developer's Guide.
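To make the partial masking concrete, the following Python sketch imitates the visible effect of the rule on a string. It is not the implementation of redact-us-phone: the regular expression and the function name mask_phone_partial are assumptions chosen to cover the two phone number shapes used in this example, and the built-in function may recognize other formats.

```python
import re

# Assumed pattern: optional parentheses around the area code and
# optional "-", "." or space separators between the digit groups.
PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")

def mask_phone_partial(text):
    """Replace every digit of each phone number except the last 4 with '#'."""
    def mask(match):
        number = match.group(0)
        head, tail = number[:-4], number[-4:]
        return re.sub(r"\d", "#", head) + tail
    return PHONE.sub(mask, text)

print(mask_phone_partial("Seeking lost sheep. Please call 123-456-7890."))
```

Text that does not look like a phone number is left untouched, which mirrors how the rule only alters the matching portion of the summary value.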

The JSON rule installed in Installing the Redaction Rules has the following form:

{ "rule": {
    "description": "Remove customer ids.",
    "path": "//id",
    "method": { "function": "conceal" }
}}

The rule properties have the following effect:

  • description - Optional metadata for informational purposes.
  • path - Apply the redaction function specified by the rule to nodes selected by the path expression //id.
  • method - Use the built-in redaction function conceal to redact the id XML element or JSON property. This function will hide the nodes selected by path. You can tell this is a built-in function because method has no module child.

The expected result of applying this rule is to remove nodes named id. For example, if //id selects an XML element or JSON property, the element or property does not appear in the redacted output. Note that, if //id selects array items in JSON, the items are eliminated, but the id property might remain, depending on the structure of the document. For more details, see conceal in the Application Developer's Guide.
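The following Python sketch imitates the outcome of the conceal rule on a JSON document that has already been parsed into dictionaries and lists. It covers only the simple case in which the selected node is a property; it does not model the array item behavior noted above, and the function name conceal_property is an assumption for this sketch, not part of MarkLogic.

```python
def conceal_property(node, name):
    """Return a copy of node with every property called name removed."""
    if isinstance(node, dict):
        return {key: conceal_property(value, name)
                for key, value in node.items() if key != name}
    if isinstance(node, list):
        return [conceal_property(item, name) for item in node]
    return node

sample = {"personal": {"name": "Jack Sprat", "id": "45-6789123"}}
print(conceal_property(sample, "id"))
```

The input is not modified; a new structure without the concealed property is returned, in the same way that redaction affects the exported copy rather than the document in the database.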

Applying the Redaction Rules

Run the following command from your redact-gs/ directory to export redacted versions of the sample documents. Modify the connection details as needed to match your environment. A collection filter (-collection_filter "gs-samples") is used to select the documents for redaction and export.

$ mlcp.sh export -host localhost -port 8000 \
    -username user -password password -mode local \
    -collection_filter "gs-samples" \
    -output_file_path ./output/ \
    -redaction "gs-rules"

Running the export command saves the redacted documents to an output/ sub-directory. You should have the following filesystem hierarchy. The extra redact-gs sub-directory is created by mlcp because the document URIs are of the form /redact-gs/filename.

redact-gs/
  output/
    redact-gs/
      sample1.xml
      sample2.json
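The relationship between a document URI and its exported location in this example can be sketched as a simple path join. The Python snippet below is an illustration under that assumption only; export_path is a hypothetical helper, and it does not describe how mlcp handles URIs that contain characters unsuitable for file names.

```python
from pathlib import PurePosixPath

def export_path(output_dir, document_uri):
    """Join the output directory with the URI, treating the URI as a relative path."""
    return str(PurePosixPath(output_dir) / document_uri.lstrip("/"))

print(export_path("output", "/redact-gs/sample1.xml"))
```

Because the URI itself begins with /redact-gs/, that directory reappears beneath output/.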

The following table shows the result of redacting the XML sample document. Notice that the telephone number in the summary element has been partially redacted by the redact-us-phone function. Also, the id element has been completely hidden by the conceal function. The affected parts of the content are highlighted in the table.

Stage XML Content
Original Document
<personal>
  <name>Little Bopeep</name>
  <summary>Seeking lost sheep. Please call 123-456-7890.</summary>
  <id>12-3456789</id>
</personal>
Redacted Result
<personal>
  <name>Little Bopeep</name>
  <summary>Seeking lost sheep. Please call ###-###-7890.</summary>
</personal>

The following table shows the result of redacting the JSON sample document. Notice that the telephone number in the summary property has been partially redacted by the redact-us-phone function. Also, the id property has been completely hidden by the conceal function. The affected parts of the content are highlighted in the table.

Stage JSON Content
Original Document
{"personal": {
  "name": "Jack Sprat", 
  "summary": "Free nutrition advice! Call (234)567-8901 now!",
  "id": "45-6789123"
}}
Redacted Result
{"personal": {
  "name": "Jack Sprat", 
  "summary": "Free nutrition advice! Call (###)###-8901 now!"
}}

To redact documents when copying them between databases rather than exporting them, add the -redaction option to the mlcp copy command line.

Advanced Document Selection and Transformation

The mlcp tool uses the MarkLogic Connector for Hadoop to distribute work across your MarkLogic cluster, even when run in local mode. When you use the mlcp export command, MarkLogic Server acts as an input source for a Hadoop MapReduce job. The exported documents are the output of the job. You can take low level control of the job by setting connector and Hadoop configuration properties.

Setting low level configuration properties is an advanced technique. You should understand how to use the MarkLogic Connector for Hadoop before attempting this.

The following list describes some use cases in which you might choose to set low level configuration properties:

Similar use cases and techniques apply to copy operations. For details, see Advanced Document Selection for Copy.

The following table lists some connector and Hadoop configuration properties relevant to advanced configuration for export.

Configuration Property Description
mapreduce.marklogic.input.mode Controls whether the connector runs in basic or advanced mode. Set to advanced.
mapreduce.marklogic.input.splitquery A query that generates input splits. This distributes work across export tasks. The query can be either XQuery or Server-Side JavaScript.
mapreduce.marklogic.input.query A query that selects the input fragments to export. You can use the input query to apply server-side transformations to each output item. The query can be either XQuery or Server-Side JavaScript.
mapreduce.job.inputformat.class

Optional. You do not need to set this property unless your input query produces something other than documents.

This property identifies a subclass of the connector InputFormat class, describing the type of the values produced by the input query. You can create your own InputFormat subclass, but most applications will use one of the classes defined by the connector, such as DocumentInputFormat, which is the default used by mlcp.

You can pass a connector configuration file through mlcp with the -conf option. The -conf option must appear after -options_file (if present) and before any other mlcp options. The following example command demonstrates the -conf option.

$ mlcp.sh export -conf conf.xml -host localhost -port 8000 \
    -username user -password password -mode local \
    -output_file_path /space/examples/exported \
    -directory_filter /binaries/

The following example connector configuration file uses an XQuery split query (mapreduce.marklogic.input.splitquery) to distribute the documents across export tasks, and an XQuery transformation query (mapreduce.marklogic.input.query) that returns just the first 1000 bytes of each selected binary document.

<property>
  <name>mapreduce.marklogic.input.query</name>
  <value><![CDATA[
    xquery version "1.0-ml"; 
    declare namespace mlmr="http://marklogic.com/hadoop";
    declare variable $mlmr:splitstart as xs:integer external;
    declare variable $mlmr:splitend as xs:integer external;
    for $doc in fn:doc()[$mlmr:splitstart to $mlmr:splitend]
    return xdmp:subbinary($doc/binary(), 1, 1000)
  ]]></value>
</property>
<property>
  <name>mapreduce.marklogic.input.splitquery</name>
  <value><![CDATA[
    xquery version "1.0-ml"; 
    import module namespace hadoop = "http://marklogic.com/xdmp/hadoop" 
      at "/MarkLogic/hadoop.xqy"; 
    hadoop:get-splits('', 'fn:doc()', '()')
  ]]></value>
</property>
<property>
  <name>mapreduce.marklogic.input.mode</name>
  <value>advanced</value>
</property>
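
The input query above selects fn:doc()[$mlmr:splitstart to $mlmr:splitend], a 1-based, inclusive slice of the document sequence. The Python sketch below shows only what such slices look like when a sequence is cut into fixed-size ranges. It is a conceptual illustration: the function name split_ranges and the fixed-size strategy are assumptions, and it does not describe how the connector or hadoop:get-splits actually computes splits.

```python
def split_ranges(total, max_split_size):
    """Yield 1-based inclusive (start, end) ranges that cover total items."""
    start = 1
    while start <= total:
        end = min(start + max_split_size - 1, total)
        yield (start, end)
        start = end + 1

# Hypothetical numbers: 45000 items cut into ranges of at most 20000.
print(list(split_ranges(45000, 20000)))
```

Each range would correspond to one pair of splitstart and splitend values handed to the input query, so every item is processed by exactly one task.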

For more details and examples, see the MarkLogic Connector for Hadoop Developer's Guide.

Export Command Line Options

This section summarizes the command line options available with the mlcp export command. The following command line options define your connection to MarkLogic:

Option Description
-host comma-list
Required. A comma-separated list of hosts through which mlcp can connect to the source MarkLogic Server. You must specify at least one host. For more details, see How mlcp Uses the Host List.
-port number
Port number of the source MarkLogic Server. There should be an XDBC App Server on this port. Default: 8000.
-username string
MarkLogic Server user with which to export documents. Required, unless using Kerberos authentication.
-password string
Password for the MarkLogic Server user specified with -username. Required, unless using Kerberos authentication.

The following table lists command line options that define the characteristics of the export operation:

Option Description
-collection_filter comma-list
A comma-separated list of collection URIs. mlcp exports only documents in these collections, plus related metadata. This option may not be combined with -directory_filter or -document_selector. Default: All documents and related metadata.
-compress boolean
Whether or not to compress the output document. Only applicable when -output_type is document. Default: false.
-conf filename
Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-content_encoding string
The character encoding of output documents when -output_type is document. The option value must be a character set name accepted by your JVM; see java.nio.charset.Charset. Default: UTF-8. Set to system to use the platform default encoding for the host on which mlcp runs.
-copy_collections boolean
When exporting documents to an archive, whether or not to copy collections to the destination. Default: true.
-copy_metadata boolean
When exporting documents to an archive, whether or not to copy key-value metadata to the destination. Default: true.
-copy_permissions boolean
When exporting documents to an archive, whether or not to copy document permissions to the destination. Default: true.
-copy_properties boolean
When exporting documents to an archive, whether or not to copy properties to the destination. Default: true.
-copy_quality boolean
When exporting documents to an archive, whether or not to copy document quality to the destination. Default: true.
-D property=value
Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-database string
The name of the source database. Default: The database associated with the source App Server identified by -host and -port.
-directory_filter comma-list
A comma-separated list of database directory names. mlcp exports only documents from these directories, plus related metadata. Directory names should usually end with /. This option may not be combined with -collection_filter or -document_selector. Default: All documents and related metadata.
-document_selector string
Specifies an XPath expression used to select which documents are exported from the database. The XPath expression should select fragment roots. This option may not be combined with -directory_filter or -collection_filter. Default: All documents and related metadata.
-hadoop_conf_dir string
When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode.
-indented boolean
Whether to pretty-print XML output. Default: false.
-max_split_size number
The maximum number of document fragments processed per split. Default: 20000 in local mode, 50000 in distributed mode.
-mode string
Export mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local, unless you set the HADOOP_CONF_DIR variable; for details, see Configuring Distributed Mode.
-options_file string
Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_file_path string
Destination directory where the archive or documents are saved. The directory must not already exist.
-output_type string
The type of output to produce. Accepted values: document, archive. Default: document.
-path_namespace comma-list
Specifies one or more namespace prefix bindings for namespace prefixes usable in path expressions passed to -document_selector. The list items should be alternating pairs of prefix names and namespace URIs, such as 'pfx1,http://my/ns1,pfx2,http://my/ns2'.
-query_filter string
Specifies a query to apply when selecting documents for export. The argument must be the XML serialization of a cts:query or JSON serialization of a cts.query. Only documents matching the query are considered for export; false positives are possible. For details, see Controlling What is Exported, Copied, or Extracted.
-redaction comma-list
Apply one or more redaction rule collections. The argument must be a comma-separated list of rule collection URIs. The rule collections must be installed in the Schemas database. For details and examples, see Redacting Content During Export or Copy Operations and Redacting Document Content in the Application Developer's Guide.
-restrict_hosts boolean
Restrict mlcp to connect to MarkLogic only through the hosts listed in the -host option. Default: false (no restriction). For more details, see Restricting the Hosts mlcp Uses to Connect to MarkLogic.
-snapshot boolean
Whether or not to export a consistent point-in-time snapshot of the database contents. Default: false. When true, the job submission time is used as the database read timestamp for selecting documents to export. For details, see Extracting a Consistent Database Snapshot.
-ssl boolean
Enable/disable SSL secured communication with MarkLogic. Default: false. If you set this option to true, your App Server must be SSL enabled. For details, see Connecting to MarkLogic Using SSL.
-ssl_protocol string
Specify the protocol mlcp should use when creating an SSL connection to MarkLogic. You must include this option if you use the -ssl option to connect to an App Server configured to disable MarkLogic's default protocol (TLS). Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: tls. This option is ignored if you use a Hadoop Connector conf file for SSL configuration; for details, see Advanced SSL Configuration.
-thread_count number
The number of threads to spawn for concurrent exporting. The total number of threads spawned by the process can be larger than this number, but this option caps the number of concurrent sessions with MarkLogic Server. Only available in local mode. Default: 4.
