mlcp User Guide — Chapter 7

Using Direct Access to Extract or Copy Documents

Direct Access enables you to bypass MarkLogic Server and extract documents from a database by reading them directly from the on-disk representation of a forest. This feature is best suited for accessing documents in archived, offline forests.

This section covers the following topics:

When to Consider Using Direct Access
Limitations of Direct Access
Choosing Between Export and Extract
Extracting Documents as Files
Importing Documents from a Forest into a Database
Extract Command Line Options

When to Consider Using Direct Access

Direct Access enables you to extract documents directly from an offline or read-only forest without going through MarkLogic Server. A forest is the internal representation of a collection of documents in a MarkLogic database; for details, see Understanding Forests in the Administrator's Guide. A database can span multiple forests on multiple hosts.

Direct Access is primarily intended for accessing archived data that is part of a tiered storage deployment; for details, see Tiered Storage in the Administrator's Guide. You should only use Direct Access on a forest that is offline or read-only; for details, see Limitations of Direct Access.

For example, suppose you have data that ages out over time: you must retain it, but it no longer needs to be available for real-time queries through MarkLogic Server. You can archive the data by taking the containing forests offline and still access the contents using Direct Access.

Use Direct Access with mlcp to access documents in offline and read-only forests in the following ways:

The mlcp extract command extracts archived documents from a database as flat files. This operation is similar to exporting documents from a database to files, but does not require a source MarkLogic Server instance. For details, see Choosing Between Export and Extract.
The mlcp import command with -input_file_type forest imports archived documents into another database as live documents. A destination MarkLogic Server instance is required, but no source instance.

Since Direct Access bypasses the active data management performed by MarkLogic Server, you should not use it on forests receiving document updates. Additional restrictions apply. For details, see Limitations of Direct Access.

Limitations of Direct Access

You should only use Direct Access on a forest that meets one of the following criteria:

The forest is offline and not in an error state. A forest is offline if the availability is set to offline, or the forest or the database to which it is attached is disabled. For details, see Taking Forests and Partitions Online and Offline in the Administrator's Guide.
The forest is online, but the updates-allowed state of the forest is read-only. For details, see Setting the Updates-allowed State on Partitions in the Administrator's Guide.

The following additional limitations apply to using Direct Access:

Accessing documents with Direct Access bypasses security roles and privileges. The content is protected only by the filesystem permissions on the forest data.
Direct Access cannot take advantage of indexing or caching when accessing documents. Every document in each participating forest is read, even when you use filtering criteria such as -directory_filter or -type_filter. Filtering can only be applied after reading a document off disk.
Direct Access skips property fragments.
Direct Access skips documents partitioned into multiple fragments. For details, see Fragments in the Administrator's Guide.
Older versions of mlcp might not be able to read forest data from MarkLogic 9 or later. For best results, use the version of mlcp that corresponds to your MarkLogic version.
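
Because filesystem permissions are the only protection for forest data read through Direct Access, it can help to audit them before relying on an archived forest. The following is a minimal sketch, not part of mlcp: the function name world_readable is hypothetical, and it assumes a POSIX-compatible find. It lists every path under a forest directory that users outside the owner and group can read.

```shell
# Hypothetical helper (not part of mlcp): list paths under the given forest
# directory whose "other" read permission bit is set.
# No output means nothing under the directory is world-readable.
world_readable() {
    find "$1" -perm -004 -print
}
```

For example, world_readable /var/opt/MarkLogic/Forests/example prints nothing when no file or directory in that forest is readable by other users.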

When you use Direct Access, mlcp skips any forest (or a stand within a forest) that is receiving updates or that is in an error state. Processing continues even when some documents are skipped.

When you use mlcp with Direct Access, your forest data must be reachable from the host(s) processing the input. In local mode, the forests must be reachable from the host on which you execute mlcp.

If mlcp accesses large or external binaries with Direct Access, then the reachability requirement also applies to the large data directory and any external binary directories. Furthermore, these directories must be reachable along the same path as when the forest was online.
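
A quick pre-flight check can catch reachability problems before mlcp starts. The sketch below is illustrative only: forests_reachable is a hypothetical helper, not an mlcp feature. It verifies that each path exists as a directory that the current user can read and traverse.

```shell
# Hypothetical helper (not part of mlcp): succeed only if every path given
# as an argument is an existing directory that the current user can read
# and traverse. Reports the first failing path on stderr.
forests_reachable() {
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
            printf 'not reachable: %s\n' "$dir" >&2
            return 1
        fi
    done
}
```

You could pass the forest directories, the large data directory, and any external binary directories to the same function, and run mlcp only if it succeeds.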

Choosing Between Export and Extract

You can use the export and extract commands to save content in a MarkLogic database to files on the native file system. You should usually use export rather than extract. The extract command is best suited for archived data in offline or read-only forests. Otherwise, use the export command.

The extract command places no load on MarkLogic Server. The export command offloads most of the work to your MarkLogic cluster. Thus, export honors document permissions, takes advantage of database indexes, and can apply transformations and filtering at the server. By contrast, extract bypasses security (other than file permissions on the forest files), must access all documents sequentially, and applies a limited set of filters on the client.

The export command offers a richer set of filtering options than extract. In addition, export only accesses the documents selected by your options, while extract must scan the entirety of each input forest, even when extracting selected documents.
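
To make the contrast concrete, the sketch below wraps one export command and one extract command that select the same database directory. It is an illustration under assumptions: the MLCP variable and the function names are hypothetical conveniences, the connection values reuse those from the import example in this chapter, and the export output path is made up. The export form needs a running MarkLogic Server instance; the extract form needs only the forest files.

```shell
# Hypothetical override hook: defaults to mlcp.sh found on PATH.
MLCP="${MLCP:-mlcp.sh}"

# export: connects to MarkLogic Server, so connection options are required
# and the server applies the directory filter.
export_plays() {
    "$MLCP" export -mode local \
        -host localhost -port 8000 -username user -password password \
        -output_file_path /space/mlcp/exported/files \
        -directory_filter /plays/
}

# extract: reads the forest files directly, so there are no connection
# options, and the filter is applied on the client after each document is read.
extract_plays() {
    "$MLCP" extract -mode local \
        -input_file_path /var/opt/MarkLogic/Forests/example \
        -output_file_path /space/mlcp/extracted/files \
        -directory_filter /plays/
}
```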

Extracting Documents as Files

Use the mlcp extract command to extract documents from archival forest files to files on the native filesystem. For example, you can extract an XML document as a text file containing XML, or a binary document as a JPG image.

To extract documents from a forest as files:

Set -input_file_path to the path of the input forest directory. To specify multiple forests, use a comma-separated list of paths.
Select the documents to extract. For details, see Filtering Forest Contents.
- To select documents in one or more collections, set -collection_filter to a comma-separated list of collection URIs.
- To select documents in one or more database directories, set -directory_filter to a comma-separated list of directory URIs.
- To select documents by document type, set -type_filter to a comma-separated list of document types.
- To select all documents in the input forests, leave -collection_filter, -directory_filter, and -type_filter unset.
Set -output_file_path to the destination file or directory on the native filesystem. This directory must not already exist.
Set -mode to local:
- Your input forests must be reachable from the host where you execute mlcp.
To write the extracted documents to compressed files, set -compress to true.

Filtering options can be combined. Directory names specified with -directory_filter should end with /. All filters are applied on the client, so every document is accessed, even if it is filtered out of the output document set.
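
The steps above can be sketched as a single command that reads two forests, combines all three filters, and compresses the output. Treat it as an illustration rather than a recommended invocation: the forest names example-1 and example-2, the collection URI, the MLCP variable, and the extract_filtered function are all hypothetical. The guard reflects the requirement that the output directory must not already exist.

```shell
# Hypothetical override hook: defaults to mlcp.sh found on PATH.
MLCP="${MLCP:-mlcp.sh}"

# Usage: extract_filtered OUTPUT_DIR
extract_filtered() {
    out_dir=$1
    # The output directory must not already exist, so stop early if it does.
    if [ -e "$out_dir" ]; then
        printf 'output path already exists: %s\n' "$out_dir" >&2
        return 1
    fi
    # Two input forests (hypothetical names) as a comma-separated list,
    # with collection, directory, and type filters combined.
    "$MLCP" extract -mode local \
        -input_file_path /var/opt/MarkLogic/Forests/example-1,/var/opt/MarkLogic/Forests/example-2 \
        -output_file_path "$out_dir" \
        -collection_filter http://example.com/collections/tragedies \
        -directory_filter /plays/ \
        -type_filter xml \
        -compress true
}
```

Every document in both forests is still read from disk; the filters only limit what is written to the output.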

Document URIs are URI-decoded before filesystem directories or filenames are constructed for them. For details, see How URI Decoding Affects Output File Names.

For a full list of extract options, see Extract Command Line Options.

The following example extracts selected documents from the forest files in /var/opt/MarkLogic/Forests/example to the native filesystem directory /space/mlcp/extracted/files. The directory filter selects only the input documents in the database directory /plays.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh extract -mode local \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_file_path /space/mlcp/extracted/files \
    -directory_filter /plays/

Importing Documents from a Forest into a Database

Use the following procedure to load all the files in a native forest directory and its sub-directories. To load selected files, see Filtering Documents Loaded From a Directory. For more details on the command line options used in this procedure, see Import Command Line Options.

Set -input_file_path to the path of the input forest directory. To specify multiple forests, use a comma-separated list of paths.
Set -input_file_type to forest.
Specify the connection information for the destination database using -host, -port, -username, and -password.
Select the documents to import from the input forests. For details, see Filtering Forest Contents. Filtering options can be used together.
- To select documents in one or more collections, set -collection_filter to a comma-separated list of collection URIs.
- To select documents in one or more database directories, set -directory_filter to a comma-separated list of directory URIs.
- To select documents by document type, set -type_filter to a comma-separated list of document types.
- To select all documents in the input forests, leave -collection_filter, -directory_filter, and -type_filter unset.
If you want to exclude some or all of the document metadata in the forests:
- Set -copy_collections to false to exclude document collections metadata.
- Set -copy_quality to false to exclude document quality metadata.
- Set -copy_metadata to false to exclude key-value metadata.
Set -mode to local (this is the default mode):
- Your input forests and the destination MarkLogic Server instance must be reachable from the host where you run mlcp.

By default, an imported document has a database URI based on the input file path. You can customize the URI using options. For details, see Controlling Database URIs During Ingestion.

The following example command loads the documents in the forests in /var/opt/MarkLogic/Forests/example:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_type forest \
    -input_file_path /var/opt/MarkLogic/Forests/example
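
As a variation on the command above, the following sketch adds the optional steps from the procedure: it selects only XML documents in the /plays/ database directory and leaves out collection and quality metadata. The MLCP variable and the import_plays function are hypothetical wrappers added for illustration; the options themselves are the ones described in this section.

```shell
# Hypothetical override hook: defaults to mlcp.sh found on PATH.
MLCP="${MLCP:-mlcp.sh}"

# Import XML documents under /plays/ from an archived forest into the
# destination database, without their collection or quality metadata.
import_plays() {
    "$MLCP" import -mode local \
        -host localhost -port 8000 -username user -password password \
        -input_file_type forest \
        -input_file_path /var/opt/MarkLogic/Forests/example \
        -directory_filter /plays/ \
        -type_filter xml \
        -copy_collections false \
        -copy_quality false
}
```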

Extract Command Line Options

This section summarizes the command line options available with the mlcp extract command. An extract command requires the -input_file_path and -output_file_path options. That is, an extract command has the following form:

mlcp.sh extract -input_file_path forest-path \
    -output_file_path dest-path ...

The following table lists command line options that define the characteristics of the extraction:

Option	Description
-collection_filter comma-list	A comma-separated list of collection URIs. mlcp extracts only documents in these collections. This option can be combined with other filter options. Default: All documents.
-compress boolean	Whether or not to compress the output. mlcp might generate multiple compressed files. Default: `false`.
-directory_filter comma-list	A comma-separated list of database directory names. mlcp extracts only documents from these directories, plus related metadata. Directory names should usually end with /. This option can be combined with other filter options. Default: All documents and related metadata.
-max_split_size number	The maximum number of document fragments processed per split. Default: 50000.
-mode string	Export mode. Accepted values: `local`.
-options_file string	Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_file_path string	Destination directory where the documents are saved. The directory must not already exist.
-thread_count number	The number of threads to spawn for concurrent exporting. The total number of threads spawned by the process can be larger than this number, but this option caps the number of concurrent sessions with MarkLogic Server. Only available in `local` mode. Default: 4.
-type_filter comma-list	A comma-separated list of document types. mlcp extracts only documents with these types. This option can be combined with other filter options. Allowed document types: `xml`, `text`, `binary`. Default: All documents.
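
Several of the options in the table can be collected in an options file. The sketch below writes one, assuming the layout described in Options File Syntax: one option name or one value per line, with # comments on their own lines. The file name extract_options.txt and the thread count of 8 are arbitrary choices for illustration.

```shell
# Write a hypothetical options file for an extract run.
# Each option name and each value sits on its own line.
cat > extract_options.txt <<'EOF'
# Extract /plays/ from the example forest using 8 threads
-mode
local
-input_file_path
/var/opt/MarkLogic/Forests/example
-output_file_path
/space/mlcp/extracted/files
-directory_filter
/plays/
-thread_count
8
EOF

# -options_file must appear before any other option on the command line:
#   mlcp.sh extract -options_file extract_options.txt
```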