
mlcp User Guide — Chapter 7

Using Direct Access to Extract or Copy Documents

Direct Access enables you to bypass MarkLogic Server and extract documents from a database by reading them directly from the on-disk representation of a forest. This feature is best suited for accessing documents in archived, offline forests.

This section covers the following topics:

  • When to Consider Using Direct Access
  • Limitations of Direct Access
  • Choosing Between Export and Extract
  • Extracting Documents as Files
  • Importing Documents from a Forest into a Database
  • Extract Command Line Options

When to Consider Using Direct Access

Direct Access enables you to extract documents directly from an offline or read-only forest without going through MarkLogic Server. A forest is the internal representation of a collection of documents in a MarkLogic database; for details, see Understanding Forests in the Administrator's Guide. A database can span multiple forests on multiple hosts.

Direct Access is primarily intended for accessing archived data that is part of a tiered storage deployment; for details, see Tiered Storage in the Administrator's Guide. You should only use Direct Access on a forest that is offline or read-only; for details, see Limitations of Direct Access.

For example, if you have data that ages out over time such that you need to retain it, but you do not need to have it available for real time queries through MarkLogic Server, you can archive the data by taking the containing forests offline, but still access the contents using Direct Access.

Use Direct Access with mlcp to access documents in offline and read-only forests in the following ways:

  • The mlcp extract command extracts archived documents from a database as flat files. This operation is similar to exporting documents from a database to files, but does not require a source MarkLogic Server instance. For details, see Choosing Between Export and Extract.
  • The mlcp import command with -input_file_type forest imports archived documents into another database as live documents. A destination MarkLogic Server instance is required, but no source instance.
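For example, the two workflows look like the following commands. The forest, output, host, and credential values shown here are hypothetical placeholders; complete procedures appear later in this chapter.

# Extract archived documents as flat files; no MarkLogic instance is needed.
$ mlcp.sh extract -mode local \
    -input_file_path /archive/Forests/example \
    -output_file_path /space/extracted/files

# Import archived documents into another database as live documents;
# only a destination instance is needed.
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_type forest \
    -input_file_path /archive/Forests/example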

You will likely get the best performance out of these operations if you use mlcp in distributed mode and already use HDFS for forest storage. Otherwise, the client where you execute mlcp can become a bottleneck due to resource limitations.

Since Direct Access bypasses the active data management performed by MarkLogic Server, you should not use it on forests receiving document updates. Additional restrictions apply. For details, see Limitations of Direct Access.

Limitations of Direct Access

You should only use Direct Access on a forest that meets one of the following criteria:

  • The forest is offline.
  • The forest is read-only.

The following additional limitations apply to using Direct Access:

  • Accessing documents with Direct Access bypasses security roles and privileges. The content is protected only by the filesystem permissions on the forest data.
  • Direct Access cannot take advantage of indexing or caching when accessing documents. Every document in each participating forest is read, even when you use filtering criteria such as -directory_filter or -type_filter. Filtering can only be applied after reading a document off disk.
  • Direct Access skips property fragments.
  • Direct Access skips documents partitioned into multiple fragments. For details, see Fragments in the Administrator's Guide.
  • Older versions of mlcp might not be able to read forest data from MarkLogic 9 or later. For best results, use the version of mlcp that corresponds to your MarkLogic version.

When you use Direct Access, mlcp skips any forest (or a stand within a forest) that is receiving updates or that is in an error state. Processing continues even when some documents are skipped.

When you use mlcp with Direct Access, your forest data must be reachable from the host(s) processing the input. In distributed mode, the forests must be reachable from the nodes in your Hadoop cluster. In local mode, the forests must be reachable from the host on which you execute mlcp.

If mlcp accesses large or external binaries with Direct Access, then the reachability requirement also applies to the large data directory and any external binary directories. Furthermore, these directories must be reachable along the same path as when the forest was online.

For example, if a forest was configured to use hdfs://my/large/data as its large data directory when it was live and the forest contains a large binary document, then the path hdfs://my/large/data must be resolvable from your Hadoop cluster (distributed mode) or from the mlcp client host (local mode). Similarly, if a forest contains an external binary document inserted into the database with the path /my/external-images/huge.jpg, then /my/external-images/huge.jpg must be reachable.

Choosing Between Export and Extract

You can use the export and extract commands to save content in a MarkLogic database to files on the native file system or HDFS. You should usually use export rather than extract. The extract command is best suited for archive data in offline or read-only forests. Otherwise, use the export command.

The extract command places no load on MarkLogic Server. The export command offloads most of the work to your MarkLogic cluster. Thus, export honors document permissions, takes advantage of database indexes, and can apply transformations and filtering at the server. By contrast, extract bypasses security (other than file permissions on the forest files), must access all documents sequentially, and applies a limited set of filters on the client.

The export command offers a richer set of filtering options than extract. In addition, export only accesses the documents selected by your options, while extract must scan the entirety of each input forest, even when extracting selected documents.
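To illustrate the difference, the following commands both save the documents in the database directory /plays/ to files. The export form connects to a running MarkLogic Server instance, while the extract form reads the forest files directly. Host, credentials, and paths are hypothetical.

# Export: requires a running MarkLogic instance; filtering happens server-side.
$ mlcp.sh export -host localhost -port 8000 -username user \
    -password password -directory_filter /plays/ \
    -output_file_path /space/exported/files

# Extract: reads forest files directly; every document is scanned,
# and filtering happens on the client.
$ mlcp.sh extract -mode local \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -directory_filter /plays/ \
    -output_file_path /space/extracted/files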

For more information, see the following topics:

  • Extracting Documents as Files
  • Importing Documents from a Forest into a Database

Extracting Documents as Files

Use the mlcp extract command to extract documents from archival forest files to files on the native filesystem or HDFS. For example, you can extract an XML document as a text file containing XML, or a binary document as a JPG image.

To extract documents from a forest as files:

  1. Set -input_file_path to the path to the input forest directory. Specify multiple forests using a comma-separated list of paths.
  2. Select the documents to extract. For details, see Filtering Forest Contents.
    • To select documents in one or more collections, set -collection_filter to a comma-separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma-separated list of directory URIs.
    • To select documents by document type, set -type_filter to a comma-separated list of document types.
    • To select all documents in the database, leave -collection_filter, -directory_filter, and -type_filter unset.
  3. Set -output_file_path to the destination file or directory on the native filesystem or HDFS. This directory must not already exist.
  4. Set -mode to local or distributed:
    • If Hadoop is available and you want to distribute the workload across a Hadoop cluster, set -mode to distributed. Your input forests must be reachable across your Hadoop cluster.
    • If Hadoop is not installed or you want mlcp to perform the work locally, set -mode to local. Your input forests must be reachable from the host where you execute mlcp.
  5. If you want to save the extracted documents in compressed files, set -compress to true.

    If you are loading from the native filesystem in distributed mode or from HDFS in local mode, you might need to qualify the input file path with a URI scheme of file: or hdfs:. See Understanding Input File Path Resolution.

Filtering options can be combined. Directory names specified with -directory_filter should end with '/'. All filters are applied on the client (or Hadoop task nodes in distributed mode), so every document is accessed, even if it is filtered out of the output document set.
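For instance, filters can be combined so that only XML documents in the /plays/ database directory are extracted. The forest and output paths below are hypothetical, and every document in the forest is still read before the filters are applied.

$ mlcp.sh extract -mode local \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_file_path /space/extracted/plays-xml \
    -directory_filter /plays/ \
    -type_filter xml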

Document URIs are URI-decoded before filesystem directories or filenames are constructed for them. For details, see How URI Decoding Affects Output File Names.

For a full list of extract options, see Extract Command Line Options.

The following example extracts selected documents from the forest files in /var/opt/MarkLogic/Forests/example to the native filesystem directory /space/mlcp/extracted/files. The directory filter selects only the input documents in the database directory /plays.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh extract -mode local \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_file_path /space/mlcp/extracted/files \
    -directory_filter /plays/

Importing Documents from a Forest into a Database

Use the following procedure to load all the files in a native or HDFS forest directory and its sub-directories. To load selected files, see Filtering Documents Loaded From a Directory. For more details on the command line options used in this procedure, see Import Command Line Options.

  1. Set -input_file_path to the path to the input forest directory. Specify multiple forests using a comma-separated list of paths.
  2. Set -input_file_type to forest.
  3. Specify the connection information for the destination database using -host, -port, -username, and -password.
  4. Select the documents to load from the input forest. For details, see Filtering Forest Contents. Filtering options can be used together.
    • To select documents in one or more collections, set -collection_filter to a comma-separated list of collection URIs.
    • To select documents in one or more database directories, set -directory_filter to a comma-separated list of directory URIs.
    • To select documents by document type, set -type_filter to a comma-separated list of document types.
    • To select all documents in the database, leave -collection_filter, -directory_filter, and -type_filter unset.
  5. If you want to exclude some or all of the document metadata in the forests:
    • Set -copy_collections to false to exclude document collections metadata.
    • Set -copy_quality to false to exclude document quality metadata.
    • Set -copy_metadata to false to exclude key-value metadata.
  6. Set -mode:
    • If Hadoop is available and you want to distribute the workload across a Hadoop cluster, set -mode to distributed. Your input forests and the destination MarkLogic Server instance must be reachable across your Hadoop cluster.
    • If Hadoop is not installed or you want mlcp to perform the work locally, set -mode to local. (This is the default mode unless you set the HADOOP_CONF_DIR variable.) Your input forests and the destination MarkLogic Server instance must be reachable from the host where you run mlcp.

      If you are loading from the native filesystem in distributed mode or from HDFS in local mode, you might need to qualify the input file path with a URI scheme of file: or hdfs:. See Understanding Input File Path Resolution.

By default, an imported document has a database URI based on the input file path. You can customize the URI using options. For details, see Controlling Database URIs During Ingestion.
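For example, you can strip the filesystem prefix from the generated URIs with -output_uri_replace, which takes a regular expression and a single-quoted replacement string. The pattern and paths shown here are illustrative.

$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_type forest \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_uri_replace "/var/opt/MarkLogic/Forests/example,''"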

The following example command loads the documents in the forests in /var/opt/MarkLogic/Forests/example:

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_type forest \
    -input_file_path /var/opt/MarkLogic/Forests/example

Extract Command Line Options

This section summarizes the command line options available with the mlcp extract command. An extract command requires the -input_file_path and -output_file_path options. That is, an extract command has the following form:

mlcp.sh extract -input_file_path forest-path \
    -output_file_path dest-path ...

The following table lists command line options that define the characteristics of the extraction:

Option | Description
-collection_filter comma-list
A comma-separated list of collection URIs. mlcp extracts only documents in these collections. This option can be combined with other filter options. Default: All documents.
-compress boolean
Whether or not to compress the output. mlcp might generate multiple compressed files. Default: false.
-conf filename
Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-D property=value
Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options.
-directory_filter comma-list
A comma-separated list of database directory names. mlcp extracts only documents from these directories, plus related metadata. Directory names should usually end with '/'. This option can be combined with other filter options. Default: All documents and related metadata.
-hadoop_conf_dir string
When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode.
-max_split_size number
The maximum number of document fragments processed per split. Default: 50000.
-mode string
Extraction mode. Accepted values: distributed, local. Distributed mode requires Hadoop. Default: local, unless you set the HADOOP_CONF_DIR variable; for details, see Configuring Distributed Mode.
-options_file string
Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.
-output_file_path string
Destination directory where the documents are saved. The directory must not already exist.
-thread_count number
The number of threads to spawn for concurrent extraction. The total number of threads spawned by the process can be larger than this number. Only available in local mode. Default: 4.
-type_filter comma-list
A comma-separated list of document types. mlcp extracts only documents with these types. This option can be combined with other filter options. Allowed document types: xml, text, binary. Default: All documents.
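These options can also be collected in an options file and referenced with -options_file. The sketch below assumes the one-option-or-value-per-line layout described in Options File Syntax; the file name and paths are hypothetical.

$ cat extract-options.txt
extract
-mode
local
-input_file_path
/var/opt/MarkLogic/Forests/example
-output_file_path
/space/mlcp/extracted/files

$ mlcp.sh -options_file extract-options.txt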
