Direct Access enables you to bypass MarkLogic Server and extract documents from a database by reading them directly from the on-disk representation of a forest. This feature is best suited for accessing documents in archived, offline forests.
Direct Access enables you to extract documents directly from an offline or read-only forest without going through MarkLogic Server. A forest is the internal representation of a collection of documents in a MarkLogic database; for details, see Understanding Forests in the Administrator's Guide. A database can span multiple forests on multiple hosts.
Direct Access is primarily intended for accessing archived data that is part of a tiered storage deployment; for details, see Tiered Storage in the Administrator's Guide. You should only use Direct Access on a forest that is offline or read-only; for details, see Limitations of Direct Access.
For example, if you have data that ages out of active use but must still be retained, and you do not need it available for real-time queries through MarkLogic Server, you can archive the data by taking the containing forests offline, yet still access the contents using Direct Access.
Use Direct Access with mlcp to access documents in offline and read-only forests in the following ways:

- Use the `extract` command to extract archived documents from a database as flat files. This operation is similar to exporting documents from a database to files, but does not require a source MarkLogic Server instance. For details, see Choosing Between Export and Extract.
- Use the `import` command with `-input_file_type forest` to import archived documents into another database as live documents. A destination MarkLogic Server instance is required, but no source instance.

You will likely get the best performance out of these operations if you use mlcp in distributed mode and already use HDFS for forest storage. Otherwise, the client where you execute mlcp can become a bottleneck due to resource limitations.
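In outline, the two invocations look like the following skeletons; the paths, host, and credentials are placeholders rather than values from this guide:

```
# Extract archived documents from forest files as flat files
# (no MarkLogic Server instance involved)
$ mlcp.sh extract -input_file_path /path/to/forest \
    -output_file_path /path/to/output

# Import archived documents from forest files into a live database
# (destination instance required, no source instance)
$ mlcp.sh import -host somehost -port 8000 \
    -username user -password password \
    -input_file_type forest -input_file_path /path/to/forest
```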
Since Direct Access bypasses the active data management performed by MarkLogic Server, you should not use it on forests receiving document updates. Additional restrictions apply. For details, see Limitations of Direct Access.
You should only use Direct Access on a forest that meets one of the following criteria:

- The forest is offline.
- The `updates-allowed` state of the forest is `read-only`. For details, see Setting the Updates-allowed State on Partitions in the Administrator's Guide.

The following additional limitations apply to using Direct Access:

- Document filtering is limited to options such as `-directory_filter` and `-type_filter`. Filtering can only be applied after reading a document off disk.
- mlcp skips any forest (or a stand within a forest) that is receiving updates or that is in an error state. Processing continues even when some documents are skipped.
When you use mlcp with Direct Access, your forest data must be reachable from the host(s) processing the input. In distributed mode, the forests must be reachable from the nodes in your Hadoop cluster. In local mode, the forests must be reachable from the host on which you execute mlcp.
If mlcp accesses large or external binaries with Direct Access, then the reachability requirement also applies to the large data directory and any external binary directories. Furthermore, these directories must be reachable along the same path as when the forest was online.
For example, if a forest was configured to use `hdfs://my/large/data` as its large data directory when it was live and the forest contains a large binary document, then the path `hdfs://my/large/data` must be resolvable from your Hadoop cluster (distributed mode) or from the mlcp client host (local mode). Similarly, if a forest contains an external binary document inserted into the database with the path `/my/external-images/huge.jpg`, then `/my/external-images/huge.jpg` must be reachable.
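As an informal sanity check before running mlcp, you can confirm these paths resolve from the host that will do the work. The commands below use standard filesystem tools rather than anything mlcp-specific, and reuse the paths from the example above:

```
# Confirm the large data directory resolves (requires the Hadoop CLI)
$ hadoop fs -ls hdfs://my/large/data

# Confirm the external binary path is visible on the local filesystem
$ ls -l /my/external-images/huge.jpg
```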
You can use the `export` and `extract` commands to save content in a MarkLogic database to files on the native file system or HDFS. You should usually use `export` rather than `extract`. The `extract` command is best suited for archived data in offline or read-only forests; otherwise, use the `export` command.
The `extract` command places no load on MarkLogic Server, while the `export` command offloads most of the work to your MarkLogic cluster. Thus, `export` honors document permissions, takes advantage of database indexes, and can apply transformations and filtering on the server. By contrast, `extract` bypasses security (other than file permissions on the forest files), must access all documents sequentially, and applies a limited set of filters on the client.

The `export` command also offers a richer set of filtering options than `extract`. In addition, `export` only accesses the documents selected by your options, while `extract` must scan the entirety of each input forest, even when extracting only a subset of documents.
Use the mlcp `extract` command to extract documents from archival forest files to files on the native filesystem or HDFS. For example, you can extract an XML document as a text file containing XML, or a binary document as a JPG image.
To extract documents from a forest as files:

1. Set `-input_file_path` to the path to the input forest directory(s). Specify multiple forests using a comma-separated list of paths.
2. If you want to extract only a subset of the documents, set filtering options:
   - To select documents in specific collections, set `-collection_filter` to a comma-separated list of collection URIs.
   - To select documents in specific database directories, set `-directory_filter` to a comma-separated list of directory URIs.
   - To select documents by type, set `-type_filter` to a comma-separated list of document types.
   - To select all documents, leave `-collection_filter`, `-directory_filter`, and `-type_filter` unset.
3. Set `-output_file_path` to the destination file or directory on the native filesystem or HDFS. This directory must not already exist.
4. Set `-mode` to `local` or `distributed`:
   - To distribute the work across a Hadoop cluster, set `-mode` to `distributed`. Your input forests must be reachable across your Hadoop cluster.
   - To run mlcp on a single host, set `-mode` to `local`. Your input forests must be reachable from the host where you execute mlcp.
5. If you want compressed output, set `-compress` to `true`.

If you are loading from the native filesystem in distributed mode or from HDFS in local mode, you might need to qualify the input file path with a URI scheme of `file:` or `hdfs:`. See Understanding Input File Path Resolution.
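For instance, a sketch of a scheme-qualified input path for local mode with forest data in HDFS; the `namenode` host and forest path are placeholders:

```
# Local mode reading forest data stored in HDFS:
# qualify the input path with the hdfs: scheme
$ mlcp.sh extract -mode local \
    -input_file_path hdfs://namenode/forests/example \
    -output_file_path /space/mlcp/extracted/files
```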
Filtering options can be combined. Directory names specified with `-directory_filter` should end with `/`. All filters are applied on the client (or on Hadoop task nodes in distributed mode), so every document is accessed, even if it is filtered out of the output document set.
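For example, the following sketch combines a collection filter with a type filter to extract only XML documents from two collections; the forest path and collection URIs are illustrative:

```
$ mlcp.sh extract -mode local \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_file_path /space/mlcp/extracted/xml-only \
    -collection_filter plays,poems \
    -type_filter xml
```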
Document URIs are URI-decoded before filesystem directories or filenames are constructed for them. For details, see How URI Decoding Affects Output File Names.
For a full list of `extract` options, see Extract Command Line Options.
The following example extracts selected documents from the forest files in `/var/opt/MarkLogic/Forests/example` to the native filesystem directory `/space/mlcp/extracted/files`. The directory filter selects only the input documents in the database directory `/plays/`.
```
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh extract -mode local \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_file_path /space/mlcp/extracted/files \
    -directory_filter /plays/
```
Use the following procedure to load all the files in a native or HDFS forest directory and its sub-directories. To load selected files, see Filtering Documents Loaded From a Directory. For more details on the command line options used in this procedure, see Import Command Line Options.
1. Set `-input_file_path` to the path to the input forest directory(s). Specify multiple forests using a comma-separated list of paths.
2. Set `-input_file_type` to `forest`.
3. Specify the connection to the destination MarkLogic Server instance with `-host`, `-port`, `-username`, and `-password`.
4. If you want to import only a subset of the documents, set filtering options:
   - To select documents in specific collections, set `-collection_filter` to a comma-separated list of collection URIs.
   - To select documents in specific database directories, set `-directory_filter` to a comma-separated list of directory URIs.
   - To select documents by type, set `-type_filter` to a comma-separated list of document types.
   - To select all documents, leave `-collection_filter`, `-directory_filter`, and `-type_filter` unset.
5. Set `-mode`:
   - To distribute the work across a Hadoop cluster, set `-mode` to `distributed`. Your input forests and the destination MarkLogic Server instance must be reachable across your Hadoop cluster.
   - To run mlcp on a single host, set `-mode` to `local`. (This is the default mode unless you set the `HADOOP_CONF_DIR` variable.) Your input forests and the destination MarkLogic Server instance must be reachable from the host where you run mlcp.

If you are loading from the native filesystem in distributed mode or from HDFS in local mode, you might need to qualify the input file path with a URI scheme of `file:` or `hdfs:`. See Understanding Input File Path Resolution.
By default, an imported document has a database URI based on the input file path. You can customize the URI using options. For details, see Controlling Database URIs During Ingestion.
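For instance, one common adjustment is a regex-based rewrite with `-output_uri_replace`. The sketch below strips the local forest path prefix from the generated URIs; treat the pattern as illustrative and see Controlling Database URIs During Ingestion for the authoritative option syntax:

```
# Strip the local forest path prefix from the generated database URIs
# (illustrative pattern; verify against your input before relying on it)
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_type forest \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_uri_replace "/var/opt/MarkLogic/Forests/example,''"
```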
The following example command loads the documents in the forests in `/var/opt/MarkLogic/Forests/example`:
```
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_type forest \
    -input_file_path /var/opt/MarkLogic/Forests/example
```
This section summarizes the command line options available with the mlcp `extract` command. An extract command requires the `-input_file_path` and `-output_file_path` options. That is, an extract command has the following form:
```
mlcp.sh extract -input_file_path forest-path \
    -output_file_path dest-path ...
```
The following table lists command line options that define the characteristics of the extraction:
| Option | Description |
|---|---|
| `-collection_filter` *comma-list* | A comma-separated list of collection URIs. mlcp extracts only documents in these collections. This option can be combined with other filter options. Default: All documents. |
| `-compress` *boolean* | Whether or not to compress the output. mlcp might generate multiple compressed files. Default: `false`. |
| `-conf` *filename* | Pass extra settings to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options. |
| `-D` *property=value* | Pass a configuration property setting to Hadoop when using distributed mode. For details, see Setting Custom Hadoop Options and Properties. This option must appear before mlcp-specific options. |
| `-directory_filter` *comma-list* | A comma-separated list of database directory names. mlcp extracts only documents from these directories, plus related metadata. Directory names should usually end with `/`. This option can be combined with other filter options. Default: All documents and related metadata. |
| `-hadoop_conf_dir` *string* | When using distributed mode, the Hadoop config directory. For details, see Configuring Distributed Mode. |
| `-max_split_size` *number* | The maximum number of document fragments processed per split. Default: 50000. |
| `-mode` *string* | Extraction mode. Accepted values: `distributed`, `local`. Distributed mode requires Hadoop. Default: `local`, unless you set the `HADOOP_CONF_DIR` variable; for details, see Configuring Distributed Mode. |
| `-options_file` *string* | Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax. |
| `-output_file_path` *string* | Destination directory where the documents are saved. The directory must not already exist. |
| `-thread_count` *number* | The number of threads to spawn for concurrent extraction. The total number of threads spawned by the process can be larger than this number. Only available in local mode. Default: 4. |
| `-type_filter` *comma-list* | A comma-separated list of document types. mlcp extracts only documents with these types. This option can be combined with other filter options. Allowed document types: `xml`, `text`, `binary`. Default: All documents. |
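Putting several of these options together, a hedged sketch of a compressed, tuned extraction; the forest path, thread count, and split size are illustrative values, not recommendations:

```
$ mlcp.sh extract -mode local \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_file_path /space/mlcp/extracted/archive \
    -compress true \
    -thread_count 8 \
    -max_split_size 20000
```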