Using MarkLogic Content Pump (mlcp)

Extracting Documents as Files

Use the mlcp extract command to extract documents from archival forest files to files on the native filesystem. For example, you can extract an XML document as a text file containing XML, or a binary document as a JPG image.

To extract documents from a forest as files:

  1. Set -input_file_path to the path of the input forest directory. To specify multiple forests, use a comma-separated list of paths.

  2. Select the documents to extract. For details, see Filtering Forest Contents.

    • To select documents in one or more collections, set -collection_filter to a comma-separated list of collection URIs.

    • To select documents in one or more database directories, set -directory_filter to a comma-separated list of directory URIs.

    • To select documents by document type, set -type_filter to a comma-separated list of document types.

    • To select all documents in the database, leave -collection_filter, -directory_filter, and -type_filter unset.

  3. Set -output_file_path to the destination file or directory on the native filesystem. This directory must not already exist.

  4. Set -mode to local. Your input forests must be reachable from the host where you execute mlcp.

  5. To write the extracted documents to compressed files, set -compress to true.

Filtering options can be combined. Directory names specified with -directory_filter should end with “/”. All filters are applied on the client, so every document is accessed, even if it is filtered out of the output document set.
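As an illustration of how the options above combine, the following sketch assembles an extract command from optional filters and prints it rather than running it. The helper names normalize_dirs and build_extract_cmd are hypothetical and not part of mlcp, and the /poems/ directory is a placeholder. The sketch assumes that paths and URIs contain no spaces. The normalize_dirs helper appends the trailing "/" that -directory_filter expects.

```shell
#!/bin/sh
# Sketch only: prints an mlcp extract command instead of running it.
# normalize_dirs and build_extract_cmd are hypothetical helpers, not part of mlcp.

# Append "/" to each comma-separated directory URI that lacks one.
normalize_dirs() {
    result=""
    old_ifs=$IFS
    IFS=,
    for d in $1; do
        case "$d" in
            */) ;;
            *) d="$d/" ;;
        esac
        result="${result:+$result,}$d"
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

# Usage: build_extract_cmd INPUT_PATHS OUTPUT_PATH [COLLECTIONS] [DIRECTORIES] [TYPES]
# A filter option is added only when its argument is non-empty; leaving all
# three empty selects every document.
build_extract_cmd() {
    cmd="mlcp.sh extract -mode local -input_file_path $1 -output_file_path $2"
    if [ -n "${3:-}" ]; then cmd="$cmd -collection_filter $3"; fi
    if [ -n "${4:-}" ]; then cmd="$cmd -directory_filter $(normalize_dirs "$4")"; fi
    if [ -n "${5:-}" ]; then cmd="$cmd -type_filter $5"; fi
    printf '%s\n' "$cmd"
}

# Two directory filters, one missing its trailing slash.
build_extract_cmd /var/opt/MarkLogic/Forests/example \
    /space/mlcp/extracted/files "" /plays,/poems/
```

The final call prints a single command line whose -directory_filter value is /plays/,/poems/. Because filters are applied on the client, adding more filters narrows the output set but does not reduce how many documents mlcp reads.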

Note

Document URIs are URI-decoded before filesystem directories or filenames are constructed for them. For details, see How URI Decoding Affects Output File Names.

For a full list of extract options, see Extract Command Line Options.

The following example extracts selected documents from the forest files in /var/opt/MarkLogic/Forests/example to the native filesystem directory /space/mlcp/extracted/files. The directory filter selects only the input documents in the database directory /plays.

# Windows users, see Modifying the Example Commands for Windows 
$ mlcp.sh extract -mode local \
    -input_file_path /var/opt/MarkLogic/Forests/example \
    -output_file_path /space/mlcp/extracted/files \
    -directory_filter /plays/
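
Because the output directory must not already exist and the input forests must be reachable from the local host, it can be convenient to check both before launching a long extraction. The following is a minimal sketch of such a pre-flight check. The check_extract_paths helper is hypothetical and not part of mlcp, and the paths are the ones used in the example above.

```shell
#!/bin/sh
# Sketch only: pre-flight checks before running mlcp extract.
# check_extract_paths is a hypothetical helper, not part of mlcp.

# Usage: check_extract_paths INPUT_PATHS OUTPUT_PATH
# Returns 0 when every comma-separated input path is an existing directory
# and the output path does not exist yet; otherwise reports why and returns 1.
check_extract_paths() {
    status=0
    old_ifs=$IFS
    IFS=,
    for forest in $1; do
        if [ ! -d "$forest" ]; then
            echo "input forest directory not found: $forest" >&2
            status=1
        fi
    done
    IFS=$old_ifs
    if [ -e "$2" ]; then
        echo "output path already exists: $2" >&2
        status=1
    fi
    return "$status"
}

forests=/var/opt/MarkLogic/Forests/example
dest=/space/mlcp/extracted/files

# Run the extraction only when both checks pass.
if check_extract_paths "$forests" "$dest"; then
    mlcp.sh extract -mode local \
        -input_file_path "$forests" \
        -output_file_path "$dest" \
        -directory_filter /plays/
fi
```

The check only confirms that the paths look usable from the local host; it does not validate that the input directories contain readable forest data.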