Using MarkLogic Content Pump (mlcp)

Loading Documents From Compressed Files

You can load content from one or more compressed files. Filtering of compressed file content is not supported; mlcp loads all documents in a compressed file.

Follow this procedure to load content from one or more ZIP or GZIP compressed files.

  1. Set -input_file_path:

    • To load from a single file, set -input_file_path to the path to the compressed file.

    • To load from multiple files, set -input_file_path to a directory containing the compressed files.

  2. If the content type cannot be accurately deduced from the suffixes of the files inside the compressed file, as described in How mlcp Determines Document Type, set -document_type appropriately.

  3. Set -input_compressed to true.

  4. If the compressed file suffix is not “.zip” or “.gzip”, specify the compressed file format by setting -input_compression_codec to zip or gzip.

If you set -document_type to anything but mixed, then the contents of the compressed file must be homogeneous: for example, all XML, all JSON, or all binary.
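The suffix rule in step 4 can be sketched as a small shell helper. This is only an illustration: codec_option is a hypothetical function, not part of mlcp, and it assumes the suffix comparison is case-sensitive, which the text above does not state.

```shell
# codec_option is a hypothetical helper, not part of mlcp.
# It prints the extra option called for by step 4, or nothing when the
# file suffix already identifies the compression format.
# Assumption: the suffix match is case-sensitive.
codec_option() {
  file="$1"    # path to the compressed file
  codec="$2"   # zip or gzip, chosen by the caller
  case "$file" in
    *.zip|*.gzip) ;;   # suffix is enough; no extra option needed
    *) printf '%s\n' "-input_compression_codec $codec" ;;
  esac
}

codec_option /space/images.zip zip     # prints nothing
codec_option /space/example.jar zip    # prints: -input_compression_codec zip
```

The printed text could then be appended to an mlcp.sh import command line alongside -input_compressed true.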

The following example command loads binary documents from the compressed file /space/images.zip on the local filesystem.

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -document_type binary \
    -input_file_path /space/images.zip -input_compressed true

The following example loads all the files in the compressed file /space/example.jar. Because the “.jar” suffix does not identify the compression format, the command uses -input_compression_codec to tell mlcp that the file is in ZIP format:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -mode local -input_file_path /space/example.jar \
    -input_compressed true -input_compression_codec zip

If -input_file_path is a directory, mlcp loads contents from all compressed files in the input directory, recursing through subdirectories. The input directory must not contain other kinds of files.
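Because the input directory must contain only compressed files, it can help to check the directory before starting the import. The following is a minimal sketch under stated assumptions: list_unexpected_files and load_compressed_dir are hypothetical helpers, not part of mlcp, and the check treats only “.zip” and “.gzip” suffixes as compressed files, so archives with other suffixes (such as “.jar”) would be reported.

```shell
# Hypothetical helpers, not part of mlcp.

# Print every regular file under the directory, at any depth, whose
# suffix is neither .zip nor .gzip (assumption: suffix identifies format).
list_unexpected_files() {
  find "$1" -type f ! -name '*.zip' ! -name '*.gzip'
}

# Run the import only when the check finds nothing unexpected.
# Connection options are copied from the examples above.
load_compressed_dir() {
  dir="$1"
  extras=$(list_unexpected_files "$dir")
  if [ -n "$extras" ]; then
    echo "Not loading $dir; unexpected files found:" >&2
    echo "$extras" >&2
    return 1
  fi
  mlcp.sh import -host localhost -port 8000 -username user \
      -password password -mode local \
      -input_file_path "$dir" -input_compressed true
}
```

For example, load_compressed_dir /space/archives would run the import only if every file under the hypothetical directory /space/archives has one of the two suffixes.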

By default, the URI prefix on documents loaded from a compressed file includes the full path to the input compressed file and mirrors the directory hierarchy inside the compressed file. For example, if the ZIP file /space/shakespeare.zip contains bill/data/dream.xml, then the ingested document URI is /space/shakespeare.zip/bill/data/dream.xml. To override this behavior, see Controlling Database URIs During Ingestion.
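The default URI rule can be previewed before loading with a short sketch. preview_uris is a hypothetical helper, not part of mlcp; it only joins the compressed file path and each entry name with a slash, as in the example above, and does not account for any URI-related options.

```shell
# preview_uris is a hypothetical helper, not part of mlcp.
# It reads entry names on standard input, one per line, and prints the
# default document URI for each: <compressed file path>/<entry name>.
preview_uris() {
  archive="$1"
  while IFS= read -r entry; do
    printf '%s/%s\n' "$archive" "$entry"
  done
}

printf '%s\n' bill/data/dream.xml | preview_uris /space/shakespeare.zip
# prints: /space/shakespeare.zip/bill/data/dream.xml
```

If Info-ZIP unzip is installed, the entry names for a real ZIP file could be supplied with unzip -Z1 /space/shakespeare.zip; note that such a listing may also include directory entries, which do not correspond to documents.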