Loading Documents from Compressed Files
You can load content from one or more compressed files. Filtering of compressed file content is not supported; mlcp loads all documents in a compressed file.
Follow this procedure to load content from one or more ZIP or GZIP compressed files.
1. Set -input_file_path:
   * To load from a single file, set -input_file_path to the path to the compressed file.
   * To load from multiple files, set -input_file_path to a directory containing the compressed files.
2. If the content type cannot be accurately deduced from suffixes of the files inside the compressed file as described in How mlcp Determines Document Type, set -document_type appropriately.
3. Set -input_compressed to true.
4. If the compressed file suffix is not “.zip” or “.gzip”, specify the compressed file format by setting -input_compression_codec to zip or gzip.
If you set -document_type to anything but mixed, then the contents of the compressed file must be homogeneous: for example, all XML, all JSON, or all binary.
The following example command loads binary documents from the compressed file /space/images.zip on the local filesystem.
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password password -mode local -document_type binary \
    -input_file_path /space/images.zip -input_compressed
The following example loads all the files in the compressed file /space/example.jar, using -input_compression_codec to tell mlcp the compression format because of the “.jar” suffix:
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -host localhost -port 8000 -username user \
    -password passwd -mode local -input_file_path /space/example.jar \
    -input_compressed true -input_compression_codec zip
If -input_file_path is a directory, mlcp loads contents from all compressed files in the input directory, recursing through subdirectories. The input directory must not contain other kinds of files.
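Because the input directory must hold only compressed files, it can help to check the directory before starting an import. The following is a minimal sketch of such a check, not part of mlcp: the function name list_unexpected_files is hypothetical, and the sketch assumes the compressed files use the “.zip” or “.gzip” suffix. Adjust the patterns if you rely on -input_compression_codec for other suffixes.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not an mlcp feature).
# Prints every regular file under the directory "$1", including files in
# subdirectories, whose name does not end in .zip or .gzip.
# The suffix list is an assumption; extend it to match your own files.
list_unexpected_files() {
    find "$1" -type f ! -name '*.zip' ! -name '*.gzip'
}
```

If the function prints nothing for your input directory, every file under it carries one of the assumed suffixes. Any path it prints is a candidate to move out of the directory before you run mlcp.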
By default, the URI prefix on documents loaded from a compressed file includes the full path to the input compressed file and mirrors the directory hierarchy inside the compressed file. For example, if a ZIP file /space/shakespeare.zip contains bill/data/dream.xml then the ingested document URI is /space/shakespeare.zip/bill/data/dream.xml. To override this behavior, see Controlling Database URIs During Ingestion.
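The default URI rule described here can be sketched as a simple path join. The helper below is illustrative only: default_uri is a hypothetical name, and the function models the documented default rather than calling mlcp.

```shell
#!/bin/sh
# Illustrative helper (not an mlcp feature): models the default URI as
# "<path to compressed file>/<path of the entry inside it>".
#   $1 = path to the compressed file as given to -input_file_path
#   $2 = path of a document inside the compressed file
default_uri() {
    printf '%s/%s\n' "$1" "$2"
}

default_uri /space/shakespeare.zip bill/data/dream.xml
# prints /space/shakespeare.zip/bill/data/dream.xml
```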