Using MarkLogic Content Pump (mlcp)

Import Command Line Options

This section summarizes the command line options available with the mlcp import command. The following command line options define your connection to MarkLogic:

Option

Description

-host comma-list

Required. A comma-separated list of hosts through which mlcp can connect to the destination MarkLogic Server. You must specify at least one host. For more details, see How mlcp Uses the Host List.

-password string

Password for the MarkLogic Server user specified with -username. Required, unless using Kerberos authentication.

-port number

Port number of the destination MarkLogic Server. There should be an XDBC App Server on this port. Default: 8000.

-username string

MarkLogic Server user with which to import documents. Required, unless using Kerberos authentication.
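As a rough illustration of how the connection options above fit together, the following sketch assembles an import command and prints it for review. The host names, user name, password, and input path are hypothetical placeholders, and mlcp.sh is assumed to be on your PATH.

```shell
#!/bin/sh
# Hypothetical values: replace hosts, port, and credentials with your own.
ML_HOSTS="ml-host1.example.com,ml-host2.example.com"
ML_PORT=8000
ML_USER="import-user"
ML_PASS="${ML_PASS:-changeme}"

# Assemble the connection options described in the table above,
# plus a hypothetical input location.
set -- import \
  -host "$ML_HOSTS" \
  -port "$ML_PORT" \
  -username "$ML_USER" \
  -password "$ML_PASS" \
  -input_file_path /data/input

# Print the command for review; remove the leading "echo" to run it.
echo mlcp.sh "$@"
```

Listing more than one host in -host gives mlcp alternative entry points to the destination cluster, as described in How mlcp Uses the Host List.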

The following table lists command line options that define the characteristics of the import operation:

Option

Description

-aggregate_record_element string

When splitting an aggregate input file into multiple documents, the name of the element to use as the output document root. Default: The first child element under the root element.

-aggregate_record_namespace string

The namespace of the element specified by -aggregate_record_element. Default: No namespace.

-aggregate_uri_id string

Deprecated. Use -uri_id instead.

When splitting an aggregate input file into multiple documents, the element or attribute name within the document root to use as the document URI. Default: In local mode, hashcode-seqnum, where the hashcode is derived from the split number; in distributed mode, taskid-seqnum.

-api_key string [v11.1.0 and up]

User API key, unique to each MarkLogic Cloud user, used to obtain a session token. Required along with -base_path when connecting to MarkLogic Cloud. See Connecting mlcp to MarkLogic Cloud.

-archive_metadata_optional boolean

When importing documents from a database archive, whether or not to ignore missing metadata files. If this is false and the archive contains no metadata, an error occurs. Default: false.

-base_path string [v11.1.0 and up]

A base URL that maps to a port on the destination MarkLogic server when connecting through a reverse proxy.

-batch_size number

The number of documents to process in a single request to MarkLogic Server. Default: 100. Maximum: 200.

-collection_filter comma-list

A comma-separated list of collection URIs. Only usable with -input_file_type forest. mlcp extracts only documents in these collections. This option can be combined with other filter options. Default: Import all documents.

-content_encoding string

The character encoding of input documents when -input_file_type is documents, aggregates, delimited_text, or rdf. The option value must be a character set name accepted by your JVM; see java.nio.charset.Charset. Default: UTF-8. Set to system to use the platform default encoding for the host on which mlcp runs.

-copy_collections boolean

When importing documents from an archive, whether to copy document collections from the source archive to the destination. Only applies when -input_file_type is archive or forest. Default: true.

-copy_metadata boolean

When importing documents from an archive, whether to copy document key-value metadata from the source archive to the destination. Only applies when -input_file_type is archive or forest. Default: true.

-copy_permissions boolean

When importing documents from an archive, whether to copy document permissions from the source archive to the destination. Only applies with -input_file_type archive. Default: true.

-copy_properties boolean

When importing documents from an archive, whether to copy document properties from the source archive to the destination. Only applies with -input_file_type archive. Default: true.

-copy_quality boolean

When importing documents from an archive, whether to copy document quality from the source archive to the destination. Only applies when -input_file_type is archive or forest. Default: true.

-database string

The name of the destination database. Default: The database associated with the destination App Server identified by -host and -port.

-data_type comma-list

When importing content with -input_file_type delimited_text and -document_type json, use this option to specify the data type (string, number, or boolean) to give to specific fields. The option value must be a comma-separated list of name,datatype pairs, such as “a,number,b,boolean”. Default: All fields have string type. For details, see Controlling Data Type in JSON Output.

-delimited_root_name string

When importing content with -input_file_type delimited_text, the local name of the document root element. Default: root.

-delimited_uri_id string

Deprecated. Use -uri_id instead.

When importing content with -input_file_type delimited_text, the column name that contributes to the id portion of the URI for inserted documents. Default: The first column.

-delimiter character

When importing content with -input_file_type delimited_text, the delimiting character. Default: comma (,).

-directory_filter comma-list

A comma-separated list of database directory names. Only usable with -input_file_type forest. mlcp extracts only documents from these directories, plus related metadata. Directory names should usually end with “/”. This option can be combined with other filter options. Default: Import all documents.

-document_type string

The type of document to create when -input_file_type is documents, sequencefile, or delimited_text. Accepted values: mixed (documents only), xml, json, text, binary. Default: mixed for documents, xml for sequencefile, and xml for delimited_text.

-fastload boolean

Whether or not to force optimal performance, even at the risk of creating duplicate document URIs. See Time vs. Correctness: Understanding -fastload Tradeoffs. Default: false.

-filename_as_collection boolean

Add each loaded document to a collection corresponding to the name of the input file. You cannot use this option when -input_file_type is rdf or forest. Useful when splitting an input file into multiple documents. If the filename contains characters not permitted in a URI, those characters are URI encoded. Default: false.

-generate_uri boolean

When importing content with -input_file_type delimited_text, or -input_file_type delimited_json, whether or not MarkLogic Server should automatically generate document URIs. Default: false for delimited_text, true for delimited_json. For details, see Default Document URI Construction.

-input_compressed boolean

Whether or not the source data is compressed. Default: false.

-input_compression_codec string

When -input_compressed is true, the codec used for compression. Accepted values: zip, gzip.

-input_file_path string

A regular expression describing the filesystem location(s) to use for input. For details, see Regular Expression Syntax.

-input_file_pattern string

Load only input files that match this regular expression from the path(s) matched by -input_file_path. For details, see Regular Expression Syntax. Default: Load all files. This option is ignored when -input_file_type is forest.

-input_file_type type

The input file type. Accepted values: aggregates, archive, delimited_text, delimited_json, documents, forest, rdf, sequencefile. Default: documents.

-keystore_password string

Password to a Java KeyStore containing the User Private Key(s) and Certificate(s). If one is available, mlcp selects the first certificate from the KeyStore that satisfies the TLS Certificate Request from the MarkLogic Server.

Can be passed along with the existing -ssl option.

-keystore_path string

Path to a Java KeyStore containing the User Private Key(s) and Certificate(s); if available mlcp will select the first available certificate from the KeyStore that satisfies the TLS Certificate Request from the MarkLogic Server.

Can be passed along with the existing -ssl option.

-max_split_size number

When importing from files, the maximum number of bytes in one input split. Default: The maximum Long value (Long.MAX_VALUE).

-max_thread_percentage

The maximum percentage (integer between 0 and 100) of available server threads used by mlcp for import jobs. Default: 100.

-max_threads

The maximum number of threads that run mlcp. This command line option is optional.

-min_split_size number

When importing from files, the minimum number of bytes in one input split. Default: 0.

-mode string

Ingestion mode. Accepted values: local.

-modules_root string

The modules root path to use when applying a server-side transformation. Default: The modules root configured for the App Server. If you also use -modules, then this path specifies the modules root for that modules database.

-modules string

Specify the name of the modules database to use when applying a server-side transformation. Accepted values: filesystem or a modules database name. Default: The modules database associated with the App Server.

-namespace string

The default namespace for all XML documents created during loading.

-options_file string

Specify an options file pathname from which to read additional command line options. If you use an options file, this option must appear first. For details, see Options File Syntax.

-output_cleandir boolean

Whether or not to delete all content in the output database directory prior to loading. Default: false.

-output_collections comma-list

A comma-separated list of collection URIs. Loaded documents are added to these collections.

-output_directory string

The destination database directory in which to create the loaded documents. If the directory exists, its contents are removed prior to ingesting new documents. Using this option enables -fastload by default, which can cause duplicate URIs to be created. See Time vs. Correctness: Understanding -fastload Tradeoffs.

-output_graph string

Only usable with -input_file_type rdf. For quad data, specifies the default graph for quads that do not include an explicit graph label. For other triple formats, specifies the graph into which to load all triples. For details, see Loading Triples.

-output_language string

The xml:lang to associate with loaded documents.

-output_override_graph string

Only usable with -input_file_type rdf. The graph into which to load all triples. For quads, overrides any graph label in the quads. For details, see Loading Triples.

-output_partition string

The name of the database partition in which to create documents. For details, see How Assignment Policy Affects Optimization, and Range Partitions or Query Partitions in Administrating MarkLogic Server.

-output_permissions comma-list

A comma separated list of (role,capability) pairs to apply to loaded documents. Default: The default permissions associated with the user inserting the document. Example: -output_permissions role1,read,role2,update

-output_quality string

The quality of loaded documents. Default: 0.

-output_uri_prefix string

Specify a prefix to prepend to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion.

-output_uri_replace comma-list

A comma-separated list of (regex,string) pairs that define string replacements to apply to the URIs of documents added to the database. The replacement strings must be enclosed in single quotes. For example, -output_uri_replace "regex1,'string1',regex2,'string2'"

-output_uri_suffix string

Specify a suffix to append to the default URI. Used to construct output document URIs. For details, see Controlling Database URIs During Ingestion.

-polling_init_delay

The initial delay (in minutes) before mlcp starts sending polling requests to check the available server threads. Default: 1.

-polling_period

The interval (in minutes) at which mlcp sends polling requests to check the currently available server threads. Default: 1.

-restrict_hosts boolean

Restrict mlcp to connect to MarkLogic only through the hosts listed in the -host option. For more details, see Restricting the Hosts That mlcp Uses to Connect to MarkLogic.

-split_input boolean

Whether or not to divide input data into logical chunks to support more concurrency. Only supported when -input_file_type is delimited_text. Default: false for local mode. Data that contains multi-byte characters must be UTF-8-encoded to use this option. For details, see Improving Throughput with -split_input.

-ssl boolean

Enable/disable SSL secured communication with MarkLogic. Default: false. If you set this option to true, your App Server must be SSL enabled. For details, see Connecting to MarkLogic Using SSL.

-ssl_protocol string

Specify the protocol that mlcp should use when creating an SSL connection to MarkLogic. You must include this option if you use the -ssl option to connect to an App Server configured to disable the MarkLogic default protocol (TLSv1.2). Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: TLSv1.2.

-streaming boolean

Whether or not to stream documents to MarkLogic Server. Applies only when -input_file_type is documents.

-temporal_collection string

The temporal collection into which the temporal documents are to be loaded. For details on loading temporal documents into MarkLogic, see Using MarkLogic Content Pump (mlcp) to Load Temporal Documents in the Temporal Developer’s Guide.

-thread_count number

The number of threads to spawn for concurrent loading.

Prior to 10.0-4.2, the default thread count was 4. mlcp now conducts initial polling to identify the available server threads on the port that handles mlcp requests, and uses this value as the default thread count. You can override it by specifying -thread_count on the command line.

-thread_count_per_split number

The maximum number of threads that can be assigned to each split.

If you specify -thread_count_per_split, each input split runs with the specified number of threads.

The total thread count, however, is controlled by the newly calculated thread count, or by -thread_count if it is specified.

-tolerate_errors boolean

NOTE: This option is deprecated, ignored, and will be removed in a future release. mlcp always behaves as if -tolerate_errors is true.

Applicable only when -batch_size is greater than 1. When this option is true and batch size is greater than 1, if an error occurs for one or more documents during loading, only the erroneous documents are skipped; all other documents are inserted into the database. When this option is false or batch size is 1, errors during insertion can cause all the inserts in the current batch to be rolled back. Default: false.

-transaction_size number

The number of requests to MarkLogic Server per transaction. Default: 1. Maximum: 4000/actualBatchSize.

-transform_function string

The local name of a custom content transformation function installed on MarkLogic Server. Ignored if -transform_module is not specified. Default: transform. For details, see Transforming Content During Ingestion.

-transform_module string

The path in the modules database or modules directory of a custom content transformation function installed on MarkLogic Server. This option is required to enable a custom transformation. For details, see Transforming Content During Ingestion.

-transform_namespace string

The namespace URI of the custom content transformation function named by -transform_function. Ignored if -transform_module is not specified. Default: no namespace. For details, see Transforming Content During Ingestion.

-transform_param string

Optional extra data to pass through to a custom transformation function. Ignored if -transform_module is not specified. Default: no additional data. For details, see Transforming Content During Ingestion.

-truststore_passwd string

Password to a Java TrustStore containing any necessary CA Certificates needed to verify the TLS Server Authentication connection. If no TrustStore is provided the default TrustStore used by the existing -ssl parameter is used.

Can be passed along with the existing -ssl option.

-truststore_path string

Path to a Java TrustStore containing any necessary CA Certificates needed to verify the TLS Server Authentication connection. If no TrustStore is provided the default TrustStore used by the existing -ssl parameter is used.

Can be passed along with the existing -ssl option.

-type_filter comma-list

A comma-separated list of document types. Only usable with -input_file_type forest. mlcp imports only documents with these types. This option can be combined with other filter options. Default: Import all documents.

-uri_id string

Specify a field, XML element name, or JSON property name to use as the basis of the output document URIs when importing delimited text, aggregate XML, or line-delimited JSON data.

With -input_file_type aggregates or -input_file_type delimited_json, the element, attribute, or property name within the document to use as the document URI. Default: None; the URI is based on the file name, as described in Default Document URI Construction.

With -input_file_type delimited_text, the column name that contributes to the id portion of the URI for inserted documents. Default: The first column.

-xml_repair_level string

The degree of repair to attempt on XML documents in order to create well-formed XML. Accepted values: default, full, none. Default: default, which depends on the configured MarkLogic Server default XQuery version: In XQuery 1.0 and 1.0-ml the default is none. In XQuery 0.9-ml the default is full.
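Several of the options in the table above are often used together, and an options file can keep such a combination readable. The following is a minimal sketch that assumes the one-token-per-line layout covered in Options File Syntax. The file name, column names (price, in_stock, sku), collection name, credentials, and input path are all hypothetical.

```shell
#!/bin/sh
# Write a hypothetical options file: one command, option, or value per line.
cat > import_options.txt <<'EOF'
import
-input_file_type
delimited_text
-document_type
json
-data_type
price,number,in_stock,boolean
-uri_id
sku
-output_collections
products
EOF

# -options_file appears first; connection and input options follow.
# Printed for review only; remove "echo" to run against a real server.
echo mlcp.sh -options_file import_options.txt \
  -host localhost -port 8000 \
  -username import-user -password changeme \
  -input_file_path /data/products.csv
```

In this sketch, each row of the delimited text input would become a JSON document whose URI is based on the sku column, with price and in_stock typed as number and boolean rather than the default string type.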

We do not recommend using concurrent mlcp jobs. Regardless of the version, mlcp doesn’t support concurrent jobs if mlcp is importing from/exporting to the same data file. In addition, beginning in 10.0-4.2, each mlcp job uses the maximum number of threads available on the server as the default thread count (more about this can be found in the 10.0-4.2 release notes). Therefore, using concurrent mlcp jobs will not improve performance, as one job is already using full concurrent capacity.