Using MarkLogic Content Pump (mlcp)

Improving Throughput with -split_input

If you are loading documents from very large files, you might be able to improve throughput using the -split_input option. When -split_input is true, mlcp attempts to break large input files that would otherwise be processed in a single split into multiple splits. This enables different portions of the input file to be loaded concurrently by multiple threads in local mode.

Note

This option can only be applied to composite input file types that logically produce multiple documents and for which mlcp can efficiently identify document boundaries, such as delimited_text. Not all composite file types are supported, and files containing multi-byte characters must be UTF-8-encoded. For details, see Import Command Line Options.

In local mode, -split_input is false by default.

The -split_input option affects local mode as follows: Suppose you are importing a very large delimited text file in local mode with -split_input set to false and the data processed as a single split. The work might be performed by multiple threads (depending on the job configuration), but these threads read records from the input file synchronously. This can cause some read contention. If you set -split_input to true, then each thread is assigned its own chunk of input, resulting in less contention and greater concurrency.
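As a rough illustration of the configuration described above, the following Python sketch assembles the argument list for a local-mode delimited text import with -split_input enabled. The helper name build_mlcp_import_args, the host, port, credentials, and file path are hypothetical placeholders, and the sketch assumes that -max_split_size is expressed in bytes. It only builds and prints the command; it does not run mlcp.

```python
import shlex


def build_mlcp_import_args(input_path, split_input=True,
                           max_split_size=None, thread_count=None):
    """Assemble an mlcp import command line (hypothetical helper).

    Connection values below are placeholders; replace them with the
    settings for your own MarkLogic host.
    """
    args = [
        "mlcp.sh", "import",
        "-mode", "local",
        "-host", "localhost",        # placeholder host
        "-port", "8000",             # placeholder port
        "-username", "user",         # placeholder credentials
        "-password", "password",
        "-input_file_path", input_path,
        "-input_file_type", "delimited_text",
        "-split_input", "true" if split_input else "false",
    ]
    # Only pass tuning options that the caller explicitly set.
    if max_split_size is not None:
        # Assumption: the value is a byte count.
        args += ["-max_split_size", str(max_split_size)]
    if thread_count is not None:
        args += ["-thread_count", str(thread_count)]
    return args


if __name__ == "__main__":
    cmd = build_mlcp_import_args("/data/large.csv",
                                 max_split_size=1024 * 1024,
                                 thread_count=8)
    # Print a shell-quoted command line for inspection.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each option and its value paired, which makes it easier to compare runs with -split_input set to true and false.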

The number of splits is approximately file-size / max-split-size, with a minimum of one split, so you should also consider tuning the split size to match your input data characteristics. For example, if your data consists of a single delimited text file containing 16M of data, you can observe the following interactions between -split_input and -max_split_size:

Input File Size   -split_input   Split Size   Number of Splits
16M               false          32M          1
16M               true           32M          1
16M               true           1M           16

Tuning the split size in this case potentially enables greater concurrency because the multiple splits can be assigned to different threads or tasks.
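The arithmetic behind the table can be sketched in a few lines of Python. This is an estimate only, not mlcp's actual algorithm: the function name estimate_split_count is hypothetical, rounding up to a whole split is an assumption chosen to be consistent with the table, and the real split count can also depend on -min_split_size, block size, and where record boundaries fall.

```python
MB = 1024 * 1024


def estimate_split_count(file_size, max_split_size, split_input):
    """Estimate how many splits one input file yields (hypothetical helper).

    Assumes file-size / max-split-size rounded up, with a minimum of one
    split, and a single split whenever split_input is false.
    """
    if file_size < 0 or max_split_size <= 0:
        raise ValueError("file_size must be >= 0 and max_split_size > 0")
    if not split_input:
        return 1
    # Integer ceiling division avoids floating-point rounding.
    return max(1, (file_size + max_split_size - 1) // max_split_size)


if __name__ == "__main__":
    # The three scenarios from the table: a 16M file with varying options.
    scenarios = [(False, 32 * MB), (True, 32 * MB), (True, 1 * MB)]
    for split_input, split_size in scenarios:
        count = estimate_split_count(16 * MB, split_size, split_input)
        print(split_input, split_size // MB, count)
```

Under these assumptions, the three scenarios yield 1, 1, and 16 splits, matching the table. Only the last configuration gives multiple threads their own chunk of the file.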

Split size is tunable using -max_split_size, -min_split_size, and block size. For details, see Tuning Split Size and Thread Count for Local Mode.