Tuning Split Size and Thread Count for Local Mode
You can tune split size only when importing documents in local mode from one of the following input file types:
Whole documents (-input_file_type documents), whether from flat or compressed files.
Composite file types that support -split_input, such as delimited_text.
You cannot tune split size when creating documents from composite files that do not support -split_input, such as sequence files and aggregate XML files.
You can tune thread count for both whole documents and all composite file types. Thread count and split size can interact to affect job performance.
In local mode, a split defines the unit of work per thread devoted to a session with MarkLogic Server. The ideal split size is one that keeps all mlcp session threads busy. The default split size is 32M for local mode. Use the -max_split_size, -thread_count, and -thread_count_per_split options to tune your load.
By default, threads are assigned to splits in a round-robin fashion. For example, consider loading 120 small documents of size 1M each. Since the default split size is 32M, the load is broken into 4 splits. If -thread_count is 10, each split is assigned at least 2 threads (10 / 4 = 2, using integer division). The remaining 2 threads are each assigned to a split, so the threads are distributed across the splits as follows:
Split 1: 3 threads
Split 2: 3 threads
Split 3: 2 threads
Split 4: 2 threads
This distribution could result in two of the splits completing faster, leaving some threads idle. If you set -max_split_size to 12M, the load has 10 splits, which can be evenly distributed across the threads and may result in better thread utilization.
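The arithmetic in this example can be reproduced with a short sketch. This is an illustration only, not mlcp's actual scheduling code: the function names count_splits and threads_per_split are hypothetical, sizes are expressed in megabytes for readability, and the split count is assumed to be the total input size divided by the maximum split size, rounded up.

```python
import math


def count_splits(total_mb, max_split_mb):
    """Assumed model: number of splits = total size / max split size, rounded up."""
    return math.ceil(total_mb / max_split_mb)


def threads_per_split(thread_count, num_splits):
    """Round-robin model: every split gets the integer quotient,
    and the leftover threads go one each to the first splits."""
    base, extra = divmod(thread_count, num_splits)
    return [base + 1 if i < extra else base for i in range(num_splits)]


# 120 documents of 1M each with the default 32M split size
default_splits = count_splits(120, 32)
print(default_splits)                         # 4
print(threads_per_split(10, default_splits))  # [3, 3, 2, 2]

# Same load with the split size lowered to 12M
tuned_splits = count_splits(120, 12)
print(tuned_splits)                           # 10
print(threads_per_split(10, tuned_splits))    # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Under this model, the tuned configuration gives every thread exactly one split, which is the even distribution described above.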
In versions prior to 10.0-4.2, mlcp uses 4 as the default thread count. In version 10.0-4.2 and later, mlcp conducts initial polling to identify the available server threads on the port that handles mlcp requests, then uses this value as the default thread count. You can override the default value by specifying -thread_count on the command line.
If -thread_count is less than the number of splits, the default behavior is one thread per split, up to the total number of threads. The remaining splits must wait until a thread becomes available.
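The waiting behavior can be pictured with a small simulation. This is a sketch under stated assumptions rather than a description of mlcp internals: the function split_start_times is hypothetical, each split is assumed to be handled by one thread, and the per-split durations are invented values in arbitrary time units.

```python
import heapq


def split_start_times(thread_count, split_durations):
    """Return the time at which each split starts, assuming one thread per
    split and that a waiting split starts as soon as any thread is free."""
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    # Min-heap of the times at which each active thread becomes available.
    free_at = [0.0] * min(thread_count, len(split_durations))
    heapq.heapify(free_at)
    starts = []
    for duration in split_durations:
        start = heapq.heappop(free_at)   # earliest available thread
        starts.append(start)
        heapq.heappush(free_at, start + duration)
    return starts


# 10 equal splits but only 4 threads: 4 splits start at once, the rest wait.
print(split_start_times(4, [1.0] * 10))
# [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
```

In this model the load finishes in three rounds, and the last round keeps only two of the four threads busy.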
Note
If you specify -thread_count_per_split, each input split runs with the specified number of threads. The total thread count, however, is still controlled by -thread_count if it is specified, or otherwise by the newly calculated default thread count.
If MarkLogic Server is not I/O bound, then raising the thread count, and possibly the threads per split, can improve throughput when the number of splits is small but each split is very large. This often applies to loading from zip files, aggregate files, and delimited text files. Note that if MarkLogic Server is already I/O bound in your environment, increasing the concurrency of writes will not necessarily improve performance.