Time vs. Correctness: Understanding -fastload Tradeoffs
The -fastload option can significantly speed up ingestion during import and copy operations, but it can also cause problems if not used properly. This section describes how -fastload affects the behavior of mlcp and some of the tradeoffs associated with enabling it.
The optimizations described in this section are only enabled if you explicitly specify the -fastload or -output_directory options. (The -output_directory option implies -fastload.)
Note
The -fastload option works slightly differently when used with -restrict_hosts. For details, see How -restrict_hosts Affects -fastload. The limitations of -fastload described in this section still apply.
By default, mlcp inserts documents into the database by distributing work across the e-nodes in your MarkLogic cluster. Each e-node inserts documents into the database according to the configured document assignment policy.
This means the default insertion process for a document is similar to the following:
mlcp selects Host A from the available e-nodes in the cluster and sends it the document to be inserted.
Using the document assignment policy configured for the database, Host A determines the document should be inserted into Forest F on Host B.
Host A sends the document to Host B for insertion.
When you use -fastload (or -output_directory), mlcp attempts to cut out the middle step by applying the document assignment policy on the client. The interaction becomes similar to the following:
Using the document assignment policy, mlcp determines the document should be inserted into Forest F on Host B.
mlcp sends the document to Host B for insertion, with instructions to insert it into a specific forest.
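The two interaction patterns above can be compared with a small simulation. This is only a sketch: the forest and host names are hypothetical, and assign_forest is a toy hash-based stand-in for an assignment policy, not the algorithm MarkLogic Server actually uses.

```python
import hashlib

# Hypothetical topology: forest name -> host that owns the forest.
FORESTS = {"F1": "HostA", "F2": "HostB", "F3": "HostC"}


def assign_forest(uri, forest_names):
    """Toy assignment policy: hash the URI onto one of the forests.

    Assumption for illustration only; not MarkLogic's real policy.
    """
    ordered = sorted(forest_names)
    digest = hashlib.sha256(uri.encode("utf-8")).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def default_insert(uri, entry_host):
    """Default path: the receiving e-node applies the policy, then forwards."""
    hops = [("mlcp", entry_host)]
    forest = assign_forest(uri, FORESTS)
    owner = FORESTS[forest]
    if owner != entry_host:
        # The extra hop that -fastload tries to eliminate.
        hops.append((entry_host, owner))
    return forest, hops


def fastload_insert(uri):
    """-fastload path: the client applies the policy and sends directly."""
    forest = assign_forest(uri, FORESTS)
    return forest, [("mlcp", FORESTS[forest])]


if __name__ == "__main__":
    uri = "/docs/example-1.xml"
    print("default :", default_insert(uri, "HostA"))
    print("fastload:", fastload_insert(uri))
```

In this model both paths choose the same forest as long as the client and the server see the same forest list. The only difference is whether the document takes one network hop or possibly two.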
Pre-determining the destination host and forest can always be done safely and consistently if all of the following conditions are met:
Your forest topology is stable.
You are creating rather than updating documents.
To make forest assignment decisions locally, mlcp gathers information about the database assignment policy and forest topology at the beginning of a job. If you change the assignment policy or forest topology while an mlcp import or copy operation is running, mlcp might make forest placement decisions inconsistent with those MarkLogic Server would make. This can cause problems such as duplicate document URIs and unbalanced forests.
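The effect of a stale topology snapshot can be illustrated with a short sketch. It assumes a toy hash-based policy (assign_forest, hypothetical) and hypothetical forest names. The point is only that a client holding an old forest list and a server holding the current one can disagree about where a URI belongs.

```python
import hashlib


def assign_forest(uri, forest_names):
    """Toy assignment policy (illustrative assumption, not MarkLogic's)."""
    ordered = sorted(forest_names)
    digest = hashlib.sha256(uri.encode("utf-8")).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def count_mismatches(uris, client_snapshot, server_forests):
    """Count URIs the client would place differently from the server."""
    return sum(
        1
        for uri in uris
        if assign_forest(uri, client_snapshot) != assign_forest(uri, server_forests)
    )


# Forest list captured by the client when the job started (hypothetical).
snapshot = ["F1", "F2", "F3"]
# Forest list on the server after a forest is added mid-job (hypothetical).
current = ["F1", "F2", "F3", "F4"]

uris = ["/docs/%d.xml" % i for i in range(1000)]

stable = count_mismatches(uris, snapshot, snapshot)   # topology unchanged
drifted = count_mismatches(uris, snapshot, current)   # topology changed

if __name__ == "__main__":
    print("mismatches with stable topology :", stable)
    print("mismatches after adding a forest:", drifted)
```

Each mismatch in the second case is a document the client would send to a forest the server would not have chosen. If the same URI is later inserted through the server-side path, it can end up in a second forest, which is how duplicate URIs arise.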
Similar problems can occur if mlcp attempts to update a document already in the database, and the forest topology or assignment policy changes between the time the document was originally inserted and the time mlcp updates the document. Using user-specified forest placement when initially inserting a document creates the same conflict.
Therefore, it is not safe to enable -fastload optimizations in the following situations:
A document mlcp inserts already exists in the database and any of the following conditions are true:
The forest topology has changed since the document was originally inserted.
The assignment policy has changed since the document was originally inserted.
The assignment policy is not Legacy (default) or Bucket. For details, see How Assignment Policy Affects Optimization.
The document was originally inserted using user-specified forest placement.
A document mlcp inserts does not already exist in the database and any of the following conditions are true:
The forest topology changes while mlcp is running.
The assignment policy changes while mlcp is running.
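The conditions listed above can be written as a pre-flight check. The sketch below is an illustration only: JobContext, its field names, and the policy labels "legacy" and "bucket" are hypothetical, and mlcp does not expose such an API. You would need to supply these facts from your own knowledge of the database and the job.

```python
from dataclasses import dataclass

# Policies under which the text above considers updates safe (labels assumed).
SUPPORTED_POLICIES = {"legacy", "bucket"}


@dataclass
class JobContext:
    """Hypothetical description of one document insert in an mlcp job."""
    document_exists: bool
    assignment_policy: str = "legacy"
    topology_changed_since_insert: bool = False
    policy_changed_since_insert: bool = False
    inserted_with_forest_placement: bool = False
    topology_changes_during_job: bool = False
    policy_changes_during_job: bool = False


def fastload_is_safe(ctx):
    """Return False if any unsafe condition from the list above applies."""
    if ctx.document_exists:
        # Updating a document that is already in the database.
        return not (
            ctx.topology_changed_since_insert
            or ctx.policy_changed_since_insert
            or ctx.assignment_policy not in SUPPORTED_POLICIES
            or ctx.inserted_with_forest_placement
        )
    # Creating a new document.
    return not (ctx.topology_changes_during_job or ctx.policy_changes_during_job)


if __name__ == "__main__":
    print(fastload_is_safe(JobContext(document_exists=False)))
    print(fastload_is_safe(JobContext(document_exists=True,
                                      inserted_with_forest_placement=True)))
```

The function encodes only the two cases as they are written in the list. It does not attempt to detect topology or policy changes itself.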
Assignment policy is a database configuration setting that determines which forest MarkLogic Server inserts a document into, or moves a document into during rebalancing. For details, see Rebalancer Document Assignment Policies in Administrating MarkLogic Server.
Note
Assignment policy was introduced with MarkLogic 7 and mlcp v1.2. If you use an earlier version of mlcp with MarkLogic 7 or later, the database you import data into with -fastload or -output_directory must be using the legacy assignment policy.
Any operation that changes the forests available for updates changes your forest topology, including the following:
Adding or employing a new forest
Removing or retiring an existing forest
Changing the updates-allowed state of a forest. For example, calling admin:forest-set-updates-allowed.
Changing the database assignment policy
In most cases, it is your responsibility to determine whether or not you can safely use -fastload (or -output_directory, which implies -fastload). In cases where mlcp can detect that -fastload is unsafe, it disables the option or reports an error.