Using MarkLogic Content Pump (mlcp)

Time vs. Correctness: Understanding -fastload Tradeoffs

The -fastload option can significantly speed up ingestion during import and copy operations, but it can also cause problems if not used properly. This section describes how -fastload affects the behavior of mlcp and some of the tradeoffs associated with enabling it.

The optimizations described in this section are only enabled if you explicitly specify the -fastload or -output_directory option. (The -output_directory option implies -fastload.)

Note

The -fastload option works slightly differently when used with -restrict_hosts. For details, see How -restrict_hosts Affects -fastload. The limitations of -fastload described in this section still apply.

By default, mlcp inserts documents into the database by distributing work across the e-nodes in your MarkLogic cluster. Each e-node inserts documents into the database according to the configured document assignment policy.

This means the default insertion process for a document is similar to the following:

  1. mlcp selects Host A from the available e-nodes in the cluster and sends it the document to be inserted.

  2. Using the document assignment policy configured for the database, Host A determines the document should be inserted into Forest F on Host B.

  3. Host A sends the document to Host B for insertion.

When you use -fastload (or -output_directory), mlcp attempts to cut out the middle step by applying the document assignment policy on the client. The interaction becomes similar to the following:

  1. Using the document assignment policy, mlcp determines the document should be inserted into Forest F on Host B.

  2. mlcp sends the document to Host B for insertion, with instructions to insert it into a specific forest.
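The difference between the two flows can be illustrated with a small simulation. This is a minimal sketch, not mlcp's implementation: the host and forest names, the assign_forest function, and the hash-based placement are all hypothetical stand-ins for the real assignment policy. It only shows that both paths pick the same forest while the -fastload path skips the intermediate host.

```python
import hashlib

# Hypothetical cluster layout: forest name -> host that owns it.
FOREST_HOSTS = {"F1": "host-a", "F2": "host-b", "F3": "host-c"}


def assign_forest(uri, forests):
    """Toy stand-in for a document assignment policy.

    Hashes the URI onto one of the forests. This is NOT the algorithm
    MarkLogic Server uses; it only needs to be deterministic.
    """
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return forests[int.from_bytes(digest[:8], "big") % len(forests)]


def default_route(uri, entry_host):
    """Default flow: mlcp -> entry e-node -> host owning the target forest."""
    forest = assign_forest(uri, sorted(FOREST_HOSTS))
    owner = FOREST_HOSTS[forest]
    hops = [("mlcp", entry_host)]
    if owner != entry_host:
        # The entry e-node forwards the document to the owning host.
        hops.append((entry_host, owner))
    return forest, hops


def fastload_route(uri):
    """-fastload flow: the client applies the policy and goes straight to the owner."""
    forest = assign_forest(uri, sorted(FOREST_HOSTS))
    return forest, [("mlcp", FOREST_HOSTS[forest])]


if __name__ == "__main__":
    for uri in ["/docs/a.xml", "/docs/b.json"]:
        print(uri, default_route(uri, "host-a"), fastload_route(uri))
```

In this sketch the saving is at most one network hop per document, and it depends on the client and the server agreeing on both the policy and the list of forests.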

Pre-determining the destination host and forest can always be done safely and consistently if all of the following conditions are met:

  • Your forest topology is stable.

  • You are creating rather than updating documents.

To make forest assignment decisions locally, mlcp gathers information about the database assignment policy and forest topology at the beginning of a job. If you change the assignment policy or forest topology while an mlcp import or copy operation is running, mlcp might make forest placement decisions inconsistent with those MarkLogic Server would make. This can cause problems such as duplicate document URIs and unbalanced forests.
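To see why a stale view of the topology matters, consider the following sketch. It assumes a hypothetical hash-based placement function (not MarkLogic's actual policy) and compares the forest chosen from a snapshot taken at job start with the forest chosen after a fourth forest is added. Any URI for which the two answers differ could end up in a forest the server would not have picked, which is how duplicate URIs and unbalanced forests can arise.

```python
import hashlib


def assign_forest(uri, forests):
    """Toy placement function (an assumption, not MarkLogic's algorithm)."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return forests[int.from_bytes(digest[:8], "big") % len(forests)]


def find_misplaced(uris, snapshot_forests, current_forests):
    """Return URIs whose placement under the stale snapshot differs
    from the placement under the current topology."""
    return [
        uri
        for uri in uris
        if assign_forest(uri, snapshot_forests) != assign_forest(uri, current_forests)
    ]


# Hypothetical job: topology captured at start, then a forest is added mid-run.
uris = [f"/docs/{i}.xml" for i in range(1000)]
snapshot = ["F1", "F2", "F3"]
current = ["F1", "F2", "F3", "F4"]

misplaced = find_misplaced(uris, snapshot, current)

if __name__ == "__main__":
    print(f"{len(misplaced)} of {len(uris)} URIs would be placed differently")
```

With an unchanged topology the function returns an empty list, which corresponds to the safe case described above.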

Similar problems can occur if mlcp attempts to update a document already in the database, and the forest topology or assignment policy changes between the time the document was originally inserted and the time mlcp updates the document. Using user-specified forest placement when initially inserting a document creates the same conflict.

Therefore, it is not safe to enable -fastload optimizations in the following situations:

  • A document mlcp inserts already exists in the database and any of the following conditions are true:

    • The forest topology has changed since the document was originally inserted.

    • The assignment policy has changed since the document was originally inserted.

    • The assignment policy is not Legacy (default) or Bucket. For details, see How Assignment Policy Affects Optimization.

    • The document was originally inserted using user-specified forest placement.

  • A document mlcp inserts does not already exist in the database and any of the following conditions are true:

    • The forest topology changes while mlcp is running.

    • The assignment policy changes while mlcp is running.
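The conditions above can be written down as a simple checklist. The following is a sketch only: the JobContext fields and the fastload_is_safe function are hypothetical names introduced for illustration, and mlcp exposes no such API. It encodes the list literally, so you still have to supply accurate answers about your own cluster.

```python
from dataclasses import dataclass


@dataclass
class JobContext:
    """Hypothetical description of one document insert (not an mlcp type)."""
    document_exists: bool
    topology_changed_since_insert: bool = False
    policy_changed_since_insert: bool = False
    assignment_policy: str = "legacy"
    user_specified_placement: bool = False
    topology_changes_during_job: bool = False
    policy_changes_during_job: bool = False


def fastload_is_safe(ctx):
    """Return True if none of the unsafe conditions listed above applies."""
    if ctx.document_exists:
        # Updating an existing document: check its insertion history.
        return not (
            ctx.topology_changed_since_insert
            or ctx.policy_changed_since_insert
            or ctx.assignment_policy not in {"legacy", "bucket"}
            or ctx.user_specified_placement
        )
    # Creating a new document: only changes during the job matter.
    return not (ctx.topology_changes_during_job or ctx.policy_changes_during_job)


if __name__ == "__main__":
    print(fastload_is_safe(JobContext(document_exists=False)))
    print(fastload_is_safe(JobContext(document_exists=True, user_specified_placement=True)))
```

A check like this is only as reliable as its inputs; it documents the decision, it does not detect topology changes for you.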

Assignment policy is a database configuration setting that determines which forest MarkLogic Server inserts a document into, and which forest it moves a document to during rebalancing. For details, see Rebalancer Document Assignment Policies in Administrating MarkLogic Server.

Note

Assignment policy was introduced with MarkLogic 7 and mlcp v1.2. If you use an earlier version of mlcp with MarkLogic 7 or later, the database you import data into with -fastload or -output_directory must be using the legacy assignment policy.

Any operation that changes the forests available for updates changes your forest topology, including the following:

  • Adding or employing a new forest

  • Removing or retiring an existing forest

  • Changing the updates-allowed state of a forest. For example, calling admin:forest-set-updates-allowed

  • Changing the database assignment policy

In most cases, it is your responsibility to determine whether or not you can safely use -fastload (or -output_directory, which implies -fastload). In cases where mlcp can detect -fastload is unsafe, it will disable it or give you an error.