
Using MarkLogic Content Pump (mlcp)

Retry Mechanism When Commit Fails During Ingestion

When mlcp is used to ingest content into Data Hub Service (DHS), it frequently encounters exceptions when the static e-node is overloaded or when the dynamic e-nodes, which come and go, are unavailable.

Before 10.0-5, when an mlcp commit failed during ingestion due to the exceptions described above, mlcp did not retry the batch, and all the documents in the current batch failed permanently. A retry mechanism was added in 10.0-5 to make mlcp more robust and able to recover from these exceptions.

There are three circumstances that need to be considered:

  • If -batch_size is 1 and -transaction_size is 1: mlcp uses AUTO transaction mode, in which transactions automatically commit and roll back. mlcp will retry inserting the whole batch when it catches an exception during commit.

  • If -batch_size is larger than 1 and -transaction_size is 1: mlcp uses UPDATE transaction mode and explicitly commits and rolls back. mlcp will retry loading the whole batch if the exceptions caught during commit are retryable, up to a maximum of 15 retries. Between retries, it sleeps for an interval that starts at 0.5 seconds and doubles with each retry, capped at 2 minutes. The total maximum sleep time adds up to roughly 16 minutes, which is tuned to give dynamic e-nodes time to come up. In most cases, a successful retry will not cause any insertions to fail.

  • If -batch_size is larger than 1 and -transaction_size is larger than 1: mlcp does not retry in this situation, because the client only caches the current batch and cannot replay the earlier batches in the transaction. All the documents in the current transaction will fail permanently.
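The three cases above reduce to a simple question of whether a failed commit is retried. The following sketch summarizes them; the function name and return convention are invented for illustration and are not part of mlcp:

```python
def retries_failed_commit(batch_size: int, transaction_size: int) -> bool:
    """Summarize the three documented cases: does mlcp retry the batch
    when a commit fails? (Illustrative only; not an mlcp API.)"""
    if transaction_size == 1:
        # batch_size == 1: AUTO transaction mode, the batch is retried.
        # batch_size  > 1: UPDATE transaction mode, retryable failures
        # are retried up to 15 times.
        return True
    # transaction_size > 1: only the current batch is cached client-side,
    # so the transaction cannot be replayed; documents fail permanently.
    return False
```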

mlcp only retries when the exceptions caught are retryable. Each time mlcp retries, it attempts to select a different host. When an exception is not retryable, or the retries do not succeed within the ~16 minutes allowed for the DHS cluster to recover, all the documents in the current batch will fail permanently and mlcp will log the failure.
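The backoff arithmetic can be checked with a short sketch. The constants (15 retries, 0.5 second initial sleep, 2 minute cap) come from the description above; the function name is invented for illustration:

```python
MAX_RETRIES = 15
INITIAL_SLEEP_S = 0.5
MAX_SLEEP_S = 120.0  # cap of 2 minutes

def backoff_schedule():
    """Return the sleep interval (seconds) before each of the 15 retries."""
    sleeps = []
    sleep = INITIAL_SLEEP_S
    for _ in range(MAX_RETRIES):
        sleeps.append(sleep)
        sleep = min(sleep * 2, MAX_SLEEP_S)  # double, capped at 2 minutes
    return sleeps

# 0.5 + 1 + 2 + ... + 64, then 120 s for each remaining retry:
# the total is 967.5 s, i.e. the ~16 minutes mentioned above.
total_seconds = sum(backoff_schedule())
```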

When the current batch fails during insert or commit, the failure is logged at WARN level. If the exception is retryable, mlcp retries inserting the whole batch and logs the retry messages at DEBUG level. If the retry succeeds, a success message is logged at INFO level. If the exception is not retryable, or the maximum retry limit has been exceeded, the document or batch fails permanently and the failure is logged at ERROR level.

Each log message carries a batch number in the format xxxx.xxxx (two integers separated by a dot). The first integer represents the current thread number and the second is the batch count local to that thread; the combination is globally unique. This batch number makes it easier to track down and debug batch failures.
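For example, the batch number can be pulled out of a log line with a small regular expression; the pattern and function name here are illustrative, using one of the log lines shown below:

```python
import re

# Matches the "Batch #<thread>.<count>" (or "Batch <thread>.<count>")
# token that mlcp attaches to each log message.
BATCH_RE = re.compile(r"Batch #?(\d+)\.(\d+)")

def parse_batch_number(log_line: str):
    """Return (thread_number, local_batch_count), or None if absent."""
    m = BATCH_RE.search(log_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

line = ("WARN contentpump.TransformWriter: Batch #88895712.638: "
        "Failed committing transaction: Error parsing HTTP headers")
```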

The following messages are an example of common exceptions caught when running mlcp against a DHS cluster on AWS or Azure. These exceptions mostly happen when e-nodes are down or the static e-node gets overloaded. Timestamps have been removed from these examples.

...WARN contentpump.TransformWriter: Batch #88895712.638: Failed committing transaction: Error parsing HTTP headers: Premature EOF, partial header line read: ''
...WARN mapreduce.ContentWriter: Batch #88895712.638: QueryException:XDMP-XDQPDISC: XDQP connection disconnected, server=somehost
...WARN contentpump.TransformWriter: Batch #1520482927.642: Failed committing transaction: Server cannot accept request: Service Unavailable -- Stopping by SIGTERM from pid 3121
...WARN mapreduce.ContentWriter: Batch #1520482927.642: com.marklogic.xcc.exceptions.XQueryException: XDMP-NOTXN: No transaction with identifier 11132444146034518336
[Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=5bJZEjQ1L.z.marklogicsvc.com/52.224.204.231:8005, pool=0/64]]]
[Client: XCC/11.0-20200911, Server: XDBC/10.0-4]

Note

mlcp receives XDMP-NOTXN when the transaction has already been committed or rolled back.

The following messages are an example of output during a retry event. Timestamps have been removed.

...WARN contentpump.TransformWriter: Batch 1473219859.1010: Exception:Server cannot accept request: Gateway Time-out
...WARN contentpump.TransformWriter: Batch 1473219859.1010: Failed during inserting
...DEBUG mapreduce.ContentWriter: Batch 1473219859.1010: Sleeping before retrying...sleepTime=500ms
...DEBUG contentpump.TransformWriter: Batch 1473219859.1010: Retrying inserting batch, attempts: 1/15
...INFO contentpump.TransformWriter: Batch 1473219859.1010: Retrying inserting batch is successful
...WARN contentpump.TransformWriter: Batch 278973739.75: Failed committing transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 918057596.3: Failed committing transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 278973739.75: Failed during committing
...WARN contentpump.TransformWriter: Batch 918057596.3: Failed during committing
...WARN contentpump.TransformWriter: Batch 1763434846.80: Failed committing transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 1763434846.80: Failed during committing
...WARN contentpump.TransformWriter: Batch 981349710.122: Failed committing transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 981349710.122: Failed during committing
...WARN mapreduce.ContentWriter: Batch 278973739.75: Failed rolling back transaction: No transaction
...DEBUG mapreduce.ContentWriter: com.marklogic.xcc.exceptions.XQueryException: XDMP-NOTXN: No transaction with identifier 11132444146034518336
[Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=5bJZEjQ1L.z.marklogicsvc.com/52.224.204.231:8005, pool=0/64]]]
[Client: XCC/11.0-20200911, Server: XDBC/10.0-4]
...DEBUG mapreduce.ContentWriter: Batch 278973739.75: Sleeping before retrying...sleepTime=500ms
...WARN contentpump.TransformWriter: Batch 1978594827.298: QueryException: JS-FATAL: xdmp:function(fn:QName(, transformInsertBatch), /MarkLogic/hadoop.sjs)($transform-module, $transform-function, $uris, $values, $insert-options, $transform-option) 
...WARN contentpump.TransformWriter: Batch 1978594827.298: Failed during inserting
...ERROR contentpump.TransformWriter: Batch 1978594827.298: Document failed permanently: /space/data/iplocations/IP2LOCATION-LITE-DB5.CSV.gz-0-2798613 in file:/space/data/iplocations/IP2LOCATION-LITE-DB5.CSV.gz at line 2798614