Failover Handling
Failover occurs when a forest or a host in a cluster becomes unavailable, due to events such as a forest restart or a host becoming unreachable. You can configure a database to use local or shared disk failover to attempt automatic recovery; for details see High Availability of Data Nodes With Failover in the Scalability, Availability, and Failover Guide.
Note
Failover support in mlcp is only available when running mlcp against MarkLogic 9 or later. With older MarkLogic versions, the job will fail if mlcp is connected to a host that becomes unavailable.
mlcp always attempts to connect to a new host during a failover event. mlcp can potentially recover from a failover event in the following cases:
If mlcp receives a connection error that indicates an e-node serving the database is down, mlcp attempts to select another host. For a job that is not running in fastload mode, mlcp selects the next host in its host list. For a fastload job, mlcp attempts to determine the replica forest and host and connect to that host.
If mlcp receives a retriable error from MarkLogic, it will retry the operation with the same host. For example, a forest restart or a forest replica host going down can cause a retriable error.
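The two cases above can be modeled roughly as follows. This is only a conceptual sketch of the decision for a non-fastload job, not mlcp's actual implementation: the names `ErrorKind` and `HostSelector` are hypothetical, wrapping around to the first host after the last one is an assumption, and the fastload path (locating the replica forest and its host) is not modeled.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ErrorKind(Enum):
    CONNECTION = auto()  # the e-node serving the database appears to be down
    RETRIABLE = auto()   # MarkLogic reported a retriable error


@dataclass
class HostSelector:
    """Hypothetical model of host selection for a non-fastload job."""
    hosts: list[str]
    current: int = 0

    def host(self) -> str:
        return self.hosts[self.current]

    def on_error(self, kind: ErrorKind) -> str:
        """Return the host to use for the next attempt."""
        if kind is ErrorKind.CONNECTION:
            # Connection error: move on to the next host in the host list.
            # Wrapping around at the end of the list is an assumption.
            self.current = (self.current + 1) % len(self.hosts)
        # Retriable error: stay on the same host and retry the operation.
        return self.host()
```

For example, with hosts `["a", "b", "c"]`, a retriable error keeps the selector on `a`, while a connection error moves it to `b`.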
If mlcp is able to re-establish a connection in these cases, then the job can continue. It is possible for some documents not to be imported, depending on the configuration of the job. mlcp can only retry the current batch.
If -transaction_size is 1, then mlcp only needs to retry the current batch. In most cases, a successful failover will not cause any insertions to fail.
If -transaction_size is greater than 1, then mlcp can only retry the current batch. Other batches in the same transaction cannot be retried. Some documents might not be inserted.
Even if -transaction_size is 1, mlcp might fail to import all documents in the face of a failover event in some cases. For example:
Failover does not succeed within 5 minutes. If it takes more than 5 minutes for MarkLogic to recover from the failure, then mlcp aborts the job and reports an error.
mlcp reports any documents that could not be inserted due to the failover.
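The retry behavior described above can be illustrated with a small sketch. It is not mlcp's code: `retry_until_deadline` and `max_docs_at_risk` are hypothetical helpers, the one-second pause between attempts is an assumption, and only the 5-minute limit comes from the text. The second helper assumes each batch holds up to -batch_size documents and reads "other batches in the same transaction cannot be retried" as an upper bound on the documents one failed transaction might lose.

```python
import time
from typing import Callable


def retry_until_deadline(
    operation: Callable[[], bool],
    deadline_seconds: float = 300.0,  # the 5-minute limit described above
    pause_seconds: float = 1.0,       # assumed pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry `operation` until it returns True or the deadline passes."""
    start = clock()
    while True:
        if operation():
            return True   # the batch was inserted; the job can continue
        if clock() - start >= deadline_seconds:
            return False  # recovery took too long; the job would abort
        sleep(pause_seconds)


def max_docs_at_risk(batch_size: int, transaction_size: int) -> int:
    """Upper bound on documents in the batches that cannot be retried."""
    # Only the current batch is retried; the other batches in the same
    # transaction are not, so at most (transaction_size - 1) batches are lost.
    return batch_size * (transaction_size - 1)
```

Under these assumptions, `max_docs_at_risk(100, 1)` is 0, which matches the statement that a successful failover usually causes no failed insertions when -transaction_size is 1.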
The following messages are an example of mlcp output during a failover event. Timestamps have been elided.
A failure of some kind occurs, such as a host going down. The exact error messages will depend on the type of failure. Notice that the example errors below include a retriable exception.
...INFO contentpump.LocalJobRunner: completed 41%
...WARNING [29] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
...WARN mapreduce.ContentWriter: Batch 981349710.122: Exception:Error parsing HTTP headers: Premature EOF, partial header line read: ''
...WARNING [29] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
...WARN mapreduce.ContentWriter: Batch 981349710.122: Failed rolling back transaction Error parsing HTTP headers: Premature EOF, partial header line read: ''
...WARNING [29] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
...ERROR mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:XDMP-XDQPDISC: XDQP connection disconnected, server=somehost
...ERROR mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:XDMP-XDQPDISC: XDQP connection disconnected, server=somehost
...ERROR mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:XDMP-XDQPDISC: XDQP connection disconnected, server=somehost
mlcp begins retrying the failed insertion. Errors may continue to occur because MarkLogic is still failing over.
...INFO mapreduce.ContentWriter: Batch 981349710.122: Retrying document insert
...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
...INFO mapreduce.ContentWriter: Batch 981349710.122: Retrying document insert
...INFO mapreduce.ContentWriter: Batch 981349710.122: Retrying document insert
...INFO mapreduce.ContentWriter: Batch 981349710.122: Retrying document insert
...WARN mapreduce.ContentWriter: Batch 981349710.122: Exception:Connection refused
...WARN mapreduce.ContentWriter: Batch 981349710.122: Exception:Connection refused
...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
...WARN mapreduce.ContentWriter: Batch 981349710.122: RetryableQueryException:SVC-SOCCONN: Socket connect error: connect 172.18.130.117:7999: Connection refused
Eventually, MarkLogic Server recovers, and the job continues normally.
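When reviewing output like the messages above, it can help to count retries and error codes per batch. The following is a minimal sketch that assumes one log entry per line in the format shown in this section; the function name `summarize` and both regular expressions are illustrative, are not part of mlcp, and recognize only the XDMP- and SVC- error code prefixes that appear in this example.

```python
import re
from collections import Counter

# Matches the "Batch <id>: <message>" portion of a ContentWriter log entry.
BATCH_LINE = re.compile(r"Batch (?P<batch>[\d.]+): (?P<message>.*)")
# Matches MarkLogic error codes such as XDMP-XDQPDISC or SVC-SOCCONN.
ERROR_CODE = re.compile(r"\b(?:XDMP|SVC)-[A-Z]+\b")


def summarize(lines):
    """Return (retries per batch, occurrences per error code)."""
    retries = Counter()
    codes = Counter()
    for line in lines:
        match = BATCH_LINE.search(line)
        if not match:
            continue  # not a per-batch message, e.g. a progress line
        batch, message = match["batch"], match["message"]
        if message.startswith("Retrying document insert"):
            retries[batch] += 1
        for code in ERROR_CODE.findall(message):
            codes[code] += 1
    return retries, codes
```

Calling `summarize` on the lines of a saved log returns two `Counter` objects, one keyed by batch identifier and one keyed by error code.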