Storage Failure Detection and Failover

MarkLogic 11 extends the existing high-availability features to detect storage problems and, optionally, trigger failover. A background process monitors every file system that stores forest data. If access to one of those file systems is significantly delayed, MarkLogic logs messages at progressively higher levels as the delay grows:

5 to 10 seconds - Debug log message
- Debug: Forest::<forest name> read file <forest data directory>/DiskCheck hang for <x> seconds
>= 10 seconds - Warning log message
- Warning: Forest::<forest name> read file <forest data directory>/DiskCheck hang for <x> seconds
>= 60 seconds - Error log message
- Error: Forest::<forest name> read file <forest data directory>/DiskCheck hang for <x> seconds
>= 180 seconds - Critical log message
- Critical: Forest::<forest name> read file <forest data directory>/DiskCheck hang for <x> seconds
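
The escalation above can be sketched as a small lookup. This is an illustration of the documented thresholds, not MarkLogic's internal implementation; the names `diskCheckLevel` and `diskCheckMessage` are hypothetical, and treating a delay of exactly 10 seconds as a Warning is an assumption, since the list gives both "5 to 10 seconds" and ">= 10 seconds".

```javascript
// Thresholds from the list above, ordered from most to least severe.
// Assumption: a delay of exactly 10 seconds is logged as a Warning.
const LEVELS = [
  { minSeconds: 180, level: "Critical" },
  { minSeconds: 60, level: "Error" },
  { minSeconds: 10, level: "Warning" },
  { minSeconds: 5, level: "Debug" },
];

// Returns the log level for a given delay, or null when the delay is
// under 5 seconds and no message is expected.
function diskCheckLevel(hangSeconds) {
  const match = LEVELS.find((entry) => hangSeconds >= entry.minSeconds);
  return match ? match.level : null;
}

// Builds a message in the documented shape (hypothetical helper).
function diskCheckMessage(forestName, forestDataDir, hangSeconds) {
  const level = diskCheckLevel(hangSeconds);
  if (level === null) {
    return null;
  }
  return `${level}: Forest::${forestName} read file ${forestDataDir}/DiskCheck hang for ${hangSeconds} seconds`;
}
```

A helper like this may be handy when writing log-monitoring rules that should alert only on Error and Critical messages.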

Failover

Building on the new storage monitoring mechanism, MarkLogic 11 can also trigger a failover if storage cannot be accessed for a configurable amount of time. Two new database configuration settings control this behavior:

Setting	Description
shutdown-on-storage-failure	When set to true, if storage for any primary forest of the database cannot be accessed for longer than the configured timeout, the MarkLogic process shuts itself down, triggering a failover of any forests managed by that MarkLogic host. This is disabled by default. The following new admin functions get and set this behavior: `admin.databaseGetShutdownOnStorageFailure()` and `admin.databaseSetShutdownOnStorageFailure()`.
storage-failure-timeout	If "shutdown-on-storage-failure" is set to true and storage cannot be accessed, this is the number of seconds to wait before triggering a MarkLogic shutdown. The default is 60 seconds. The following new admin functions get and set this value: `admin.databaseGetStorageFailureTimeout()` and `admin.databaseSetStorageFailureTimeout()`.
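
As a rough sketch, the two settings might be enabled together from Server-Side JavaScript as shown below. The argument order `(config, databaseId, value)` for the two new setters is an assumption based on the pattern of other `admin.databaseSet*` functions, so check the API reference before relying on it. The wrapper name `enableStorageFailureShutdown` is hypothetical, and the admin module is passed in as a parameter so the logic can be exercised outside MarkLogic.

```javascript
// Sketch only. Assumption: the new setters take (config, databaseId, value)
// and return the updated configuration, like other admin.databaseSet* calls.
function enableStorageFailureShutdown(admin, databaseId, timeoutSeconds) {
  if (!Number.isInteger(timeoutSeconds) || timeoutSeconds <= 0) {
    throw new RangeError("timeoutSeconds must be a positive integer");
  }
  let config = admin.getConfiguration();
  // Turn on shutdown-on-storage-failure for the database.
  config = admin.databaseSetShutdownOnStorageFailure(config, databaseId, true);
  // Set storage-failure-timeout (the documented default is 60 seconds).
  config = admin.databaseSetStorageFailureTimeout(config, databaseId, timeoutSeconds);
  admin.saveConfiguration(config);
  return config;
}

// Possible usage inside MarkLogic ("Documents" is a placeholder database name):
//   const admin = require("/MarkLogic/admin.xqy");
//   enableStorageFailureShutdown(admin, xdmp.database("Documents"), 60);
```

Applying both settings to one configuration object and saving once keeps the change to a single configuration update.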

Caution

The "shutdown-on-storage-failure" setting should only be used for databases that have forests configured for high availability with local-disk failover. See Configuring Local-Disk Failover for a Forest in the Scalability, Availability, and Failover Guide for how to configure local-disk failover. This feature should not be used with shared-disk failover.

If a MarkLogic host has shut itself down due to a detected storage failure, repair the storage before restarting the MarkLogic process. Once the storage has been repaired and the MarkLogic host has been restarted, follow the standard procedures in Reverting a Failed Over Forest Back to the Primary Host in the Scalability, Availability, and Failover Guide.

Notice

When MarkLogic shuts down due to a detected storage failure, it attempts to write a marker file to the default data directory (typically /var/opt/MarkLogic). When restarting, if MarkLogic sees this marker file, it will pause for two minutes prior to startup and then attempt to clear the marker file. This is to allow adequate time for failover to occur if there is an automated process in place to restart MarkLogic.
