MarkLogic 9 Product Documentation
Ops Director Guide — Chapter 9

Troubleshooting with Ops Director

Ops Director can help you troubleshoot the MarkLogic clusters in your enterprise.

Ops Director has set thresholds on specific metrics to alert you when a metric exceeds a pre-specified value. Many metrics that can help in alerting and troubleshooting are meaningful only in relation to normal patterns of performance. For example, monitoring an App Server for slow queries will require a different threshold on an application that spawns many long-running queries to the task server than on an HTTP App Server where queries are normally in the 100 ms range.
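As an illustration of why a threshold only makes sense relative to a normal pattern, the following minimal Python sketch derives a slow-query threshold from baseline samples. It is not how Ops Director computes its thresholds; the function names, the "mean plus k standard deviations" rule, and the sample timings are all assumptions made for this example.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return mean + k * sample standard deviation of the baseline samples.

    The rule and the default k are illustrative assumptions, not Ops Director behavior.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    return mean(samples) + k * stdev(samples)


def exceeds_baseline(value, samples, k=3.0):
    """True when value is above the threshold derived from the baseline samples."""
    return value > baseline_threshold(samples, k)


# Hypothetical query times in milliseconds for two kinds of workload.
http_baseline_ms = [95, 100, 105, 98, 102]               # queries normally near 100 ms
task_baseline_ms = [40000, 60000, 50000, 45000, 55000]   # long-running task server work

# The same 500 ms query is unusual for the first workload but not for the second.
print(exceeds_baseline(500, http_baseline_ms))
print(exceeds_baseline(500, task_baseline_ms))
```

The point of the sketch is only that one fixed value cannot serve both workloads; any real threshold should come from your own observed performance patterns.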

This chapter describes major use cases of troubleshooting with Ops Director. It provides a set of guiding questions to help you understand and identify the metrics that are of interest under various circumstances.

This chapter covers the following topics:

Assess Whether MarkLogic Has Adequate Resources
Assess the Overall State of the System
Assess MarkLogic Cluster Performance
Assess Severity of Problems in the System

Assess Whether MarkLogic Has Adequate Resources

MarkLogic Server is designed to fully utilize system resources. Many settings, such as cache sizes, are auto-sized by MarkLogic Server at installation.

Aspect	Analysis
Sufficient resources for MarkLogic Server on the host machine	What processes other than MarkLogic Server are running on the host, and what host resources do those processes require? When competing with other processes, MarkLogic Server cannot optimize resource utilization and consequently cannot optimize performance. The metrics on the MONITOR View generally give you a high-level view of the loads on your cluster resources. For more detail, use the ANALYZE View.
Sufficient disk space for forest data and merges	Merges require at least one and one half times as much free disk space as used by the forest data (for details, see Memory, Disk Space, and Swap Space Requirements in the Installation Guide). If a merge runs out of disk space, it will fail. The metrics described in Database Performance Data will help isolate disk space problems.
Sufficient disk space to log system activity	If there is no space left on the log file device, MarkLogic Server will abort. Also, if there is no disk space available to add messages to the log files, MarkLogic Server will fail to start. The metrics described in Database Performance Data will help isolate disk space problems.
Sufficient memory for the range indexes	Range indexes improve performance at the cost of memory and increased load/reindex time. Running out of memory for range indexes may result in undesirable memory swapping that severely impacts performance. The metrics described in Memory Performance Data will help isolate memory problems.
Correctness of swap space configuration	At query time, MarkLogic Server makes use of both memory and swap space. If there is not enough of either, the query can fail with SVC-MEMALLOC messages. The metrics described in Memory Performance Data will help isolate swap space problems. For details on configuring swap memory, see Tuning Query Performance in MarkLogic Server in the Query Performance and Tuning Guide.
Number of hosts in a cluster and their configuration	How many hosts are in a cluster? How are the hosts configured as evaluator and data nodes? How are the hosts organized into groups? For details on configuring MarkLogic Server clusters, see Clustering in MarkLogic Server in the Scalability, Availability, and Failover Guide.
Applications with resource-intensive features	Applications with resource-intensive features include CPF, replication, and point-in-time recovery. Are the hardware, software, and network resources available and configured to most efficiently support such applications?
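The merge disk space guideline in the table above (free space of at least one and one half times the forest data) reduces to simple arithmetic. The following Python sketch shows the calculation; the function names and the example sizes are hypothetical, and the byte counts would come from whatever monitoring source you use.

```python
MERGE_HEADROOM_FACTOR = 1.5  # "at least one and one half times" the forest data size

GB = 1024 ** 3


def merge_headroom_bytes(forest_bytes, free_bytes, factor=MERGE_HEADROOM_FACTOR):
    """Return free space minus the space a merge may need; negative means a shortfall."""
    return free_bytes - int(forest_bytes * factor)


def has_merge_headroom(forest_bytes, free_bytes, factor=MERGE_HEADROOM_FACTOR):
    """True when free disk space meets the merge guideline for the given forest size."""
    return merge_headroom_bytes(forest_bytes, free_bytes, factor) >= 0


# Hypothetical example: 100 GB of forest data on a device with 120 GB free.
shortfall = merge_headroom_bytes(100 * GB, 120 * GB)
print(has_merge_headroom(100 * GB, 120 * GB), shortfall // GB)
```

In this example the device is 30 GB short of the guideline, so a merge could fail even though free space exceeds the current forest size.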

Assess the Overall State of the System

Many problems that impact MarkLogic Server originate outside of MarkLogic Server. Consider the health of your overall environment.

Aspect	Analysis
Efficiency of CPU usage	How much CPU capacity exists at different time slices? What is the execution speed of the current read and write tasks? Can you optimize queries or choose a better time for batch loads?
Efficiency of I/O usage	What amount of data is currently being read from or written to disk? Are there any I/O bottlenecks?
Free disk space per filesystem	Is there enough free disk space for each filesystem?
Network state	What is the current state of the network?
Error messages in application logs	Are there any errors or warnings appearing in the logs of MarkLogic Server or applications?
Error messages in system logs	Are there any serious errors in the system log files? Your monitoring tool, or an auxiliary tool such as Splunk, should monitor your system logs and report on any detected errors.
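The free disk space question in the table above can also be checked outside of Ops Director. The following Python sketch uses the standard library call shutil.disk_usage to flag filesystems that are low on space. The 20 percent default and the function names are assumptions chosen for illustration; substitute the mount points that hold your forest data and log files.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem containing path that is free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # treat an unreadable or zero-sized filesystem as having no free space
    return usage.free / usage.total


def low_space_filesystems(paths, min_free_fraction=0.2):
    """Return the paths whose filesystems have less free space than min_free_fraction."""
    return [p for p in paths if free_fraction(p) < min_free_fraction]


# The current directory stands in for real data and log mount points.
print(low_space_filesystems(["."]))
```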

Assess MarkLogic Cluster Performance

When you suspect an error or performance problem originates from MarkLogic Server, some questions to ask are as follows. Most of these metrics can be viewed on the MONITOR View and ANALYZE View.

Aspect	Analysis
Whether all resources in the cluster are utilized	Are all of the hosts in the cluster online? Are all of the App Servers enabled? In what states are the forests?
Query optimization and load balancing	What are the patterns of queries and updates? Do they appear to be evenly distributed across the hosts in the cluster?
Long-running queries	Longer than usual query execution times may indicate a bottleneck, such as a slow host or problems with XDQP communication between hosts. Other possible problems include increased loads following a failover or more than the usual number of total requests.
Increase in the number of outstanding requests	A consistent increase in the total number of outstanding requests may indicate the need to add more capacity and/or load balance. Decreases in total requests may indicate some upstream problem that needs to be addressed.
I/O rates and loads pattern	In this context, rates refers to the amount of data that applications are currently reading from or writing to MarkLogic Server databases (throughput), and loads refers to the execution time of the current read and write requests, which includes the time requests spend in the wait queue when maximum throughput is achieved. Under normal circumstances you will see loads go up as rates go up. As the workload (number of queries and updates) increases, a steadily high rates value indicates that the maximum database throughput has been achieved. When this occurs, you can expect to see increasing loads, which reflect the additional time requests are spending in the wait queue. As the workload decreases, you can expect to see decreasing loads, which reflect fewer requests in the wait queue. If, while the workload is steady, rates decrease and loads increase, something is probably taking away I/O bandwidth from the database. This may indicate that MarkLogic Server has started a background task, such as a merge operation, or that some process outside of MarkLogic Server is taking away I/O bandwidth.
Journal and save write rates and loads pattern	During a merge, you should see the rates for journal and save writes decrease and the loads increase. Once the merge is done, journal and save write rates should increase and the loads should decrease. If no merge is taking place, then a process outside of MarkLogic Server may be taking away I/O bandwidth.
XDQP rates and loads pattern	In this context, rates refers to the amount of data that hosts are currently reading from or writing to other hosts, and loads refers to the execution time of the current read and write requests, including those in the wait queue. A decrease in rates and an increase in loads may indicate that there is a network problem.
Cache hit/miss rates	A high cache hit rate means that fewer fragments have to be read off disk, so there is less I/O load. An increasing cache miss rate may indicate a need to increase the cache size, write queries that take advantage of indexes to reduce the frequency of disk reads, or adjust the fragment size to better match that of the queried data.
Concurrent updates and reads are in progress	An increase of both updates and reads may indicate that there are queries that are doing too many updates and reads concurrently. The potential problem is lock contention between the updates and reads on the same fragments, which degrades performance.
Database merges are in progress	Merges require both I/O and disk resources. If too many database merges are taking place at the same time, it may be necessary to coordinate merges by creating a merge policy or establishing merge blackout periods, as described in Understanding and Controlling Database Merges in the Administrator's Guide.
Reindexes are in progress	Database reindexing is periodically done automatically in the background by MarkLogic Server and requires both CPU and disk resources. If there are too many reindexing processes going on at the same time, you may need to adjust when reindexing is done for particular databases, as described in Text Indexing in the Administrator's Guide.
Backups and/or restores are in progress	Backup and restore processes can impact the performance of applications and other background tasks in MarkLogic Server, such as merges and indexing. Backups with point-in-time recovery enabled have an even greater impact on performance. If backup and/or restore processes are impacting system performance, it may be necessary to reschedule them, as described in Backing Up and Restoring a Database in the Administrator's Guide.
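The I/O rates and loads reasoning in the table above can be summarized as a small decision procedure. The following Python sketch is one possible encoding; the function name, the category labels, and the 5 percent tolerance are assumptions, and the inputs are fractional changes between two sampling windows that you would compute from your own metrics.

```python
def classify_io_pattern(rate_change, load_change, workload_change, tolerance=0.05):
    """Classify a rates/loads pattern from fractional changes (for example, -0.2 is a 20% drop).

    The labels and the tolerance are illustrative; they are not Ops Director output.
    """
    def trend(delta):
        if delta > tolerance:
            return "up"
        if delta < -tolerance:
            return "down"
        return "steady"

    rate, load, workload = trend(rate_change), trend(load_change), trend(workload_change)

    if workload == "steady" and rate == "down" and load == "up":
        # Something is probably taking I/O bandwidth away: a merge or an outside process.
        return "bandwidth-contention"
    if workload == "up" and rate == "steady" and load == "up":
        # Rates have stopped rising while requests queue. The sketch cannot tell
        # whether the steady rate is also a high one; check that separately.
        return "throughput-saturated"
    if workload == "down" and load == "down":
        return "queue-draining"
    if rate == "up" and load == "up":
        return "normal"
    return "inconclusive"


print(classify_io_pattern(rate_change=-0.3, load_change=0.4, workload_change=0.0))
```

A result of "bandwidth-contention" is only a prompt to look for running merges, reindexing, backups, or processes outside of MarkLogic Server; it does not identify the cause.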

Assess Severity of Problems in the System

The MONITOR View alerts you to the more serious problems in your MarkLogic clusters. If you encounter a serious problem in which MarkLogic Server is unable to effectively service your applications, use the following problem analysis to troubleshoot the system.

Problem	Analysis and Workaround/Solution
MarkLogic Server aborts or fails to start	This may indicate that there is not enough disk space for the log files on the log file device. If this is the cause, you will need to either add more disk space or free up enough disk space for the log files.
An application is unable to update data in MarkLogic Server	This may indicate that you have exceeded the 64-stand limit for a forest. This can happen when the forest runs out of merge space or when merges are suppressed.
Queries failing with SVC-MEMALLOC messages	This indicates that there is not enough memory or swap space. You may need to add memory or reconfigure your swap memory, as described in Tuning Query Performance in MarkLogic Server in the Query Performance and Tuning Guide.
Forests in the async replicating state	This state indicates that a primary forest is asynchronously catching up to its replica forest after a failover or that a new replica forest was added to a primary forest that already contains content. If a forest has failed over, see Scenarios that Cause a Forest to Fail Over in the Scalability, Availability, and Failover Guide for possible causes.
Messages of the error level and higher in the log files	The various log levels are described in Understanding the Log Levels in the Administrator's Guide. All log messages at the error level and higher should be investigated, whereas lower-level messages, such as warnings and debug messages, are mostly informational.
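To act on the last row of the table above, you can filter a log file for messages at the error level and higher. The following Python sketch assumes each log line has the form "date time Level: message" and that the level names run from Finest up to Emergency; verify both assumptions against your own log files and the Administrator's Guide. The sample lines are invented placeholders, not real server output.

```python
import re

# Assumed ordering of log levels, lowest to highest.
LEVELS = ["Finest", "Finer", "Fine", "Debug", "Config", "Info",
          "Notice", "Warning", "Error", "Critical", "Alert", "Emergency"]
RANK = {name.lower(): position for position, name in enumerate(LEVELS)}

# Assumed line layout: "<date> <time> <Level>: <message>".
LINE_RE = re.compile(r"^\S+ \S+ (?P<level>[A-Za-z]+): (?P<message>.*)$")


def lines_at_or_above(lines, min_level="Error"):
    """Return the lines whose level is min_level or higher; unparsable lines are skipped."""
    floor = RANK[min_level.lower()]
    selected = []
    for line in lines:
        text = line.rstrip("\n")
        match = LINE_RE.match(text)
        if match is None:
            continue
        rank = RANK.get(match.group("level").lower())
        if rank is not None and rank >= floor:
            selected.append(text)
    return selected


# Invented placeholder lines that follow the assumed layout.
sample = [
    "2017-01-01 00:00:01.000 Info: placeholder informational message",
    "2017-01-01 00:00:02.000 Warning: placeholder warning message",
    "2017-01-01 00:00:03.000 Error: placeholder error message",
    "2017-01-01 00:00:04.000 Critical: placeholder critical message",
]
for line in lines_at_or_above(sample):
    print(line)
```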

Log messages that indicate a particularly serious problem are listed in the following table.

Error message	Root Cause Analysis
Repeated server restart messages	Possible causes include a corrupted forest, segmentation faults, or some problem with the host's operating system.
XDQP disconnect	Possible causes include an XDQP timeout or a network failure.
Forest unmounted	Possible causes include a disabled forest, a forest that has run out of merge space, or corrupted forest data.
SVC-* errors	These are system-level errors that result from timeouts, socket connect issues, lack of memory, and so on.
XDMP-BAD errors	These indicate serious internal error conditions that should not happen. Look at the error text for details and the logs for context. If you have an active maintenance contract, you can contact MarkLogic Technical Support.
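The two error code families in the table above can be picked out of log text by their prefixes. The following Python sketch maps a message to the matching root-cause hints from the table; the regular expressions and the function name are assumptions, and only the SVC- and XDMP-BAD prefixes come from this chapter.

```python
import re

# Prefix patterns paired with the root-cause hints given in the table above.
CODE_HINTS = [
    (re.compile(r"\bSVC-[A-Z]+"),
     "system-level error: timeouts, socket connect issues, lack of memory, and so on"),
    (re.compile(r"\bXDMP-BAD\w*"),
     "serious internal error condition: check the error text and the logs for context"),
]


def root_cause_hints(message):
    """Return a list of (matched code, hint) pairs found in a log message."""
    hints = []
    for pattern, hint in CODE_HINTS:
        for match in pattern.finditer(message):
            hints.append((match.group(0), hint))
    return hints


# SVC-MEMALLOC is the code named earlier in this chapter; the surrounding text is a placeholder.
for code, hint in root_cause_hints("placeholder text SVC-MEMALLOC placeholder text"):
    print(code, "->", hint)
```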