
Monitoring MarkLogic Guide — Chapter 3

MarkLogic Server Monitoring History

This chapter describes how to use the Admin Interface and the Monitoring History dashboard to capture and make use of historical performance data for a MarkLogic cluster. The same Monitoring History operations can also be performed using the XQuery and REST APIs, as described in the XQuery and XSLT Reference Guide and the MarkLogic REST API Reference.

All MB and GB metrics described in this chapter are base-2.
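
As a minimal illustration of the base-2 convention, the sketch below converts raw byte counts into the MB and GB units used in this chapter. The function names are hypothetical and are not part of any MarkLogic API.

```python
# Base-2 units, as used for all MB and GB metrics in this chapter.
BYTES_PER_MB = 2 ** 20  # 1,048,576 bytes
BYTES_PER_GB = 2 ** 30  # 1,073,741,824 bytes


def bytes_to_mb(n_bytes: int) -> float:
    """Convert a byte count to base-2 megabytes (hypothetical helper)."""
    return n_bytes / BYTES_PER_MB


def bytes_to_gb(n_bytes: int) -> float:
    """Convert a byte count to base-2 gigabytes (hypothetical helper)."""
    return n_bytes / BYTES_PER_GB
```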

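The main topics in this chapter are:

  • Overview
  • Enabling Monitoring History on a Group
  • Setting the Monitoring History Data Retention Policy
  • Viewing Monitoring History
  • Historical Performance Charts by Resource
  • Exporting and Printing Monitoring History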

Overview

The Monitoring History feature allows you to capture and view critical performance data from your cluster. Once the performance data has been collected, you can view the data in the Monitoring History page. The top-level Monitoring History page provides an overview of the performance metrics for all of the key resources in your cluster. For each resource, you can drill down for more detail. You can also adjust the time span of the viewed data and apply filters to view the data for select resources to compare and spot exceptions.

By default, the performance data is stored in the Meters database. Monitoring History capture is enabled at the group level. Typically you have one group per cluster. You can also configure a consolidated Meters database that captures performance metrics from multiple groups. The group configuration defines which database is used to store performance metrics for that group (defaulting to a shared Meters database per cluster), as well as all configuration parameters for performance metrics, such as the frequency of data capture and how long to retain the performance data. The Meters database can participate in all normal database replication, security, and failover operations.

Enabling Monitoring History on a Group

In order to collect monitoring history data for your cluster, you must enable performance metering for your group.

  1. Log into the Admin Interface.
  2. Click the Groups icon on the left tree menu.
  3. Locate the Performance Metering Enabled field toward the bottom of the Group Configuration page and click on true.

    You can configure the parameters for collecting monitoring history, as described in the table below.

    Parameter | Description
    meters database | The database in which performance monitoring history and usage metrics documents are stored. By default, historical performance and usage metrics are stored in the Meters database.
    performance metering period | The performance metering period, in minutes. Performance data is collected at each period. The period can be any value of 1 minute or more. If you are collecting monitoring history for multiple groups, you should either set the same period for each group or configure your filter to view the history data for one group at a time.
    performance metering retain raw | The number of days raw performance monitoring history data is retained. See Setting the Monitoring History Data Retention Policy for details.
    performance metering retain hourly | The number of days hourly performance monitoring history data is retained. See Setting the Monitoring History Data Retention Policy for details.
    performance metering retain daily | The number of days daily performance monitoring history data is retained. See Setting the Monitoring History Data Retention Policy for details.
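
The parameters in the table above can be thought of as one small configuration record per group. The following sketch assembles such a record and enforces the one rule the table states (a period of 1 minute or more). The hyphenated property names simply mirror the parameter names in the table and are an assumption, as is the `build_metering_config` helper itself; consult the MarkLogic REST API Reference for the exact names your version expects.

```python
def build_metering_config(period_minutes: int,
                          retain_raw_days: int = 7,
                          retain_hourly_days: int = 30,
                          retain_daily_days: int = 90,
                          meters_database: str = "Meters") -> dict:
    """Assemble a group-level metering configuration (hypothetical helper).

    The retention defaults are the default retention periods listed in this
    chapter. The property names are assumptions modeled on the table above.
    """
    if period_minutes < 1:
        raise ValueError("performance metering period must be 1 minute or more")
    return {
        "performance-metering-enabled": True,
        "meters-database": meters_database,
        "performance-metering-period": period_minutes,
        "performance-metering-retain-raw": retain_raw_days,
        "performance-metering-retain-hourly": retain_hourly_days,
        "performance-metering-retain-daily": retain_daily_days,
    }
```

For example, `build_metering_config(5)` describes a group that captures data every 5 minutes with the default retention values.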

Setting the Monitoring History Data Retention Policy

The retention policy (for raw, hourly, daily) is a value set in days. If performance metering is enabled, then all data that is older than that many days for the specified period (raw, hour, day) is deleted. The retention policy is set at a group level, so different groups can have different retention policies. For example, GroupA may have raw set to 1 day and GroupB may have raw set to 10 days. The cleanup code follows this retention value on a per-group basis.

There are cases where metering data may become orphaned, so it may no longer belong to an existing group. Some examples of when this could occur are:

  • Deleting a group
  • Importing metering data from another cluster

Any metering data that no longer belongs to an active group in the current cluster is deleted. To avoid this, either turn off metering or, instead of deleting a group, move its hosts out of the group and keep the group in the cluster configuration.

Older Monitoring History data that you load (for example, by restoring a backup of the Meters database) is immediately subject to the data retention policy. You should therefore turn off performance metering before restoring any data that is older than the time specified by your retention policy.

Data is deleted no sooner than the retention policy specifies, but, for various reasons, it may be kept for an unspecified amount of time beyond that.

Changing the retention policy from smaller to larger values does not restore data that has already been deleted.

The default data retention policy settings are shown in the table below. To maximize efficiency, it is a best practice to retain raw data for the fewest days and daily data for the most days.

Period | Retention Period
Raw | 7 Days
Hourly | 30 Days
Daily | 90 Days
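
To make the retention arithmetic concrete, the sketch below computes the cutoff time for each period from the default values above and checks whether a captured document is old enough to be eligible for deletion. It is an illustration only: the function names are hypothetical, and, as noted earlier, the actual cleanup may keep data for some time after it becomes eligible.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default retention periods, in days, from the table above.
DEFAULT_RETENTION_DAYS = {"raw": 7, "hourly": 30, "daily": 90}


def retention_cutoff(now: datetime, period: str,
                     retention_days: Optional[dict] = None) -> datetime:
    """Return the time before which data for `period` is eligible for deletion."""
    days = (retention_days or DEFAULT_RETENTION_DAYS)[period]
    return now - timedelta(days=days)


def is_eligible_for_deletion(captured_at: datetime, now: datetime, period: str,
                             retention_days: Optional[dict] = None) -> bool:
    """True if data captured at `captured_at` is older than the retention period."""
    return captured_at < retention_cutoff(now, period, retention_days)
```

Because retention is set per group, a caller would pass a different `retention_days` mapping for each group (for example, `{"raw": 1, ...}` for GroupA and `{"raw": 10, ...}` for GroupB).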

Viewing Monitoring History

You can display the Monitoring History by doing the following:

  1. Open a browser and enter the URL:
    http://monitor-host:8002/

    where monitor-host is a host in the cluster you want to monitor

  2. At the top of the page, click on Monitoring and then click on History in the pull-down menu.

  3. The Monitoring History page appears. From the Monitoring History Overview page, you can navigate to any of the pages described in this chapter.

Each line in a chart represents a metric for the resource. In the Overview page, the lines represent an aggregate of the metrics for all of the cluster resources. In each Details page, the lines represent the metric for each specific resource.

Each point on a line represents a period in which the performance data was captured. Hovering over a chart point displays the name of the resource metric, along with the performance value for the metric at that point in time.

The displayed metrics (in megabytes per second) are color coded. You can display a legend that indicates which colors represent which metrics by clicking on the red dot in the upper right-hand section of the graph. To close the legend, click on the 'x' in the upper right-hand portion of the legend window.

To simplify the view of charts on a page, you can collapse a chart or a group of charts for a resource by clicking on the triangle in the upper right-hand portion of the chart or chart group.

To expand a collapsed chart view, click on the triangle in the upper right-hand portion of the collapsed chart.

Viewing Monitoring History by Time Span and Frequency

As described in Enabling Monitoring History on a Group, the frequency at which performance metrics are captured is configurable, in one-minute increments. The snapshots of performance metrics for each host are rolled up into a summary document that contains aggregate calculations on the values for that host.

You can configure your view of the captured performance data by time span and frequency.

The Time Span settings are located in the upper left-hand corner of the Monitoring History page.

There are three basic settings you can adjust to control how the data is displayed:

  • A date/time range, down to the granularity of a minute, that determines the time span of the displayed data. (By default, this is the last 24 hours.)
  • A period interval that determines the frequency of the displayed data. The possible intervals are shown in the following table.
    Period | Description
    Raw | Display the performance data just as it was captured with the set frequency.
    Hour | Display the performance data, in aggregate form, per hour. (This is the default.)
    Day | Display the performance data, in aggregate form, per day.
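
The difference between the Raw and Hour periods can be sketched as a simple rollup: raw samples are grouped by the hour in which they were captured and each group is summarized. The source does not specify which aggregate calculations MarkLogic performs, so the count, minimum, maximum, and mean used here are assumptions, and `rollup_hourly` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def rollup_hourly(samples):
    """Group (timestamp, value) samples by hour and summarize each hour.

    The choice of aggregates (count/min/max/mean) is an assumption made
    for illustration only.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Truncate the timestamp to the start of its hour.
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    return {
        hour: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for hour, values in sorted(buckets.items())
    }
```

A Day rollup would follow the same pattern, truncating each timestamp to the start of its day instead.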

You can 'zoom in' to display part of the timespan. On any chart, position the pointer at the begin time of your 'zoom', press and hold the left mouse button, and drag to the end time. The selected timeframe is highlighted and the zoomed-in time is displayed for all of the charts on the page. Navigating to another Monitoring History page resets all of the charts to the timespan selected in the TIME SPAN panel.

After changing the time span, the period, or both, click on refresh to display the updated charts. Clicking refresh also applies any changes you have made to the Filters settings. For details about filters, see Filtering Monitoring History by Resources. If you have zoomed into a portion of a timespan, refresh redisplays the charts using the timespan selected in the TIME SPAN panel.

You can use the Shortcut links to display either the last hour, day or 30 days of performance data. Selecting a Shortcut link will automatically refresh the displayed charts.

Each Shortcut also sets the Period value, as shown in the table below.

Shortcut | Period
1h | Raw
1d | Hour
30d | Day
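
The Shortcut behavior can be summarized as a lookup from the shortcut name to a time span and a period. The sketch below is a hypothetical model of that mapping, not the dashboard's actual code; `resolve_shortcut` and the lowercase period names are assumptions.

```python
from datetime import datetime, timedelta

# Shortcut -> (length of the time span, period), following the table above.
SHORTCUTS = {
    "1h": (timedelta(hours=1), "raw"),
    "1d": (timedelta(days=1), "hour"),
    "30d": (timedelta(days=30), "day"),
}


def resolve_shortcut(shortcut: str, now: datetime):
    """Return (start, end, period) for a Shortcut link selected at time `now`."""
    span, period = SHORTCUTS[shortcut]
    return now - span, now, period
```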

Labeling Monitoring History Time Spans

You can use the Label feature to capture and tag metrics for the set time span. You can store any number of labels. These labels can be used to identify events, instances, and periods of time. Labels can be added, updated or deleted at any time. Labels themselves are not stored with the raw metric data. They are only used for reporting purposes.

  1. To create a label for your current view of the Monitoring History data, select New Label from the Label pull-down menu.

  2. In the Create a New Label popup window, the name of the label is the time span of the currently displayed charts, by default.

  3. You can keep the default name for the label, or change it to be more descriptive. Click Save.

  4. You can edit your label names or delete labels by selecting Edit Labels from the Labels pull-down menu.

  5. In the Edit Labels popup window, you can either edit the label name or delete the label. To delete a label, hover over the label and click on the garbage can icon to the right. When finished editing, click Close.

    If you edit a label and, before closing the Edit Labels window, decide not to save your edits, press the Esc key to terminate the edits and keep the original labels.

  6. You can view all of the labels that have data within the currently selected timespan by clicking on the triangle to the right of the Labels section at the top of the Monitoring History page to expand the Labels chart.

  7. Each label appears as a timeline. Hover over a timeline to display the label name. Click on a timeline to update the view to the time span associated with the label. Selecting a timeline is functionally equivalent to selecting a label from the Label menu in that it updates the view with the start and end times in the TIME SPAN panel.

If your labeled data has been purged from the Meters database, as the result of the retention policy or some other reason, the label will remain but there will be no data associated with that label.

  8. You can click on the label icon at the top right-hand portion of the page to create a label for the currently displayed time span. Follow the same procedure as described in steps 2 and 3 to finish creating the label.

If the data for a label does not fall within the currently displayed timespan, the label will not be displayed in the Labels chart. To display the charts for such labels, select the label from the Label pull-down menu.

Filtering Monitoring History by Resources

You can set filters for select resources to display only the stored performance metrics for those resources. You can filter by groups and databases and, within each group, by hosts and servers. By default, the metrics for all of the resources in the cluster are displayed.

Filter types that are active for the current view have headings highlighted in blue. For example, on the Overview page, all filters are active while on the Databases Detail view, only database resources are active.

In the filters panel, you can check or uncheck a resource to display or not display the performance metrics for that resource.

To focus on the resources of interest, you can collapse a category by clicking on the triangle in the right-hand section of the panel. The number of resources in the collapsed category is displayed.

Clicking the checkmark updates the charts with the current filter settings. It does not apply any changes that may have been made to the above TIME SPAN settings.

You can mouse over the resource names in the filter list to get extra information about the resources. For example, mousing over a host name shows the number of forests associated with the host and mousing over a server name shows the server type.

Historical Performance Charts by Resource

From the Monitoring History dashboard, you can view Overview and Detailed performance metrics in graph form for each resource in the cluster. In the Overview page, the lines on a graph represent an aggregate of the metrics for all of the cluster resources of that type. In each Details page, the lines represent the metric for each specific resource in the cluster.

To view the Detail page for a resource, click on the down arrow at the upper left-hand section of the resource graph on the Overview page.

To return to the Overview page from a Detail page, click on the up arrow at the upper left-hand section of the resource graph on the Detail page.

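This section describes the Overview and Detail pages for the following resources:

  • Disk Performance Data
  • CPU Performance Data
  • Memory Performance Data
  • Server Performance Data
  • Network Performance Data
  • Database Performance Data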

Disk Performance Data

The Overview page displays a graph of the aggregate I/O performance data for the disks used by the hosts selected in the filter.

As described in Viewing Monitoring History, you can hover on a period point to view what disk operation was taking place at that point in time. Each performance metric is described in the table below.

Metric | Description
Writes | The disk I/O performance during journal and save write operations. This is the sum of journal-write-rate, save-write-rate, and large-write-rate.
Query Traffic | The disk I/O performance during a query or queries. This is the sum of query-read-rate and large-read-rate.
Merge Reads | The disk I/O performance during a merge read operation.
Merge Writes | The disk I/O performance during a merge write operation.
Backup | The disk I/O performance during a backup operation. This is the sum of backup-write-rate and backup-read-rate.
Restore | The disk I/O performance during a restore operation. This is the sum of restore-read-rate and restore-write-rate.
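
The sums described in the table above can be written out as a short calculation. In this sketch, the input is a dictionary of per-operation rates keyed by the metric names from the table. The names `merge-read-rate` and `merge-write-rate` are not given in the table and are assumed by analogy with the other names; `disk_overview` is a hypothetical helper.

```python
def disk_overview(rates: dict) -> dict:
    """Combine detailed disk rates into the Overview metrics.

    Missing rates are treated as 0.0. The merge-read-rate and
    merge-write-rate keys are assumed names, not taken from the source.
    """
    def rate(name: str) -> float:
        return rates.get(name, 0.0)

    return {
        "Writes": rate("journal-write-rate") + rate("save-write-rate")
                  + rate("large-write-rate"),
        "Query Traffic": rate("query-read-rate") + rate("large-read-rate"),
        "Merge Reads": rate("merge-read-rate"),
        "Merge Writes": rate("merge-write-rate"),
        "Backup": rate("backup-write-rate") + rate("backup-read-rate"),
        "Restore": rate("restore-read-rate") + rate("restore-write-rate"),
    }
```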

Click on the arrow in the upper left-hand section of the DISKS graph in the Overview page to view charts that present more detailed disk performance metrics.

The metrics displayed by the charts on the DISKS DETAIL page are described in the table below.

Chart | Definition of Displayed Metric
Journal Write Rate | The moving average of data writes to the journal.
Save Write Rate | The moving average of data writes to in-memory stands.
Query Read Rate | The moving average of reading query data from disk.
Merge Read Rate | The moving average of reading merge data from disk.
Merge Write Rate | The moving average of writing data for merges.
Backup Rate | The moving average of reading and writing backup data to disk. This is the sum of backup-write-rate and backup-read-rate.
Restore Rate | The moving average of reading and writing restore data from disk. This is the sum of restore-read-rate and restore-write-rate.
Large Binary Read Rate | The moving average of reading large documents from disk.
Large Binary Write Rate | The moving average of writing data for large documents to disk.

By default, Host data is viewed in aggregated form, and it must be viewed that way if multiple hosts are selected. On the DISKS DETAIL page, you can roll over any Host filter to reveal the Select and Expand button. Clicking this button deselects all of the other Hosts across all Groups and applies all pending filter changes. The expanded charts display the data for each forest in that host as a separate line in each chart.

To return to the aggregate view, click on the Aggregate button on an expanded Host. Doing so also applies all pending filter changes to the displayed charts.

CPU Performance Data

The Overview page displays a graph of the aggregate performance data for the CPUs used by the hosts selected in the filter.

As described in Viewing Monitoring History, you can hover on a period point to view what CPU operation was taking place at that point in time. Each performance metric in the CPU Overview chart is described in the table below.

Metric | Description
User | Total percentage of CPU used running user processes that are not niced.
Nice | Total percentage of CPU used running user processes that are niced.
System | Total percentage of CPU used running the operating system kernel and its processes.
I/O Wait | Total percentage of CPU time spent waiting for I/O operations to complete.
IRQ | Total percentage of CPU utilization for servicing soft interrupts.
Steal | Total percentage of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).

Click on the arrow in the upper left-hand section of the CPU graph in the Overview page to view graphs that present more detailed CPU performance metrics. The charts on the CPU DETAIL page are described in the table below.

Chart | Description
I/O Wait | The percentage of CPU used waiting for I/O operations to complete for each host.
User | The percentage of CPU used running user processes that are not niced for each host.
System | The percentage of CPU used running the operating system kernel and its processes for each host.
Nice | The percentage of CPU used running user processes that are niced for each host.
Steal | The percentage of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine) for each host.
Idle | The percentage of CPU that is not doing any work for each host.
IRQ | The percentage of CPU servicing soft interrupts for each host.

Memory Performance Data

The Overview page displays a graph of the aggregate performance data for the Memory used by the hosts selected in the filter.

As described in Viewing Monitoring History, you can hover on a period point to view the value of a memory metric at that point in time. Each chart and its associated performance metrics are described in the table below.

Chart | Description

Memory Footprint | The total amount (in GB) of memory consumed by all of the hosts in the cluster. The displayed metrics are:

  • RSS: The total amount (in GB) of Process Resident Size (RSS) consumed by the cluster.
  • Anon: The total amount (in GB) of Process Anonymous Memory consumed by the cluster.

Memory I/O | The number of pages per second moved between memory and disk. The displayed metrics are:

  • Page-In Rate: The page-in rate (from Linux /proc/vmstat) for the cluster in pages/sec.
  • Page-Out Rate: The page-out rate (from Linux /proc/vmstat) for the cluster in pages/sec.
  • Swap-In Rate: The swap-in rate (from Linux /proc/vmstat) for the cluster in pages/sec.
  • Swap-Out Rate: The swap-out rate (from Linux /proc/vmstat) for the cluster in pages/sec.

Click on the arrow in the upper left-hand section of the MEMORY graph in the Overview page to view graphs that present more detailed MEMORY performance metrics. The charts on the MEMORY DETAIL page are described in the table below. The displayed metrics are drawn from /proc/vmstat.

Chart | Description
RSS | The amount (in GB) of Process Resident Size (RSS) for each host in the cluster.
Anon | The amount (in GB) of Process Anonymous Memory for each host in the cluster.
Page-In Rate | The page-in rate (in pages/sec) for each host in the cluster.
Page-Out Rate | The page-out rate (in pages/sec) for each host in the cluster.
Swap-In Rate | The swap-in rate (in pages/sec) for each host in the cluster.
Swap-Out Rate | The swap-out rate (in pages/sec) for each host in the cluster.
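
For readers who want to see how a rate can be derived from /proc/vmstat, the sketch below parses two snapshots of the file's text and divides the counter differences by the elapsed time. The mapping of chart names to the `pgpgin`, `pgpgout`, `pswpin`, and `pswpout` counters is an assumption; the source only states that the metrics are drawn from /proc/vmstat, and it does not describe how MarkLogic computes them. The rates are expressed in whatever unit the kernel uses for each counter.

```python
# Assumed mapping from chart names to Linux /proc/vmstat counter names.
VMSTAT_COUNTERS = {
    "Page-In Rate": "pgpgin",
    "Page-Out Rate": "pgpgout",
    "Swap-In Rate": "pswpin",
    "Swap-Out Rate": "pswpout",
}


def parse_vmstat(text: str) -> dict:
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def paging_rates(before: dict, after: dict, interval_seconds: float) -> dict:
    """Compute per-second rates from two snapshots taken interval_seconds apart."""
    return {
        chart: (after[counter] - before[counter]) / interval_seconds
        for chart, counter in VMSTAT_COUNTERS.items()
    }
```

On a Linux host, the snapshots could be obtained by reading the text of /proc/vmstat twice, one metering period apart.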

Server Performance Data

The Overview page displays graphs of the aggregate performance data for the App Servers selected in the filter.

The Overview page displays the charts described in the table below.

Chart | Description
App Server Request Rate | The total number of queries being processed per second, across all of the App Servers.
App Server Latency | The average time (in seconds) it takes to process queries, across all of the App Servers.
Task Server Queue Size | The number of tasks in the Task Server queue.
Expanded Tree Cache Hits/Misses | The number of times per second that queries could use (Hits) and could not use (Misses) the expanded tree cache.

With the exception of the Task Server Queue Size chart, which only displays the queue size for the one task server, the color-coded metrics for the server charts are as shown in the table below.

Metric | Description
HTTP | The metrics for the HTTP servers.
ODBC | The metrics for the ODBC servers.
WebDAV | The metrics for the WebDAV servers.
XDBC | The metrics for the XDBC servers.
Task | The metrics for the Task server.

Click on the arrow in the upper left-hand section of the SERVERS graph in the Overview page to view graphs that present more detailed performance metrics for each App Server. The charts displayed on the SERVERS DETAIL page are described in the table below.

If there are multiple groups defined, server names have the group that they are associated with in square brackets in the legend and rollovers.

The number of servers displayed out of the number of servers of each type in the cluster (for example, HTTP) is shown in the upper right-hand section of each server type group.

The following detailed charts are displayed for each type of App Server:

Chart | Description
Request Rate | The number of queries being processed per second by each App Server.
Latency | The average time it takes each App Server to process queries.
Expanded Tree Cache Rate Hits | The number of times queries could use the expanded tree cache on each App Server.
Expanded Tree Cache Rate Misses | The number of times queries could not use the expanded tree cache on each App Server.
Queue Size (Task Server Only) | The number of tasks in the Task Server queue on each host.

Network Performance Data

The network performance data graphs display performance in terms of XDQP reads and writes. XDQP is the protocol MarkLogic uses for internal host-to-host communication on port 7999.

The Overview page displays various XDQP performance as the sum of XDQP activity across the cluster. High XDQP rates are usually not an issue unless they are so high as to saturate your internal network. Higher usage occurs during data load and query execution. Merges do not involve XDQP.

If XDQP is excessively high during loads, running the MarkLogic Content Pump (mlcp) with fast forest placement will minimize XDQP communication needs. For details on the MarkLogic Content Pump, see Loading Content Using MarkLogic Content Pump in the Loading Content Into MarkLogic Server Guide.

The Overview page displays a chart with the metrics described in the table below.

Metric | Description
XDQP Read | The total volume of all XDQP reads between hosts in the cluster. This is the sum of xdqp-client-receive-rate and xdqp-server-receive-rate.
XDQP Write | The total volume of all XDQP writes between hosts in the cluster. This is the sum of xdqp-client-send-rate and xdqp-server-send-rate.
Foreign XDQP Read | The total volume of all XDQP reads by the hosts in the cluster from a foreign cluster. This is the sum of foreign-xdqp-client-receive-rate and foreign-xdqp-server-receive-rate.
Foreign XDQP Write | The total volume of all XDQP writes by the hosts in the cluster to a foreign cluster. This is the sum of foreign-xdqp-client-send-rate and foreign-xdqp-server-send-rate.

Click on the arrow in the upper left-hand section of the NETWORK graph in the Overview page to view graphs that present more detailed performance metrics for each host in the cluster. The charts displayed on the NETWORK DETAIL page are described in the table below.

Chart | Description
XDQP Read Rate | The amount of data (in MB/sec) read over XDQP by each host in the cluster. This is the sum of xdqp-client-receive-rate and xdqp-server-receive-rate.
XDQP Write Rate | The amount of data (in MB/sec) written over XDQP by each host in the cluster. This is the sum of xdqp-client-send-rate and xdqp-server-send-rate.
XDQP Read Load | The execution time (in seconds) of read requests by each host in the cluster. This is the sum of xdqp-client-receive-load and xdqp-server-receive-load.
XDQP Write Load | The execution time (in seconds) of write requests by each host in the cluster. This is the sum of xdqp-client-send-load and xdqp-server-send-load.
Foreign XDQP Read Rate | The amount of data (in MB/sec) read over XDQP by each host in the cluster from a foreign cluster. This is the sum of foreign-xdqp-client-receive-rate and foreign-xdqp-server-receive-rate.
Foreign XDQP Write Rate | The amount of data (in MB/sec) written over XDQP by each host in the cluster to a foreign cluster. This is the sum of foreign-xdqp-client-send-rate and foreign-xdqp-server-send-rate.
Foreign XDQP Read Load | The execution time (in seconds) of read requests by each host in the cluster from a foreign cluster. This is the sum of foreign-xdqp-client-receive-load and foreign-xdqp-server-receive-load.
Foreign XDQP Write Load | The execution time (in seconds) of write requests by each host in the cluster to a foreign cluster. This is the sum of foreign-xdqp-client-send-load and foreign-xdqp-server-send-load.
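
Each XDQP chart combines a client-side and a server-side metric, and the local and foreign variants differ only in their name prefix. The sketch below expresses that pattern, following the naming used for the Overview metrics (`xdqp-...` for traffic within the cluster, `foreign-xdqp-...` for traffic with a foreign cluster). The `xdqp_totals` helper and its output keys are hypothetical.

```python
def xdqp_totals(metrics: dict, foreign: bool = False) -> dict:
    """Sum client and server XDQP metrics for one host.

    `metrics` is keyed by the metric names listed in the tables above.
    Set foreign=True to total the foreign-cluster variants instead.
    """
    prefix = "foreign-xdqp" if foreign else "xdqp"

    def total(direction: str, kind: str) -> float:
        return (metrics[f"{prefix}-client-{direction}-{kind}"]
                + metrics[f"{prefix}-server-{direction}-{kind}"])

    return {
        "read-rate": total("receive", "rate"),
        "write-rate": total("send", "rate"),
        "read-load": total("receive", "load"),
        "write-load": total("send", "load"),
    }
```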

Database Performance Data

The Overview page displays graphs of the aggregate performance data for all of the databases in the cluster.

The table below describes the charts displayed in the Databases section of the Overview page.

Chart | Description

Fragments | The aggregate number of fragments in all of the databases in the cluster. The displayed lines are:

  • Active Fragments: The fragments available to queries.
  • Deleted Fragments: The fragments to be deleted during the next merge operation.

Storage Footprint | The total disk capacity (in GB) used by all of the databases in the cluster. The displayed lines are:

  • Data Size: The amount of data in the forest data directories.
  • Fast Data Size: The amount of data in the forest fast data directories.
  • Large Data Size: The amount of data in the forest large data directories.

Lock Rate | The number of locks set per second across all of the databases in the cluster. The displayed lines are:

  • Read: The number of read locks set per second.
  • Write: The number of write locks set per second.
  • Deadlock: The number of deadlocks per second.

Lock Wait Load | The aggregate time (in seconds) transactions wait for locks. The displayed lines are:

  • Read: The time transactions wait for read locks.
  • Write: The time transactions wait for write locks.

Lock Hold Load | The aggregate time (in seconds) locks are held. The displayed lines are:

  • Read: The time read locks are held.
  • Write: The time write locks are held.

Deadlock Wait Load | The aggregate time (in seconds) deadlocks remain unresolved.

Database Replication | The amount of data (in MB per second) sent to and received from foreign clusters. The displayed lines are:

  • Database Replication Send: The amount of data sent to foreign clusters.
  • Database Replication Receive: The amount of data received from foreign clusters.

Click on the arrow in the upper left-hand section of the DATABASES graph in the Overview page to view graphs that present more detailed performance metrics for each database. The charts displayed on the DATABASES DETAIL page are described in the table below. The metrics for each database in the cluster are displayed as a separate line.

Chart | Description
Active Fragments | The number of active fragments (the fragments available to queries) in each database.
Deleted Fragments | The number of deleted fragments (the fragments to be removed by the next merge operation) in each database.
Data Size | The amount of data in the data directories of the forests attached to each database.
Fast Data Size | The amount of data in the fast data directories of the forests attached to each database.
Large Data Size | The amount of data in the large data directories of the forests attached to each database.
Read Lock Rate | The number of read locks set per second on each database.
Write Lock Rate | The number of write locks set per second on each database.
Deadlock Rate | The number of deadlocks per second on each database.
Read Lock Wait Load | The time (in seconds) transactions wait for read locks on each database.
Write Lock Wait Load | The time (in seconds) transactions wait for write locks on each database.
Deadlock Wait Load | The aggregate time (in seconds) deadlocks remain unresolved on each database.
Read Lock Hold Load | The time (in seconds) read locks are held on each database.
Write Lock Hold Load | The time (in seconds) write locks are held on each database.
Database Replication Send Rate | The amount of replication data (in MB per second) sent by each database to foreign clusters.
Database Replication Receive Rate | The amount of replication data (in MB per second) received by each database from foreign clusters.
Database Replication Send Load | The time (in seconds) it takes each database to send replication data to foreign clusters.
Database Replication Receive Load | The time (in seconds) it takes each database to receive replication data from foreign clusters.

Exporting and Printing Monitoring History

You can export and print your Monitoring History data.

To export the Monitoring History data to an Excel spreadsheet file, click Export in the upper-right portion of the Monitoring History page.

The metrics are displayed in separate tabs at the bottom of the spreadsheet.

To print out the charts displayed on the current page, click Print. This will open the printer dialog page from which you can print the charts.
