MarkLogic 10 Product Documentation
Query Performance and Tuning Guide, Chapter 1

Tuning Query Performance in MarkLogic Server

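This chapter describes some general issues involving query performance in MarkLogic Server. It covers an overview of query performance, general techniques for tuning performance, the MarkLogic Server caches, and rules of thumb for sizing.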

Overview of Query Performance

MarkLogic Server is designed to search extremely large content sets, while providing fine-grained control over the search and access of the content. Performance is always an important component in a search application. In many cases, applications will be extremely fast with no tuning whatsoever. There are, however, many tools and techniques to help make queries faster.

There are several things to consider when looking at query performance:

  • Application requirements: how fast does your application need to be? It is often useful to quantify this at application design time. Who will use the application, what those users expect of its performance, and whether the application will be publicly available are all important considerations in defining performance requirements.
  • Indexing options: what indexes are defined for the database? Indexing options play an important role in how well queries can be resolved from the indexes. The fastest way to resolve a query is directly from the indexes. For details on database options, see the chapters Databases and Text Indexing in the Administrator's Guide.
  • XQuery code: is your code written in the most efficient way possible? Sometimes, code runs more slowly than necessary because there are redundant or unneeded function calls. Or there may be a MarkLogic XQuery built-in function that performs an equivalent task more efficiently. Functions such as xdmp:estimate, cts:search, lexicon functions, and so on are all designed for fast performance.
  • More indexes and lexicons: can range indexes and lexicons speed up your queries? For queries that access values or compare those values, range indexes can greatly speed performance. Range indexes are memory-mapped structures, so they can retrieve the values without ever needing to access the documents. Lexicons are lists of words or values, and they too can greatly speed up certain types of queries.
  • Server tuning: are your server parameters set appropriately for your system? In most cases, the parameters set during installation work well for the system in which MarkLogic Server is installed. Nevertheless, there are cases where you might need to change some parameters, either for a short-term need or for ongoing needs.
  • Scalability: is your system sufficiently large for your needs? Memory, disk space and quality, swap space, number of processors, and number of servers all contribute to the overall scalability of a MarkLogic Server system. MarkLogic Server is designed to scale to very large clusters with extremely large amounts of content.
  • Analytic workloads: access patterns and resource requirements differ for analytic workloads. In general, analytic workloads access and aggregate more data per transaction, which raises the baseline memory requirements. MarkLogic Server has stated minimum memory requirements, but you should provision more memory than those minimums for analytics.

This chapter and this book, as well as the Application Developer's Guide, provide information and techniques on tuning a system for optimal performance. The nature of tuning exercises is that they tend to be content-specific, so you cannot always pinpoint a particular recipe that will work for every situation. Getting to know the tools available, the XQuery APIs, and how MarkLogic Server works is the best way to make your applications run extremely fast.

General Techniques to Tune Performance

This section lists some general techniques useful in tuning performance, and points to the places in the documentation where each subject is covered in more detail.

Search Built-In APIs

The search built-in XQuery APIs are designed to provide very fast searches. The APIs (cts:search, xdmp:estimate, cts:element-values, and so on) use the indexes for fast search performance. The composable cts:query constructors make it easy to compose complex search queries with fast performance. For details on the search built-in XQuery APIs, see MarkLogic XQuery and XSLT Function Reference. For details on the constructors, see Composing cts:query Expressions in the Search Developer's Guide.

Lexicons For Unique Word or Value Lookups

MarkLogic Server allows you to create lexicons, which are lists of unique words or values in a database. Lexicons allow for very fast lookups, and in the case of values, also provide very fast counts. For details on lexicons, see the chapter Browsing With Lexicons in the Search Developer's Guide.

Range Queries for Constraining Searches to a Range of Values

Range queries allow you to specify queries that use range indexes in a cts:query expression. Range queries can both improve performance and make it easier to build applications that constrain on values. For details on range queries, see Using Range Queries in cts:query Expressions in the Search Developer's Guide.

Positions Indexes Can Help Speed Phrase Searches

If you enable word positions in the database configuration, phrase searches can run faster. During the index resolution phase of query processing, MarkLogic Server determines if words are next to each other based on their positions. For example, if you search for the phrase "to be or not to be", MarkLogic Server can eliminate as possible matches, based on positions, most occurrences of these common words because they do not have the proper word next to them. This speeds performance in two ways: it lowers the number of I/Os needed to retrieve candidate fragments, and it makes the filtering phase faster because there are fewer candidate fragments to filter. For details about how search processing works, see Understanding the Search Process.
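The effect of positions on candidate selection can be illustrated with a toy model. The following Python sketch is not MarkLogic Server's implementation; it is a minimal in-memory positional index, with hypothetical function names and sample documents, showing how position data narrows the candidate set for a phrase compared with a plain word index.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def word_candidates(index, phrase):
    """Documents containing every word of the phrase, ignoring positions."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    docs = set(index[words[0]])
    for w in words[1:]:
        docs &= set(index[w])
    return docs


def phrase_candidates(index, phrase):
    """Documents where the phrase words occur at consecutive positions."""
    words = phrase.lower().split()
    matches = set()
    for doc_id in word_candidates(index, phrase):
        # Position sets for the second and later words of the phrase.
        later = [set(index[w][doc_id]) for w in words[1:]]
        for start in index[words[0]][doc_id]:
            if all(start + i + 1 in later[i] for i in range(len(later))):
                matches.add(doc_id)
                break
    return matches


# Hypothetical sample content: all three documents contain every word
# of the phrase, but only one contains the words in sequence.
docs = {
    "a": "to be or not to be that is the question",
    "b": "not to be confused or to be ignored",
    "c": "be kind to others or not",
}
index = build_positional_index(docs)
without_positions = word_candidates(index, "to be or not to be")
with_positions = phrase_candidates(index, "to be or not to be")
```

In this toy data, the word-only lookup leaves three candidates to fetch and filter, while the positional lookup leaves one. That difference is the saving in I/O and filtering work described above.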

Use Query Meters and Query Trace to Characterize Performance

There are two XQuery functions to help you characterize the performance of queries: xdmp:query-meters and xdmp:query-trace. The former provides timing of a query and the latter logs details of the query evaluation to the ErrorLog.txt file. For details on these APIs, see Tuning Queries with query-meters and query-trace and the MarkLogic XQuery and XSLT Function Reference.

Profiler API

MarkLogic Server has a profiler to help determine where a query is spending time processing. For details on the profiler, see Profiling Requests to Evaluate Performance and the MarkLogic XQuery and XSLT Function Reference.

Monitoring API and Status Screens

There are APIs and status screens in the Admin Interface to monitor activities on your system. These can be useful in identifying bottlenecks on your system. For details, see Monitoring MarkLogic Server Performance.

Index Options, Range Indexes, Fields

There are many types of index options, including several types of wildcard indexes, element indexes, stemmed indexes, element and attribute range indexes, and so on. Depending on your needs, these indexes can help speed performance. Indexes tend to take more disk space and increase loading times, but can greatly improve performance.

Fields are another way of improving performance, especially if you are only interested in searching through certain included elements, or you want your searches to exclude particular elements. For details on fields, see Fields Database Settings in the Administrator's Guide.

Understanding MarkLogic Server Caches

MarkLogic Server has several caches used in query processing, defined on the group configuration page. The list cache stores termlists in memory, the compressed tree cache stores compressed fragment data in memory, and the expanded tree cache stores uncompressed fragment data in memory. Additionally, there are several other caches used for security objects, modules, schemas, and so on; these other caches cannot be configured. In most cases, if the caches fill up, they will move older data out to make room for newer content.

In some cases, however, a query can fail because a cache is full. In particular, when the expanded tree cache fills up, a query can fail with an XDMP-TREECACHEFULL exception. The following are some guidelines to avoid XDMP-TREECACHEFULL errors:

  • Avoid queries that return the entire database. Instead, return the results in batches (a page at a time, like a classic search page, for example).
  • Try to rewrite the query in a more efficient way.
  • Make sure swap space is configured properly on your server.
  • If you do not have sufficient memory on your server, consider adding more memory to the system.
  • You can raise the sizes of the caches, but that might be only a temporary fix.
  • 64-bit systems are recommended. 64-bit systems can hold a lot more memory, and more memory means larger caches.
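The first guideline, returning results a page at a time, comes down to simple range arithmetic. The following Python sketch is only an illustration: the function name and parameters are hypothetical, and it assumes 1-based, inclusive positions, which is how XQuery sequences are indexed.

```python
def page_bounds(page, page_size, total):
    """Return the 1-based inclusive (start, end) positions for a page.

    Returns None when the requested page starts past the last result.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    start = (page - 1) * page_size + 1
    if start > total:
        return None
    end = min(start + page_size - 1, total)
    return start, end
```

A search page would then request only the results between start and end, rather than materializing the entire result set in one request.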

Rules of Thumb for Sizing

The following are some rule-of-thumb sizing recommendations. These recommendations are best practices based on experience with MarkLogic Server implementations, and some of them are content specific. They provide a good starting point, but experimenting on your own content is the best way to validate or refine them.

  • For each node, provide whichever is higher: the amount of memory given by the formula (threads x cores x 4 GB), or 64 GB.
  • Allot approximately 1 GB of RAM for each 10-20 GB of forest data. More memory will help, too, especially if you have a lot of range indexes and/or lexicons.
  • 64-bit systems greatly increase the address space so you can address more than 4GB of RAM.
  • Plan to configure no more than 512 primary forests per database. There is a hard limit of 1024 primary forests in a single database.
  • The swap space should be at least 2x memory (or the recommended amount for your platform, as described in Memory, Disk Space, and Swap Space Requirements in the Installation Guide). This is important to make sure MarkLogic Server does not run out of memory. At query time, MarkLogic Server asks the operating system to reserve both memory and swap space. If there is not enough of either, the query can fail with SVC-MEMALLOC messages, which can occur if you do not have the recommended amount of swap space. If you do have enough swap space and still get these errors, you need either to add memory to the system or to lower the amount of memory being used. To use less memory, modify your queries, reduce the sizes of some server caches, lower the number of threads the server can service, and so on.
  • For updates, make the journal size larger if you have a lot of range index data. Journal-full errors are a symptom of this problem.
  • For updates, make the journal size larger if your transactions span multiple forests. Each journal must keep the lock information for all documents involved in the transaction, not just for the documents that reside in that journal's forest. Journal-full errors are a symptom of this problem.
  • There is a limit of 65k for the size of a string literal or a token in an XQuery program. If you need to input a string longer than 65k, use an external variable with the xdmp:invoke API. External variables are limited to a single node or a string, and in XCC are limited to strings only. In XCC, if you need to input a node as an external variable, you must quote it as a string on input and then unquote it (xdmp:unquote) into a node in your XQuery function. Note that this limit applies only to the size of a string literal or a token; XQuery program sizes are limited only by the cache size.
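
The arithmetic behind several of these rules of thumb can be sketched in a few lines of Python. This is only an illustration of the numbers stated above, not a MarkLogic tool: the function names are hypothetical, and the threads and cores parameters are taken at face value from the memory formula, which does not define them further.

```python
RECOMMENDED_MAX_FORESTS = 512   # recommended ceiling per database
HARD_LIMIT_FORESTS = 1024       # hard limit per database


def node_memory_gb(threads, cores):
    """Higher of (threads x cores x 4 GB) and 64 GB."""
    return max(threads * cores * 4, 64)


def forest_ram_range_gb(forest_data_gb):
    """RAM range for forest data at 1 GB per 20 GB (low) to 1 GB per 10 GB (high)."""
    return forest_data_gb / 20, forest_data_gb / 10


def min_swap_gb(memory_gb):
    """Swap space of at least 2x memory (platform guidance may differ)."""
    return 2 * memory_gb


def check_forest_count(primary_forests):
    """Classify a primary forest count against the stated limits."""
    if primary_forests > HARD_LIMIT_FORESTS:
        return "exceeds hard limit"
    if primary_forests > RECOMMENDED_MAX_FORESTS:
        return "above recommended maximum"
    return "ok"


# Hypothetical node: 2 threads, 16 cores, 1000 GB of forest data.
memory = node_memory_gb(2, 16)
low, high = forest_ram_range_gb(1000)
swap = min_swap_gb(memory)
```

Treat the results as starting points to validate against your own content, in keeping with the caveat at the top of this section.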