MarkLogic Server is designed for extremely large content sets, and to support these large content sets, scales to clusters of hundreds of machines, each of which runs MarkLogic Server. Each machine in a MarkLogic Server cluster is called a host, and a host is sometimes referred to as a node in the cluster. This chapter provides an overview of the goals of MarkLogic Server scalability, and includes the following sections:
For information on how clusters work, see Clustering in MarkLogic Server.
When determining the scale of a MarkLogic Server deployment, there are several factors to analyze. These factors are largely requirements-based. The size of a deployment, including how much hardware a deployment should use, will vary based on these and other factors. This section is not intended to provide an exact formula for determining machine requirements, but rather it highlights the types of questions you should be asking. For a specific requirements analysis of your environment, contact MarkLogic Consulting. The following topics are included:
One of the primary drivers of a large-scale deployment of MarkLogic Server is a large amount of content. Large can mean many things, but typically it means 100s of gigabytes, terabytes (1000s of gigabytes), or more, measured in terms of raw XML. The disk space used by a database will range from less than the size of the XML content to approximately 10 times the size, depending on the indexing options, the number of range indexes, and how well the content compresses.
To handle extremely large content sets, you scale the number of data nodes in the MarkLogic Server cluster. MarkLogic Server is designed to scale to 100s of nodes or beyond.
Many applications have fast query performance as an important requirement. MarkLogic Server is built with solid foundations derived from both database and search engine architectures. Consequently, updates become available for querying as soon as they commit and queries against extremely large content sets return very quickly.
MarkLogic Server evaluates queries on an evaluator node (e-node), and the e-node gathers any needed content from the data nodes (d-nodes). If the content set is large and spread across several d-nodes, each of those d-nodes is involved in the query (even if only to inform the e-node that it has no content matching the query). The calls to the d-nodes are returned to the e-node in parallel, and because the content is indexed when it is loaded, the calls to each of the d-nodes return very quickly.
The number of users accessing an application is another dimension in which to measure scalability. In MarkLogic Server, users typically access an evaluator node, and a single e-node can service a large number of users. How large a number will depend on what the application is doing, how large the server is (for example, how much memory and how many vCPUs), as well as other factors. When the e-nodes are getting maxed out, you can scale the system by adding more e-nodes.
If your scalability challenge is large numbers of users, you can add e-nodes to a MarkLogic Server cluster and be able to see immediate benefits by routing queries across the various e-nodes in the cluster. The process of adding an e-node is simply a matter of bringing another MarkLogic Server machine online and joining that machine to the cluster.
As your content grows in size, you might need to add forests to your database. There is no limit to the number of forests in a database, but there are some guidelines for individual forest sizes where, if the guidelines are greatly exceeded, then you might see performance degradation.
The numbers in these guidelines are not exact, and they can vary considerably based on the content. Rather, they are approximate, rule-of-thumb sizes. These numbers are based on average sized fragments of 10k to 100k. If your fragments are much larger on average, or if you have a lot of large binary documents, then the forests can probably be larger before running into any performance degradation.
The rule-of-thumb maximum size for a forest is 512GB. Each forest should ideally have two vCPUs of processing power available on its host, with 8GB memory per vCPU. For example, a host with eight vCPUs and 64GB memory can manage four 512GB forests. For bare-metal systems, a hardware thread (hyperthread), is equivalent to a vCPU. It is a good idea to run performance tests with your own workload and content. If you have many configured indexes you may need more memory. Memory requirements may also increase over time as projects evolve and forests grow with more content and more indexes.
Another requirement that becomes increasingly important as an application becomes business-critical is high availability. A high availability system can continue functioning and processing queries, even if a part of the system goes down. For example, if a computer that hosts a data node has a hardware failure that causes it to shut down, a high availability system will be able to recover from such a failure. This section describes the following aspects of high availability in MarkLogic Server:
MarkLogic Server is designed for large-scale, production-class applications. These types of applications require fast and reliable performance, and also require robust recovery from power outages, application errors, or software failures.There are many features in MarkLogic Server designed to keep the server running and available, whether your deployment is a single instance of MarkLogic Server or has hundreds of servers:
When examining your system requirements, if you find that having extreme levels of system uptime is critical to your application, then high availability is clearly an important requirement for your system. When thinking about high availability for a system, you perform a thorough examination of all points of failure for your system, throughout the entire software and hardware stack in which your application runs.
From the MarkLogic Server point of view, consider how (and if) to deal with failures for both d-nodes and for e-nodes. For d-node failure, you can choose one of the failover types (see High Availability With the Local-Disk and Shared-Disk Failover). For e-node failure, the options are slightly different for applications that use the different types of App Servers (HTTP Servers, XDBC Servers, WebDAV Servers, and CPF-based applications or others that use the Task Server) available in MarkLogic Server. For example, a hardware or software-based load balancer or router might be appropriate for HTTP Server-based application, while you might want to code that logic directly into your XCC-based applications (for example, by taking advantage of the XCC Connection Provider SPI package to build failover capability directly into your XCC stack without modifying your applications).
As with any requirements, you tend to make trade-offs when deciding the best way to plan for high availability. Those trade-offs will consider factors such as the cost of potential system downtime, the cost of hardware and software, and the complexity of administration, and compare those costs with the benefits in terms of system availability. There is no one answer that suits everyone; the right answer for you depends on your own cost/benefit analysis.
In addition to the standard high-availability features listed above, MarkLogic Server provides high availability for content hosted on d-nodes with failover for forests. There are two types of failover: local-disk failover and shared-disk failover. Local-disk failover allows you to specify replica forests on another host to fail over to, and shared-disk failover allows you to specify failover hosts for hosts that have forest data, and the failover host takes over the role of another host in the event of its becoming unavailable. Both types of failover detect when a host is down and perform the needed changes to have another host take over. For more details about failover, see High Availability of Data Nodes With Failover.
As your database size grows, having more memory on the system will generally make performance faster. MarkLogic Server caches content, indexes, and other information in memory, as well as memory maps selected indexes, including range indexes. Adding more memory allows you to raise memory settings (initially set during installation based on the memory on the machine), which can greatly improve performance. While memory can be relatively expensive, it does tends to have a good price/performance ratio.
MarkLogic Server is also designed to take advantage of multiple processor cores and hardware threads (hyperthreads).
More cores and threads support more system scalability, allowing you to manage more forests on a host, allow a host to process more queries simultaneously, and allow greater concurrency for a wide range of system activities.
The faster and more scalable the hardware is, the faster certain long running or recurring administrative tasks can complete. For example, MarkLogic Server periodically merges forest content, removing obsolete versions of the content. Additionally, reindexing a database can be a resource intensive operation. These types of administrative tasks benefit from fast, scalable hardware with plenty of memory.