MarkLogic supports large-scale high-performance architectures through multi-host distributed architectures. These architectures introduce additional complexity to both the planning and the deployment processes. This chapter introduces key concepts and terminology, outlines some alternative distribution strategies, and provides a high-level guide to configuring a cluster from scratch. The following topics are included:
forest A forest is a repository for documents. Each forest is managed by a single host. The mapping of which forest is managed by which host is transparent to queries, as queries are processed against databases, not forests. Hosts can manage more than one forest simultaneously.
database A database is a set of one or more forests that appears as a single contiguous set of content for query purposes. Each forest in a database must be configured consistently. HTTP, WebDAV, and XDBC servers evaluate queries against a single database. In addition to databases created by the administrator for user content, MarkLogic Server maintains databases for administrative purposes: security databases, which contain user authentication and permissions information; schema databases, which are used to store schemas used by the system; modules databases, which are used to store executable XQuery code; and triggers databases, used to store trigger definitions. There are backup and restore utilities at both the forest and the database level.
collection A collection is a subset of one or more documents within a database. Documents can belong to more than one collection simultaneously. Collections can span multiple forests, and are not relevant to the cluster configuration process.
The following diagram illustrates how the concepts of clusters, hosts and groups are implemented in an example multi-host distributed architecture. The diagram shows a single cluster involving six hosts that are segmented into two groups:
This second diagram shows how a single database called 'Documents' could comprise four forests spread across the three hosts in the group 'Data-Nodes' from the above diagram. The diagram also illustrates two internal databases maintained by the system for administrative usage:
Because MarkLogic Server takes advantage of platform-specific optimizations, it is not possible to mix and match platforms in a single cluster. This means that you will need to consider the pros and cons of different CPU architectures and operating systems across all hosts in your cluster when planning a multi-host deployment.
Not only must clusters be platform homogeneous, the internal data formats used by forests are platform-specific. Consequently, forests and forest backups are not portable across platforms. To move database or forest content from a MarkLogic Server implementation running on one platform to an implementation running on another platform, the database must be unloaded from the first system and loaded from scratch into the second database.
Even though each host in a cluster can be configured to perform a different task, the full MarkLogic Server software runs on each host. Consequently, software installation is identical across the nodes. However, hosts can be reconfigured rapidly--on an individual or cluster-wide basis--to perform whatever role is required of them. Of course, it is quite possible that different hardware configurations will be used for different hosts in the cluster, depending on the roles they are expected to play.
Multi-host deployments may incorporate multiple hosts running the same HTTP, WebDAV, or XDBC server applications, allowing queries to be distributed between those hosts for improved performance. MarkLogic Server transparently manages the distribution of content storage across forests within databases and the distribution of query execution across the hosts managing those forests. However, MarkLogic Server does not manage the distribution of queries across front-end hosts configured with equivalent HTTP, WebDAV, or XDBC server applications. The multi-host deployment relies on surrounding infrastructure for this load-balancing task.
Many distributed architectures rely on a single designated 'master' to direct and manage the operations of the cluster. In MarkLogic's distributed architecture, all hosts are considered equal, regardless of the roles for which they are configured. Consequently, there is no designated 'master' in the cluster.
With no designated 'master' in the cluster, administrative operations can be conducted on any cluster host configured to offer browser-based administrative operations or to accept XQuery-based administrative directives.
All queries that need to check for content stored in that forest or retrieve content from that forest will be routed to that host for processing. If you are using local-disk failover, you can replicate a forest across multiple hosts (for details, see High Availability of Data Nodes With Failover).
MarkLogic Server Standard Edition provides a powerful workgroup or departmental content platform. However, its single server architecture can constrain its scalability and performance in several ways:
In addition, the MarkLogic distributed architecture enables significantly more cost-effective implementations. MarkLogic allows data centers to distribute workload across a number of less-expensive entry-level servers or bladed infrastructures rather than requiring an investment in large expensive SMP systems.
Just as there are different reasons to deploy a distributed architecture, there are different issues to consider in determining a distribution strategy. Each strategy provides certain benefits and involves certain other compromises. In order to determine how best to deploy MarkLogic Server to meet your requirements, you should obtain assistance from MarkLogic's professional services group.
A first-cut approach to small-scale distribution will have a number of hosts, each providing query evaluation and data service capability. This model can work well it static environments, and when the anticipated loads for each role map well.
In more dynamic environments, when the scale of the environment relative to individual host capacity is large, or when the anticipated scales of the query evaluation and data service loads differ markedly, the recommended approach is to partition the hosts into different roles. This architecture makes it easier to scale role-based capacity independently, and can also have some configuration advantages when different services operating at different scales are being offered out of the same cluster.
MarkLogic Server can be deployed on a small number of relatively larger-scale systems, or a larger number of relatively small-scale systems. Data center platform and system standards will be the primary driver for this architecture decision.
There is no requirement that all systems in a cluster be of the same scale. In fact, as outlined below, there are potential tuning advantages to be obtained from customizing systems within the cluster based on the role played by the hosts operating on that system.
In deployments architected following a distinct role strategy, it is to be expected that the data service nodes would be configured differently from the query evaluation nodes. The configuration differences will begin with storage connectivity, but will likely extend to memory configuration and possibly to processor count.
Storage that is locally attached to each host offering data service can be cost- and performance-effective, but difficult to manage from a data center viewpoint. Centralized storage options include both NAS (network attached storage) and SAN (storage area network) approaches, with their own distinct trade-offs. Consult your MarkLogic professional services consultant for detailed deployment architecture advice.
When first configuring and deploying a cluster, the best place to start is with the overall architecture and distribution strategy. With the right strategy in hand, you can rapidly develop the actual configuration parameter sets that you will need. This section includes the following topics:
The purpose of this process guide is to outline the overall workflow involved in setting up a cluster. For detailed explanations of each step below, see the Installation Guide. The following are the steps to building a cluster:
The system on which you install the first host of the cluster will be the system on which the cluster's default Security and Schema databases will reside. Consequently, do not choose a system on whose role is solely a query evaluator.
At this point, your cluster should be ready to load content and evaluate queries. See the Administrator's Guide for more detailed information on any of the configuration steps outlined above.
Adding a new host to an existing cluster is a simple task, outlined in the section Configuring an Additional Host in a Cluster in the Installation Guide. Once the new host has been added, you may need to reconfigure how hosts in your cluster are used to redistribute workload. See the Administrator's Guide for the appropriate procedures.