Loading TOC...
MarkLogic Server on Amazon EC2 Guide (PDF)

MarkLogic Server on Amazon EC2 Guide — Chapter 1

Overview of MarkLogic Server on EC2

This chapter provides an overview of MarkLogic Server on Amazon Elastic Compute Cloud (EC2) using a MarkLogic Amazon Machine Image (AMI), as well as how to create an Amazon EC2 account and order a MarkLogic Server for EC2 AMI. This chapter includes the following sections:

For more detailed information on Amazon EC2, see the Amazon documentation located at the following URL:


STOP: Before you do Anything!

There are multiple ways to launch a MarkLogic AMI to create a MarkLogic cluster or a single MarkLogic instance in the AWS environment. However, before you explore any alternatives, it is recommended that you first launch your MarkLogic AMI using a CloudFormation template and follow the procedures described in this guide. For details on how to launch a MarkLogic AMI using a CloudFormation template, see Deploying MarkLogic on EC2 Using CloudFormation. The MarkLogic CloudFormation templates are available from http://developer.marklogic.com/products/aws.

MarkLogic now supports the 1-Click Launch option in AWS Marketplace. Because of this, the published MarkLogic AMIs will have data volume predefined.

Should you later choose not to launch your MarkLogic AMI by means of a CloudFormation template, you will not have automatic access to the Managed Cluster features described in The Managed Cluster Feature. You can still launch an AMI, but you will need to follow the steps outlined in Launching a MarkLogic AMI outside of CloudFormation.

Understanding MarkLogic Server for EC2

MarkLogic provides pre-packaged AMIs containing Amazon Linux and MarkLogic Server. MarkLogic has included scripts on these AMIs that simplify the steps necessary to get your MarkLogic Server instances up and running.

This section describes:

Amazon EC2 Terminology

The following are the definitions for the terms used in this guide:

Elastic Compute Cloud (EC2) is a web service that enables you to launch and manage server instances in Amazon's data centers using APIs or available tools and utilities. The Amazon EC2 website is available at: http://aws.amazon.com/ec2/.

CentOS is a community-supported, free and open source operating system based on Red Hat Enterprise Linux.

An Elastic Load Balancer (ELB) is a service that automatically distributes and balances application traffic among multiple EC2 instances. For details, see http://docs.aws.amazon.com/gettingstarted/latest/wah/getting-started-create-lb.html.

Amazon Machine Image (AMI) is an encrypted machine image that contains all information necessary to boot instances of software. Instances of MarkLogic Server are created from the stock Amazon Linux AMI and have been pre-installed with MarkLogic and the necessary dependancies.

Type Pricing
Production Per-hour EC2 premium charged
Bring Your Own License (BYOL) No additional charge

Elastic Block Store (EBS) is a type of storage designed specifically for Amazon EC2 instances. Amazon EBS allows you to create volumes that can be mounted as devices by Amazon EC2 instances. Amazon EBS volumes behave like raw unformatted external block devices. They are attached to user-specified block devices and provide a block device interface. You can load a file system on top of Amazon EBS volumes, or use them just as you would use a block device. Amazon EBS volumes exist separately from the actual instances and persist until you delete them. This allows you to store your data without leaving an Amazon EC2 instance running. Each Amazon EBS volume can be up to one TiB in size.

An Instance is the running system after an AMI is launched. Instances remain running unless they fail or are terminated. When this happens, the data on the instance is no longer available. Once launched, an instance looks very much like a traditional host.

Amazon Web Services (AWS) is the Amazon Cloud Computing service. For details, see http://aws.amazon.com/.

AWS Cloud Storage (S3) is an Amazon web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. For details, see Configuring MarkLogic for Amazon Simple Storage Service (S3) and https://aws.amazon.com/s3/.

Cloud Formation (CF) is the AWS Cloud Formation service for provisioning startup of AWS resources. For details, see Deploying MarkLogic on EC2 Using CloudFormation and http://aws.amazon.com/cloudformation/. The MarkLogic CloudFormation templates are available from http://developer.marklogic.com/products/aws.

Managed Clusters is a MarkLogic feature that works with AWS features to automatically create and provision the necessary AWS resources and provide MarkLogic with the information needed to manage your cluster. For details, see The Managed Cluster Feature.

MarketPlace is the AWS service for publishing pay-per-use and free (no extra charge) public AMI's on amazon. For details, see https://aws.amazon.com/marketplace.

An Instance Type defines the size of an Amazon EC2 instance. The MarkLogic Server instance types are shown in the table at the end of 5 in Creating a CloudFormation Stack using the AWS Console.

An Instance Store (sometimes referred to as Ephemeral Storage) is a fixed amount of storage space for an instance. An instance store is not designed to be a permanent storage solution. If an instance reboots, either intentionally or unintentionally, the data on the instance store will survive. If the underlying drive fails or the instance is terminated, the data will be lost.

An EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

Metadata Database is the database that stores and indexes all of the configuration data required to manage a cluster of one or more MarkLogic Servers. For AWS, the DynamoDB service is used to implement the Metadata Database. For details, see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStartedDynamoDB.html.

The Managed Cluster Feature

Running MarkLogic Server in AWS has some challenges that you may not experience in traditional IT data centers. The Managed Clusters feature helps you mitigate these challenges with support for reliability, scalability and high availability, as well as with tools that automatically handle some of the more problematic issues. It is highly recommended that you use the CloudFormation templates and follow the procedures in this guide to launch your MarkLogic AMI in AWS as they are provided to help you leverage AWS and MarkLogic features especially designed for a reliable and easy cloud deployment.

On AWS, the following should be considered before deploying MarkLogic nodes or clusters:

  • Instance Hostnames -- An EC2 instance starts life with a unique hostname. If this instance is stopped for any reason (such as a hardware/software failure or manual stopping to avoid EC2 charges), when it restarts it will start with a new hostname. This causes configuration problems, especially in clusters, but also in single-node installations. If you stop a cluster then bring it back online without reconfiguring it, the nodes will not rejoin the cluster because all the hostnames are different.

    The Managed Cluster feature automatically detects hostname changes and propagates the changes throughout the cluster.

  • Transient Data -- When an EC2 instance is terminated, the root volume is released and all of the data on it is lost. This includes any pre-installed software or OS configurations. MarkLogic installations should not be put on the root volume, but rather on an attached EBS volume.

    The Managed Cluster feature automatically keeps track of each EBS volume, along with its related EC2 instance and mount directory. When you restart your EC2 instances, the Managed Cluster feature automatically re-attached and mounts your volumes to the appropriate locations to ensure that your Forests and Databases are intact.

  • AWS Security -- If you want to access any AWS services from within an EC2 node you need authentication. One way to do this is to provide your AWS credentials and store them on the EC2 instance. However, this is both inconvenient and a security risk. The Managed Cluster feature and CloudFormation templates use IAM Roles so that you never need to expose your AWS Credentials.
  • Securing your Infrastructure -- EC2 instances by default are open to the world. You need to consider your security requirements before launching an EC2 instance and putting sensitive data on it. The Cloud Formation Templates create baseline security groups to allow easy startup. You can edit these security groups to your particular security needs.
  • Volatile Instances -- EC2 instances can fail at any time. This is true of on-premise hardware as well, but in the cloud you are not there to fix this yourself. Instead you need to rely on AWS features to restart failed instances to keep your database up and running at all times. The CloudFormation templates create a suggested topology with AutoScaling groups across zones as needed that distribute your cluster and restart unhealthy instances automatically. The Managed Cluster support makes sure that restarted instances rejoin the cluster.
  • Datacenter Failures -- Datacenters can and do fail. Often do to unpredictable issues, like hurricanes or earthquakes. If you want high availability you need to plan for failure by distributing your cluster across datacenters, so that any single failure does not take your cluster offline. The CloudFormation templates create a suggested topology with AutoScaling groups distributed across zones in a single region for maximum fault tolerance.
  • Load Balancing and Routing -- Your MarkLogic cluster should only be accessible when it is healthy. Use of AWS ELBs is highly recommended even for single nodes, but especially when running a cluster. The ELBs not only balance traffic across healthy nodes, but automatically notify the AutoScaling groups when a node is unhealthy, even if the hardware and OS are running fine, so that traffic is diverted to healthy nodes and the unhealthy nodes are terminated and restarted.

    The Managed Cluster feature provides a Health Check application on each server that is used by the Load Balancer to detect if your MarkLogic instance is healthy.

    If you are using a XCC client (such as mlcp) with MarkLogic running on AWS, you must enable the xcc.httpcompliant setting to work with AWS ELBs. For details, see Using a Load Balancer or Proxy Server with an XCC Application in the XCC Developer's Guide.

  • Scaling Up and Down -- Sometimes you want more capacity sometimes you want less. Cloud Computing provides the raw tools to enable scaling up and down, but your software needs to understand and integrate with provisioning changes. The CloudFormation templates allow an easy one-step process for scaling your cluster capacity up and down.
  • License Application on a Cluster -- When launching a cluster, you traditionally need to apply your license to each node manually after they come up for the first time. The Managed Cluster feature automatically applies your license keys to all of the nodes in the cluster.
  • Cluster Formation -- When running a MarkLogic cluster, you need to configure the first node with your Administrator credentials, then configure remaining nodes to connect to the cluster. The Managed Cluster feature automatically handles this for you so clusters (even clusters of one) come up ready to run without further manual intervention.
  • Pausing Clusters -- A great benefit of Cloud Computing is that you only pay for what you use. If you are running a development server, or a site that doesn't need to be running 24/7, you can pause your entire cluster so you don't incur EC2 charges while it is not running. The Managed Cluster feature along with CloudFormation enable you to quickly restart the cluster and have your resources re-attach to all of your data, so that your cluster will be up and running where left off.

CloudFormation is not required to make use of the Managed Clusters feature, instead you can choose to manually or programmatically configure the AWS resources using other tools, but it is a challenging task without strong cloud orchestration and management tools. CloudFormation allows you to both document and implement a managed cluster configuration using a simple declarative template that can grow with your needs.

Running MarkLogic without the Managed Clusters feature is also supported (with or without our provided AMI's) and is the simplest configuration. However it is also the least reliable and is not recommended.

Launching a MarkLogic AMI outside of CloudFormation

This section describes the minimum steps you need to take should you insist on running MarkLogic without either CloudFormation or your own or third party cloud management tools. These steps do not enable the Managed Cluster feature and are not recommended.

  • Launch an EC2 AMI, either one we provide in MarketPlace or your own AMI on which you have pre-installed MarkLogic, following the OS-specific installation instructions.
  • Create a Key Pair, as described in Creating a Key Pair. You will need the Key Pair name to launch the instance.
  • Create a Security Group that opens at port 22 (for ssh) and ports (8000-8002) for the Admin API. You may also need to configure the Security Group to open additional ports for your own applications or for foreign clusters.
  • Create an EBS volume for your data and attach it to the logical device, /dev/sdf. On startup, MarkLogic will detect this volume (or wait for it to be attached) then create a filesystem and mount it as /var/opt/MarkLogic. The MarkLogic AMI will have the data volume pre-defined and it must be associated with /dev/sdf.

You may now go to the admin page on port 8001 and continue configuring your MarkLogic instance.

HealthCheck App Server

The Elastic Load Balancer (ELB) periodically sends a heartbeat to each of its instances to monitor their health. Each instance of MarkLogic Server has a HealthCheck app server on port 7997. The ELB cannot be configured with authentication, so the URL for the HealthCheck App Server does not require authentication.

Typical Architecture

This section describes some of the typical configurations of a MarkLogic cluster in an EC2 environment.

As described in the Scalability, Availability, and Failover Guide, Evaluator Nodes (E-Nodes) perform data processing operations including aggregates, computations (including user defined functions). Data Nodes (D-Nodes) manage the forest data operations. E-Nodes can be grouped separately from D-Nodes in a security group, which might be preferable for some deployments.

End-user or app-level queries should to be routed to the E-Nodes through a load balancer. You can add E-Nodes to scale up a cluster to handle more queries, more users and more computation.

To ensure high availability, place D-Nodes in different availability zones in the same region and configure them for local-disk failover to ensure that each transaction is written to one or more replicas. In EC2, the latency between zones in same region is low (approximately two milliseconds). For optimum availability, D-Nodes and E-Nodes should be split evenly between two availability zones. For disaster recovery, you can place D-Nodes in different regions and use database replication between the D-Nodes in each region.

The recommended storage resources are EBS volumes for forests and S3 for backups. All volumes should be formatted with 16K blocks. This is optimized for MarkLogic's large sequential IO profile and also aligns to Amazon's pIOPS implementation. Each configured forest on MarkLogic requires a minimum of 20mb/sec. 20mb/sec with 16K blocks is 1280 IOPS. Each instance of MarkLogic Server should be configured with a maximum of 5 Hi-IO volumes/forests. Additional EBS for boot and low-IO forests, such as those used by the Security and Schema databases, can be added. Once the forests are on provisioned IOPS volumes, they can be migrated during maintenance periods or via scripted migration while running using the replicas.

The hi1.4xlarge instance offers the fastest performance. MarkLogic can take advantage of SSD volumes to accelerate all operations. However, because SSD volumes are ephemeral, database replication is required. Alternatively, m2.4xlarge offers a large amount of RAM and relatively high IO along with 1GB/sec IO guaranteed. m1.xlarge also offers that IO but with only 15GB/RAM, much less cache.

The illustration below shows a possible architecture, where there are two clusters on different zones in the same region. The D-nodes in each cluster are configured for local disk failover. Forest data is stored in IOPS EBS volumes and backups and snapshots of EBS volumes are stored in S3.

The illustration below shows a possible architecture, where two clusters are configured with database replication and deployed in different regions. This type of architecture works with Amazon Route 53, which is a scalable Domain Name System (DNS) web service that provides secure and reliable routing of user traffic to your MarkLogic cluster. Should the Master cluster fail, Route 53 can redirect traffic to the Replica cluster in seconds. The state of the Replica will lag somewhat from the Master (5-20 seconds is typical).

For details on Amazon Route 53, see http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html.

« Table of contents
Next chapter »