Disk Storage Considerations

This chapter describes how disk storage can affect the performance of MarkLogic Server and covers some of the storage options available for forests. It includes the following sections:

  • Disk Storage and MarkLogic Server
  • Fast Data Directory on Forests
  • Large Data Directory on Forests
  • HDFS, MapR-FS, and S3 Storage on Forests
  • Windows Shared Disk Registry Settings and Permissions

Disk Storage and MarkLogic Server

MarkLogic Server applications can be very disk-intensive in their system demands. It is therefore very important to size your hardware appropriately for your workload and performance requirements. The topic of disk storage performance is complicated; many factors can influence performance, including disk controllers, network latency, the speed and quality of the disks, and other storage technologies such as storage area networks (SANs) and solid-state drives (SSDs). As with most performance issues, there are price/performance trade-offs to consider.

For example, SSDs are quite expensive compared with rotating drives. Conversely, HDFS (Hadoop Distributed Filesystem) or Amazon S3 (Simple Storage Service) storage can be quite inexpensive, but might not offer all of the speed of conventional disk systems.

Fast Data Directory on Forests

In the forest configuration for each forest, you can configure a Fast Data Directory. The Fast Data Directory is designed for fast filesystems such as SSDs with built-in disk controllers. The Fast Data Directory stores the forest journals and as many stands as will fit onto the filesystem; if the forest never grows beyond the size of the Fast Data Directory, then the entire forest will be stored in that directory. If there are multiple forests on the same host that point to the same Fast Data Directory, MarkLogic Server divides the space equally between the different forests.

When the Fast Data Directory begins to approach its capacity, MarkLogic Server starts to put data in the regular Data Directory during periodic merges. By specifying a Fast Data Directory, you can get much of the advantage of fast disk hardware while only buying a relatively small SSD (or other fast disk system). For example, consider a scenario where you have an 8-core MarkLogic Server d-host that is hosting 4 forests. If you have a good-quality, commodity server-class rotating disk system with many magnetic disk spindles (for example, 6 disks in some RAID configuration) providing 2 terabytes of storage, and a 250-gigabyte SSD (for example, a PCI I/O accelerator card) for the Fast Data Directory, you can get a significant portion of the benefit of SSD storage while keeping the cost down (because the rotating storage is several times less expensive than the SSD storage). In this scenario, each of the 4 forests could use up to 1/4 of the size of the SSD, or about 62.5 GB. Once a forest grows close to that limit, the Data Directory on the rotating storage is used.
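
A minimal sketch of creating such a forest through the Admin API follows; the forest name and paths are hypothetical, and it assumes the optional large and fast data directory arguments that admin:forest-create accepts in recent MarkLogic releases:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Hypothetical names and paths; assumes admin:forest-create accepts
   optional large-data-directory and fast-data-directory arguments. :)
let $config := admin:get-configuration()
let $config := admin:forest-create(
  $config,
  "example-forest",     (: hypothetical forest name :)
  xdmp:host(),          (: create on the current host :)
  "/data/rotating",     (: regular data directory :)
  (),                   (: no separate large data directory :)
  "/ssd/fast")          (: fast data directory on the SSD :)
return admin:save-configuration($config)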

Large Data Directory on Forests

Just as you might want a different class of disk for the Fast Data Directory, you might also want a different class of disk for the Large Data Directory. The Large Data Directory stores binary documents that are larger than the Large Size Threshold specified in the database configuration. This filesystem is typically very large, and it may use a different class of disk than your regular filesystem (or it may simply be on a different set of the same disks). For more details about binary documents, see Working With Binary Documents in the Application Developer's Guide.
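
As a hedged sketch, the threshold itself can be set programmatically; the function name below follows the Admin API's database-setting naming convention (assumed here), and the 2048 KB value is illustrative:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Assumed function name per the Admin API convention; the threshold
   is in kilobytes, and 2048 is an illustrative value. :)
let $config := admin:get-configuration()
let $config := admin:database-set-large-size-threshold(
  $config, xdmp:database("Documents"), 2048)
return admin:save-configuration($config)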

HDFS, MapR-FS, and S3 Storage on Forests

HDFS (Hadoop Distributed Filesystem) and Amazon S3 (Simple Storage Service) storage represent two approaches to large distributed filesystems, and you can use both HDFS and S3 to store MarkLogic forest data. This section describes considerations for using HDFS and S3 to store forest data in MarkLogic and contains the following topics:

  • HDFS Storage
  • S3 Storage

Both HDFS and S3 can be very useful when implementing a tiered storage solution. For details on tiered storage, see Tiered Storage in the Administrator's Guide.

HDFS Storage

HDFS is a storage solution that uses Hadoop to manage a distributed filesystem. Hadoop has tools to specify how many copies of each file are replicated on how many different servers. HDFS gives you a high degree of control over your filesystem, as you can choose the disks to use, the computers to use, as well as configuration settings such as number of copies to replicate. MarkLogic can use Kerberos Secured HDFS as a file system on Linux platforms, as described in Kerberos Authentication for Secured HDFS in the Security Guide.

HDFS storage is supported with MarkLogic on the following HDFS platforms:

  • Cloudera CDH version 5.8
  • Hortonworks HDP version 2.6

Internally, MarkLogic Server uses JNI to access HDFS. When you specify an HDFS path for one of the data directories, MarkLogic will write the forest data directly to HDFS according to the path specification.

When you set up an HDFS path as a forest directory, the path must be readable and writable by the user under which the MarkLogic Server process runs.

Because you can set up HDFS as a very large shared filesystem, it can be a good destination not only for forest data but also for database backups.

An HDFS path is of the following form:

hdfs://<machine-name>:<port>/directory

so the following path would be to an HDFS filesystem accessed on a machine named raymond.marklogic.com on port 12345:

hdfs://raymond.marklogic.com:12345/directory
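
As a sketch, you could point a new forest's data directory at that path through the Admin API (the forest name is hypothetical):

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Sketch: a new forest whose data directory is the HDFS path above.
   "hdfs-forest" is a hypothetical name. :)
let $config := admin:get-configuration()
let $config := admin:forest-create(
  $config, "hdfs-forest", xdmp:host(),
  "hdfs://raymond.marklogic.com:12345/directory")
return admin:save-configuration($config)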

Each MarkLogic host that uses HDFS for forest storage requires access to the following:

  • The Oracle/Sun Java JDK (or an Oracle/Sun JRE that includes JNI)
  • Hadoop HDFS client JAR files
  • Your Hadoop HDFS configuration files

The following HDFS configuration property settings are required (a sample hdfs-site.xml follows the list):

  • dfs.support.append: true. This is the default value.
  • dfs.namenode.accesstime.precision: 1
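
In Hadoop, these properties typically live in hdfs-site.xml; a minimal sketch:

<!-- hdfs-site.xml (sketch): the two required properties listed above -->
<configuration>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.accesstime.precision</name>
    <value>1</value>
  </property>
</configuration>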

The remainder of this section describes how to configure your hosts so that MarkLogic can find these components.

For details on the supported Java versions and how MarkLogic locates a JRE, see Java Virtual Machine Requirements in the Installation Guide.

Though MarkLogic does not ship with HDFS client libraries, you can download client library bundles from http://developer.marklogic.com/products/hadoop.

Follow this procedure to make the bundled libraries and configuration files available to MarkLogic Server. You must follow this procedure on each MarkLogic host that uses HDFS for forest storage.

  1. Download the Hadoop client bundle that corresponds to your Hadoop distribution from http://developer.marklogic.com/products/hadoop.
  2. Unpack the bundle to one of the following locations: /usr, /opt, or /space. For example, if you download the HDP bundle for MarkLogic 9.0-7 to /opt, then the following commands unpack the bundle to /opt.
    cd /opt
    gunzip hadoop-hdfs-hdp-9.0-7.tar.gz
    tar xf hadoop-hdfs-hdp-9.0-7.tar

    The bundle unpacks to a directory named hadoop, so the above commands create /opt/hadoop/. The version portion of your bundle download filename may differ.

  3. Make your Hadoop HDFS configuration files available under /etc/hadoop/conf/. You must include at least your log4j.properties configuration file in this location.
  4. Ensure the libraries and config files are readable by MarkLogic.

For more information on Hadoop and HDFS, see the Apache Hadoop documentation.

S3 Storage

S3 is a cloud-based storage solution from Amazon. S3 behaves like a filesystem, but you access it over HTTP. MarkLogic Server uses HTTP to access S3; you can put an S3 path into any of the data directory specifications on a forest, and MarkLogic will then write to S3 for that directory. For more details about Amazon S3, see the Amazon web site at http://aws.amazon.com/s3/. This section describes S3 usage in MarkLogic and includes the following parts:

  • S3 and MarkLogic
  • Entering Your S3 Credentials for a MarkLogic Cluster

S3 and MarkLogic

Storage on S3 has an eventual-consistency property: write operations might not be immediately available for reading, but they become available at some point. Because of this, S3 data directories in MarkLogic have a restriction that MarkLogic does not create journals on S3. Therefore, MarkLogic recommends using S3 only for backups and for read-only forests; otherwise, you risk data loss. If your forests are read-only, there is no need for journals.
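
As a sketch, you can flag a forest read-only through the Admin API before pointing it at S3; the forest name is hypothetical:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Mark a forest read-only so it needs no journals on S3.
   "s3-forest" is a hypothetical forest name. :)
let $config := admin:get-configuration()
let $config := admin:forest-set-updates-allowed(
  $config, xdmp:forest("s3-forest"), "read-only")
return admin:save-configuration($config)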

When you set up an S3 path as a forest directory, the path must be readable and writable by the user under which the MarkLogic Server process runs. Typically, this means you must set Upload/Delete, View Permissions, and Edit Permissions on the AWS S3 bucket. This is true both for forest paths and for backup paths.

Because S3 is a very large shared filesystem, it can be a good destination not only for forest data but also for database backups.

To specify an S3 path in MarkLogic, use a URL of the following form:

s3://<bucket-name>/<path-to-location>

so the following path would be to an S3 filesystem with a bucket named my-bucket and a path named my-directory:

s3://my-bucket/my-directory

Amazon has other ways to set up S3 URLs, but use the form above to specify the S3 paths in MarkLogic; for more information on S3, see the Amazon documentation.
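
For example, a database backup could be directed at the bucket above; this sketch reuses the illustrative bucket and path from this section and assumes the cluster's AWS credentials are already configured:

xquery version "1.0-ml";
(: Sketch: back up every forest of the Documents database to the
   example S3 path; assumes cluster AWS credentials are configured. :)
xdmp:database-backup(
  xdmp:database-forests(xdmp:database("Documents")),
  "s3://my-bucket/my-directory")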

Entering Your S3 Credentials for a MarkLogic Cluster

S3 requires authentication with the following S3 credentials:

  • AWS Access Key
  • AWS Secret Key

The S3 credentials for a MarkLogic cluster are stored in the security database for the cluster, and you can have only one set of S3 credentials per cluster. Once you set up security access in S3, you can access any paths that those credentials allow. Because of the flexibility of how you can set up access in S3, any S3 account can grant access to any other account; so if you want the credentials set up in MarkLogic to access S3 paths owned by other S3 users, those users must grant access to those paths to the AWS Access Key set up in your MarkLogic cluster.

To set up the AWS credentials for a cluster, enter the keys in the Admin Interface under Security > Credentials. You can also set up the keys programmatically with the Security API, as sketched below.
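
A minimal sketch, assuming the Security API function sec:credentials-set-aws (name and signature assumed here) and that the code is evaluated against the Security database:

xquery version "1.0-ml";
import module namespace sec = "http://marklogic.com/xdmp/security"
    at "/MarkLogic/security.xqy";

(: Assumed function name and signature; run this against the
   Security database. The key values are placeholders. :)
sec:credentials-set-aws("MY-ACCESS-KEY", "MY-SECRET-KEY")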

The credentials are stored in the Security database. Therefore, you cannot use S3 as the forest storage for a security database.

Windows Shared Disk Registry Settings and Permissions

If you are using remote machine file paths on Windows (for example, a path like \\machine-name\dir, where machine-name is the name of the host and dir is the path it exposes as a share), you must set the following registry settings to ZERO, as shown in https://technet.microsoft.com/en-us/library/ff686200.aspx:

  • FileInfoCacheLifetime
  • FileNotFoundCacheLifetime
  • DirectoryCacheLifetime

These DWORD registry key settings are under the following registry path:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanworkstation\Parameters

Additionally, the directory path must have read and write permissions for the SYSTEM user, or for whichever user MarkLogic.exe runs as.
