MarkLogic Server applications can be very disk-intensive in their system demands. It is therefore very important to size your hardware appropriately for your workload and performance requirements. Disk storage performance is a complicated topic; many factors can influence performance, including disk controllers, network latency, the speed and quality of the disks, and other storage technologies such as storage area networks (SANs) and solid-state drives (SSDs). As with most performance issues, there are price/performance trade-offs to consider.
For example, SSDs are quite expensive compared with rotating drives. Conversely, HDFS (Hadoop Distributed Filesystem) or Amazon S3 (Simple Storage Service) storage can be quite inexpensive, but might not offer all of the speed of conventional disk systems.
In the forest configuration for each forest, you can configure a Fast Data Directory. The Fast Data Directory is designed for fast filesystems such as SSDs with built-in disk controllers. The Fast Data Directory stores the forest journals and as many stands as will fit onto the filesystem; if the forest never grows beyond the size of the Fast Data Directory, then the entire forest will be stored in that directory. If there are multiple forests on the same host that point to the same Fast Data Directory, MarkLogic Server divides the space equally between the different forests.
When the Fast Data Directory begins to approach its capacity, MarkLogic Server starts putting data in the regular Data Directory during periodic merges. By specifying a Fast Data Directory, you can get much of the advantage of fast disk hardware while buying only a relatively small SSD (or other fast disk system). For example, consider a scenario where you have an 8-core MarkLogic Server d-host that is hosting 4 forests. If you have a good-quality, server-class rotating disk system with many magnetic spindles (for example, 6 disks in some RAID configuration) providing 2 terabytes of storage, and a 250 gigabyte SSD (for example, a PCI I/O accelerator card) for the fast data directory, then you get a significant amount of the benefit of SSD storage while keeping the cost down (because the rotating storage is several times less expensive than the SSD storage). In this scenario, each of the 4 forests could use up to 1/4 of the size of the SSD, or about 62.5 GB. Once a forest grows close to that limit, the Data Directory on the rotating storage is used.
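The per-forest share in this scenario is simple to check (numbers taken from the example above: a 250 GB SSD shared equally by 4 forests):

```python
# Per-forest share of a shared fast data directory.
# MarkLogic divides the fast data directory equally among the
# forests on the host that point to it.
fast_data_gb = 250   # size of the SSD fast data directory
forests = 4          # forests on the host sharing it

per_forest_gb = fast_data_gb / forests
print(per_forest_gb)  # 62.5
```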
Just as you might want a different class of disk for the Fast Data Directory, you might also want a different class of disk for the Large Data Directory. The Large Data Directory stores binary documents that are larger than the Large Size Threshold specified in the database configuration. This filesystem is typically very large, and it may use a different class of disk than your regular filesystem (or it may just be on a different set of the same disks). For more details about binary documents, see Working With Binary Documents in the Application Developer's Guide.
HDFS (Hadoop Distributed Filesystem) and Amazon S3 (Simple Storage Service) storage represent two approaches to large distributed filesystems, and it is possible to use both HDFS and S3 to store MarkLogic forest data. This section describes considerations for using HDFS and S3 for storing forest data in MarkLogic and contains the following topics:
HDFS is a storage solution that uses Hadoop to manage a distributed filesystem. Hadoop has tools to specify how many copies of each file are replicated on how many different servers. HDFS gives you a high degree of control over your filesystem, as you can choose the disks to use, the computers to use, as well as configuration settings such as number of copies to replicate.
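For example, the number of copies HDFS keeps of each block is controlled by the dfs.replication property in hdfs-site.xml (a minimal sketch; the value 3 is only illustrative, and your cluster may require additional properties):

```
<!-- hdfs-site.xml: replicate each block to 3 datanodes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```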
Internally, MarkLogic Server uses JNI to access HDFS. When you specify an HDFS path for one of the data directories, MarkLogic will write the forest data directly to HDFS according to the path specification.
By default, MarkLogic looks for Java in the location specified via the JAVA_HOME environment variable or in a specific set of default locations. If JAVA_HOME is not set in the startup environment, then MarkLogic uses the first JRE or JDK found in one of the following locations. These locations are searched in the order listed.
- /usr/java/jdk1.N.0*, where N is a supported Java version. For example, /usr/java/jdk1.7.0_79 qualifies if Java 7 is a supported Java version.
- /usr/lib/jvm/jre-1.N.0-*.x86_64, where N is a supported Java version, such as Java 7.
Though MarkLogic does not ship with HDFS client libraries, you can download client library bundles from http://developer.marklogic.com/products/hadoop. Unpack the bundle to /space; for example, if you download the HDP version 2.2 bundle to /opt, unpack it from there to /space. Make your Hadoop HDFS configuration files available in /etc/hadoop/conf/. You must include at least your log4j.properties configuration file in this location.
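The unpack-and-configure steps can be sketched as follows. The bundle file name and its internal layout are assumptions, and temporary directories stand in for /opt, /space, and /etc/hadoop/conf so the sketch is self-contained and runnable:

```python
# Sketch: unpack an HDFS client bundle and stage the Hadoop config.
import pathlib
import tarfile
import tempfile

opt = pathlib.Path(tempfile.mkdtemp())    # stands in for /opt
space = pathlib.Path(tempfile.mkdtemp())  # stands in for /space
conf = pathlib.Path(tempfile.mkdtemp())   # stands in for /etc/hadoop/conf

# Create a stand-in bundle so the example runs end to end.
# (In practice you download this file from developer.marklogic.com.)
bundle_root = opt / "hadoop-hdfs-hdp-2.2"
(bundle_root / "lib").mkdir(parents=True)
with tarfile.open(opt / "hadoop-hdfs-hdp-2.2.tar.gz", "w:gz") as tar:
    tar.add(bundle_root, arcname=bundle_root.name)

# Unpack under /space
# (in practice: cd /space && tar xzf /opt/hadoop-hdfs-hdp-2.2.tar.gz).
with tarfile.open(opt / "hadoop-hdfs-hdp-2.2.tar.gz") as tar:
    tar.extractall(space)

# MarkLogic reads the Hadoop configuration from /etc/hadoop/conf/;
# at a minimum, log4j.properties must be present there.
(conf / "log4j.properties").write_text("log4j.rootLogger=INFO\n")

print(sorted(p.name for p in space.iterdir()))  # ['hadoop-hdfs-hdp-2.2']
```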
For more information on Hadoop and HDFS, see the Apache Hadoop documentation.
S3 is a cloud-based storage solution from Amazon. S3 is like a filesystem, but you access it via HTTP. MarkLogic Server uses HTTP to access S3, and you can put an S3 path into any of the data directory specifications on a forest, and MarkLogic will then write to S3 for that directory. For more details about Amazon S3, see the Amazon web site http://aws.amazon.com/s3/. This section describes S3 usage in MarkLogic and includes the following parts:
Storage on S3 has an 'eventual consistency' property, meaning that write operations might not be immediately available for reading, but they become available at some point. Because of this, MarkLogic does not create journals on S3 data directories. Therefore, MarkLogic recommends that you use S3 only for backups and for read-only forests; otherwise, you risk data loss. If your forests are read-only, then there is no need for journals.
When you set up an S3 path as a forest directory, the path must be readable and writable by the user under which the MarkLogic Server process runs. Typically, this means you must set Upload/Delete, View Permissions, and Edit Permissions on the AWS S3 bucket. This is true both for forest paths and for backup paths.
The S3 credentials for a MarkLogic cluster are stored in the security database for the cluster. You can have only one set of S3 credentials per cluster. Once you set up security access in S3, you can access any paths that those credentials allow. Because S3 access can be configured very flexibly (any S3 account can grant access to any other account), if you want the credentials set up in MarkLogic to access S3 paths owned by other S3 users, those users must grant access to those paths to the AWS Access Key set up in your MarkLogic cluster.
To set up the AWS credentials for a cluster, enter the keys in the Admin Interface under Security > Credentials. You can also set up the keys programmatically using the following Security API functions:
If you are using remote machine file paths on Windows (for example, a UNC path of the form \\machine-name\dir, where machine-name is the name of the host and dir is the path it exposes as a share), you must set the following registry settings to zero, as shown in https://technet.microsoft.com/en-us/library/ff686200.aspx: