This chapter describes forests in MarkLogic Server and how to use the Admin Interface to manage them. For details on how to manage forests programmatically, see Creating and Configuring Forests and Databases and Database Maintenance Operations in the Scripting Administrative Tasks Guide.
A forest is a collection of XML, JSON, text, or binary documents. Forests are created on hosts and attached to databases to appear as a contiguous set of content for query purposes. A forest can only be attached to one database at a time. You cannot load data into a forest that is not attached to a database.
A forest contains in-memory and on-disk structures called stands. Each stand is composed of XML, JSON, binary, and/or text fragments, plus index information associated with the fragments. When fragmentation rules are in place, XML documents may span multiple stands. MarkLogic Server periodically merges multiple stands into a single stand to optimize performance. See Understanding and Controlling Database Merges for details on merges.
A forest also contains a separate on-disk Large Data Directory for storing large objects such as large binary documents. MarkLogic Server stores large objects separately to optimize memory usage, disk usage, and merge time. A small object is stored directly in a stand as a fragment. A large object is stored in a stand as a small reference fragment, with the full content stored in the Large Data Directory. The size threshold for storing objects in the Large Data Directory and the location of the Large Data Directory are configurable through the Admin Interface and Admin API. For details, see Working With Binary Documents in the Application Developer's Guide.
By default, the operations allowed on a forest are: read, insert, update, and delete. You can control which operations are allowed on a forest by setting the following update types:
Update Type | Description |
---|---|
all | Read, insert, update, and delete operations are allowed on the forest. |
delete-only | Read and delete operations are allowed on the forest, but insert and update operations are not allowed unless a forest ID is specified, in which case the document is moved to another forest. If you do not specify a forest ID when updating a document in a delete-only forest, the update throws an exception. This update type is useful when you want to eliminate the overhead imposed by the merge operation, but still allow transactions to delete data from the forest. See Making a Forest Delete-Only for details. |
read-only | Read operations are allowed on the forest, but insert, update, and delete operations are not allowed. A transaction attempting to make changes to fragments in the forest will throw an exception. This update type is useful when you want to put your forests on read-only media and allow them to be queried. See Making a Forest Read-Only for details. |
flash-backup | This type puts the forest in read-only mode without throwing exceptions on insert, update, or delete transactions, allowing the transactions to retry. This update type is useful when you want to temporarily quiesce a forest or to disable changes to the forest data when doing a flash backup of the forest. See Making a Forest Read-Only for details. |
To make the entire database read-only, set all of the forests in the database to read-only.
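As a minimal sketch of how this could be scripted rather than done through the Admin Interface, the following XQuery sets every forest attached to a database to read-only using the Admin API. The database name "Documents" and the helper function `local:set-updates-allowed` are assumptions made for illustration. Changing the update type restarts each affected forest.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Hypothetical helper: applies one update type to each forest in turn,
   threading the configuration through every call. :)
declare function local:set-updates-allowed(
  $config as element(),
  $forest-ids as xs:unsignedLong*,
  $value as xs:string
) as element()
{
  if (fn:empty($forest-ids)) then $config
  else local:set-updates-allowed(
    admin:forest-set-updates-allowed($config, $forest-ids[1], $value),
    fn:subsequence($forest-ids, 2),
    $value)
};

(: "Documents" is an assumed database name. :)
let $forest-ids := xdmp:database-forests(xdmp:database("Documents"))
let $config := local:set-updates-allowed(
  admin:get-configuration(), $forest-ids, "read-only")
return admin:save-configuration($config)
```

Passing "all" instead of "read-only" would return the forests to the default update type.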
To create a new forest, complete the following procedure:
The name of the forest is used by the system as a directory name. Therefore, the forest name must be a legal directory name and cannot contain any of the following 9 characters: \ * ? / : < > | ". Additionally, the name cannot begin or end with a space or a dot (.).

MarkLogic recommends that you use an absolute path if you specify a data directory. If you do not specify an absolute path for the data directory, your forest will be created in the default data directory.
The directory you specify can be an operating system mounted directory path, an HDFS path, or an S3 path. For details on using HDFS and S3 storage in MarkLogic, see Disk Storage Considerations in the Query Performance and Tuning Guide.
The data directory is either a fully-qualified pathname or is relative to the Forests directory, which is set at installation time based on the directory in which MarkLogic Server is installed. The following table shows the default location of the Forests directory for each platform:
Platform | Forests Directory |
---|---|
Microsoft Windows | C:\Program Files\MarkLogic\Data\Forests |
Red Hat Linux | /var/opt/MarkLogic/Forests |
Mac OS X | |
The Read-Only update types described in Making a Forest Read-Only can be set in the Configure page of an existing forest.
Creating a forest is a hot admin task; the changes take effect immediately. However, toggling between update types restarts the forest.
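The creation procedure and the naming rules above can also be sketched with the Admin API. The following is an illustration under stated assumptions, not the documented procedure: the forest name "example-forest", the helper `local:valid-forest-name`, and the error name are all hypothetical. The helper only mirrors the naming rules stated in this section; MarkLogic Server performs its own validation.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Hypothetical helper: true when $name follows the naming rules above. :)
declare function local:valid-forest-name($name as xs:string) as xs:boolean
{
  (: reject any of the nine reserved characters :)
  fn:not(fn:matches($name, '[\\*?/:<>|"]'))
  (: reject a leading or trailing space or dot :)
  and fn:not(fn:matches($name, '^[ .]|[ .]$'))
};

(: "example-forest" is an assumed forest name. :)
let $name := "example-forest"
return
  if (local:valid-forest-name($name)) then
    (: The empty sequence for the data directory means the forest is
       created in the default data directory on the current host. :)
    admin:save-configuration(
      admin:forest-create(admin:get-configuration(), $name, xdmp:host(), ()))
  else
    fn:error(xs:QName("INVALID-FOREST-NAME"),
      fn:concat("Not a legal forest name: ", $name))
```

With this helper, a name such as "bad:name" or ".hidden" is rejected before any configuration change is attempted.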
You can configure a forest to only allow read and delete operations, disallowing inserts and updates to any documents stored in the forest. A delete-only forest is useful in cases where you have multiple forests in a database and you want to manage which forests change. To set a forest to only allow delete operations (and disallow inserts and updates), navigate to the configuration page for the forest you want to specify as delete-only and set the updates allowed field to delete-only.
When a forest is set to delete-only, updates to its documents that do not specify a forest ID throw an exception. Updates that specify one or more forest IDs of other forests in the database move the documents to one of those other forests. When a document moves to another forest, the old version of the document is marked as deleted and is removed from the original forest during the next merge.
To specify an update that will move a document in a delete-only forest to an updateable forest, you must specify the forest ID of at least one forest in which updates are allowed. One technique to accomplish this is to always specify all of the forest IDs, as in the following xdmp:document-insert example, which lists all of the forests in the database for the $forest-ids parameter:
xdmp:document-insert($uri, $node, (), (), 0, xdmp:database-forests(xdmp:database()) )
You can move a document from a delete-only forest to a forest that allows updates only by using an API that takes forest IDs and explicitly setting the forest IDs to include one or more forests that allow updates. The node-level update built-in functions (xdmp:node-replace, xdmp:node-insert-child, and so on) do not have a forest IDs parameter and therefore do not support moving documents.
Under normal operating circumstances, you likely will not need to set a forest to be delete-only. Additionally, even if the reindexer is enabled at the database level, documents in a forest that is set to delete-only will not be reindexed.
There are cases where delete-only forests are useful, however. One use case is when you have multiple forests and you want to control when some of them merge. The best way to control merges in a forest is to not insert any new content in the forest. In this scenario, you can set some of the forests to be delete-only, and those forests will not merge during that time (unless you manually specify a merge, either with the xdmp:merge API or by clicking the Merge button in the Admin Interface). After a while, you can rotate which forests are delete-only. For example, if you have four forests, you can make two of them delete-only for one day, and then make the other two delete-only the next day, switching the first two forests back to allowing updates. With this approach, only two forests are being updated (and periodically merging) at a time, so less disk space is needed for merging. For more details about merges, see Understanding and Controlling Database Merges.
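The four-forest rotation described above could be scripted along the following lines. This is a sketch only: the forest names "forest-1" through "forest-4" are assumptions, and how the script is scheduled is left open. Each change of update type restarts the affected forest.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumed forest names. Swap the two groups for the next rotation. :)
let $delete-only := ("forest-1", "forest-2")
let $updateable  := ("forest-3", "forest-4")

let $config := admin:get-configuration()
let $config := admin:forest-set-updates-allowed(
  $config, xdmp:forest($delete-only[1]), "delete-only")
let $config := admin:forest-set-updates-allowed(
  $config, xdmp:forest($delete-only[2]), "delete-only")
let $config := admin:forest-set-updates-allowed(
  $config, xdmp:forest($updateable[1]), "all")
let $config := admin:forest-set-updates-allowed(
  $config, xdmp:forest($updateable[2]), "all")
return admin:save-configuration($config)
```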
You can configure an existing forest to only allow reads and to disallow inserts, updates and deletes to any documents stored in the forest.
MarkLogic Server supports two read-only forest settings:
- read-only: When this update type is set, update transactions on the forest are immediately aborted.
- flash-backup: When this update type is set, update transactions on the forest are retried until either the update type is reset or the Default Time Limit set for the App Server is reached.

Only existing forests can be set to read-only or flash-backup. You cannot create a new forest with these settings.
A read-only forest is useful if you want to put your forests on read-only media and allow them to be queried. Another use of read-only is to control disk space. For example, in a multi-forest database, it might be useful to be able to mark one or more forests as read-only as they reach disk space limits.

One use for flash-backup is to prevent updates to the forest during a flash backup operation, which is a very fast backup that can be done on some file systems. You can set the flash-backup update type to temporarily put the forest in read-only mode for the duration of a flash backup and then reset the update type when the backup has completed. Transactions attempting to make changes to the forest during the backup period are retried.

Toggling between read-only or flash-backup and other forest update types triggers a forest restart. This activity is visible in the log file.
When the read-only or flash-backup update type is set, the forest will have the following characteristics:

- Update transactions on the forest are either aborted immediately (in the case of read-only) or retried later (in the case of flash-backup).

The Forest Summary page lists all of the forests in the cluster, along with various information about each forest, such as its status, which host is the primary host, and the amount of free space. It also lists which database each forest is attached to, and allows you to attach and/or detach forests from databases. Alternatively, you can use the Database Forest Configuration page to attach and detach a forest, as described in Attaching and/or Detaching Forests to/from a Database.
Perform the following steps using the Admin Interface to attach or detach one or more forests to or from a database:
If you change a database assignment from one database to another, it will detach the forest from the previous setting and attach it to the new setting. Be sure that is what you intend to do. Also, if you detach from one database and attach to another database with different index settings, the forest will begin reindexing if reindexer enable is set to true.
The forests you attached or detached are now reflected in the database configuration. Attaching a forest to a database and detaching a forest from a database are hot admin tasks.
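Reassigning a forest from one database to another can also be sketched with the Admin API. The forest name "example-forest" and the database names "old-db" and "new-db" are assumptions for illustration. As noted above, the forest may begin reindexing if the target database has different index settings.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumed forest and database names. :)
let $forest-id := xdmp:forest("example-forest")
let $config := admin:get-configuration()
(: Detach from the current database, then attach to the new one. :)
let $config := admin:database-detach-forest(
  $config, xdmp:database("old-db"), $forest-id)
let $config := admin:database-attach-forest(
  $config, xdmp:database("new-db"), $forest-id)
return admin:save-configuration($config)
```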
MarkLogic Server backs up forest data by transactionally creating an image copy of a specified forest. You can back up data at the granularity of a forest or of a database. Use the Admin Interface to back up a forest.
Forest-level backups only back up the data in a forest, and are not guaranteed to have a consistent database state to restore. The data in the forest is consistent, but other parts of the database (other forests, the schema database, and so on) might be different when you restore the data. For a guaranteed consistent backup, perform a complete database backup. For information on backing up a database, see Backing Up and Restoring a Database.
Forest backups do not provide a journal archive feature, as described for database backups in Backing Up and Restoring a Database. However, you can manually invoke the xdmp:start-journal-archiving function during a forest backup to make use of journal archiving with your forest backups.
This section describes the forest backup procedures, and includes the following parts:
To initiate a forest backup using the Admin Interface, complete the following procedure:
MarkLogic Server deletes all the files in the backup directory before writing the new backup. To retain multiple generations of backup, specify a different backup directory for each backup.
Your data in the selected forest is now backed up to the specified directory. Backing up your data is a hot admin task; the changes take effect immediately.
When performing backups on the Windows platform, ensure that no users have the Forests or Data directories (or any subdirectories within them) open while the backup is being made.
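For comparison with the Admin Interface procedure, a forest backup can be started from XQuery with xdmp:forest-backup. In this sketch the forest name and backup path are assumptions. Any existing files in the backup directory are deleted before the new backup is written, so use a separate directory for each generation you want to keep.

```xquery
xquery version "1.0-ml";

(: Assumed forest name and backup directory. :)
xdmp:forest-backup(
  xdmp:forest("example-forest"),
  "/backups/example-forest")
```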
You can schedule forest backups to periodically back up a forest. You can schedule backups to occur daily, weekly, or monthly, or you can schedule a one-time backup. You can create as many scheduled backups as you want. To create a scheduled backup, perform the following steps using the Admin Interface:
The backups will automatically start according to the specified schedule.
You can restore a forest from an earlier backup using the Admin Interface. Backups are restored at the forest granularity only.
To restore a forest from a backup made previously, complete the following procedure:
Restoring data from your backup is a hot admin task; the changes take effect immediately.
When performing restores on the Windows platform, ensure that no users have the Forests or Data directories (or any subdirectories within them) open while the restore process is executing.
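The equivalent restore can be sketched with xdmp:forest-restore. The forest name and backup path below are assumptions and must match a backup that was actually written to that directory.

```xquery
xquery version "1.0-ml";

(: Assumed forest name and backup directory. :)
xdmp:forest-restore(
  xdmp:forest("example-forest"),
  "/backups/example-forest")
```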
You can use the xdmp:forest-rollback function to roll the state of one or more forests back to a specified system timestamp. To roll forest(s) back to an earlier timestamp, you must first set the merge timestamp to keep deleted fragments from that specified timestamp. For details on rolling back a forest, including the procedure to perform a rollback, see Rolling Back a Forest to a Particular Timestamp in the Application Developer's Guide and the xdmp:forest-rollback API documentation in the MarkLogic XQuery and XSLT Function Reference.
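As a rough illustration of the call itself (not the full procedure, which is in the guide referenced above), the following sketch rolls back all forests of a database to a timestamp supplied by the caller. The database name "Documents" is an assumption, and the external variable is left for the caller to supply. The sketch presumes the merge timestamp has already been set so that deleted fragments from that timestamp are still available.

```xquery
xquery version "1.0-ml";

(: Supplied by the caller; must not be earlier than the merge timestamp. :)
declare variable $timestamp as xs:unsignedLong external;

(: "Documents" is an assumed database name. :)
xdmp:forest-rollback(
  xdmp:database-forests(xdmp:database("Documents")),
  $timestamp)
```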
You can merge the forest data using the Admin Interface. As described in Understanding and Controlling Database Merges, merging a forest improves performance and is periodically done automatically in the background by MarkLogic Server. The Merge button allows you to explicitly merge the data for this forest.
To explicitly merge the forest, complete the following procedure:
Merging data in a forest is a hot admin task; the changes take effect immediately.
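An explicit merge of a single forest can also be requested from XQuery with xdmp:merge, restricting the merge to that forest through the options node. The forest name "example-forest" is an assumption for illustration.

```xquery
xquery version "1.0-ml";

(: "example-forest" is an assumed forest name. :)
xdmp:merge(
  <options xmlns="xdmp:merge">
    <forests>
      <forest>{xdmp:forest("example-forest")}</forest>
    </forests>
  </options>)
```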
You can clear the document data from a forest using the Admin Interface. Clearing a forest removes all fragments from the forest, but does not remove its configuration information.
To clear all data from a forest, complete the following procedure:
Clearing data in a forest is a hot admin task; the changes take effect immediately.
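A minimal programmatic counterpart uses xdmp:forest-clear. The forest name is an assumption. As with the Admin Interface, this removes all fragments from the forest but leaves its configuration in place.

```xquery
xquery version "1.0-ml";

(: "example-forest" is an assumed forest name. :)
xdmp:forest-clear(xdmp:forest("example-forest"))
```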
You can disable a forest using the Admin Interface. Disabling a forest unmounts the forest from the database and clears all memory caches for all the forests in the database. The database remains unavailable for any query operations while any of its forests are disabled.
Disabling a forest does not delete the configuration or document data. The forest can later be re-enabled by clicking Enable.
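Disabling and re-enabling a forest can be sketched with the Admin API as follows. The forest name is an assumption; pass fn:true() instead of fn:false() to enable the forest again.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: "example-forest" is an assumed forest name. :)
let $forest-id := admin:forest-get-id($config, "example-forest")
(: fn:false() disables the forest; fn:true() enables it. :)
let $config := admin:forest-set-enabled($config, $forest-id, fn:false())
return admin:save-configuration($config)
```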
You can use the Admin Interface to delete a forest. There are two levels of forest deletion:
The forest cannot be deleted if it is still attached to a database. Also, you can delete the configuration information on a Read-Only or Flash-Backup forest, but you cannot do a Full Delete on such forests.
To delete a forest, complete the following procedure:
Deleting a forest is a hot admin task; the changes take effect immediately.
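The two levels of deletion map onto the boolean argument of admin:forest-delete, as in the following sketch. The forest name is an assumption, and the forest must already be detached from any database.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: "example-forest" is an assumed forest name. :)
let $forest-id := admin:forest-get-id($config, "example-forest")
(: fn:false() removes only the configuration information;
   fn:true() would also delete the forest data. :)
let $config := admin:forest-delete($config, $forest-id, fn:false())
return admin:save-configuration($config)
```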
MarkLogic Server transactions may participate in global, distributed XA transactions. The XA Transaction Manager usually manages the life cycle of transactions participating in an XA transaction, independent of MarkLogic Server. However, it may be necessary to manually roll back the MarkLogic Server portion of a global transaction (called a branch) if the Transaction Manager is unreachable for a long time. For details, see Heuristically Completing a Stalled Transaction in the XCC Developer's Guide.
Heuristic completion bypasses the Transaction Manager and the Two Phase Commit process, so it can lead to data integrity problems. Use heuristic completion only as a last resort.
Before the MarkLogic Server branch of an XA transaction is prepared, the transaction may be rolled back from the host status page of the host evaluating the transaction. See Rolling Back a Transaction.
Once the MarkLogic Server branch of an XA transaction enters the prepared state, the transaction appears only on the Forest Status page of the coordinating forest. To find the coordinating forest, examine the Forest Status page for each forest belonging to the participating database.
To heuristically roll back the MarkLogic Server portion of an XA transaction using the Admin Interface, follow these steps:
Open the Admin Interface in your browser, for example at http://yourhost:8001. On the Forest Status page of the coordinating forest, click [rollback] on the right side of the target transaction status to initiate the rollback. The rollback confirmation dialog appears.

The rolled back transaction enters the remember abort state, indicating MarkLogic Server should remember that the local transaction was aborted until the Transaction Manager re-synchronizes the global transaction. Once re-synchronization occurs, the transaction no longer appears in the forest status. For details, see Heuristically Completing a MarkLogic Server Transaction in the XCC Developer's Guide.
You may use the Forest Status page to force MarkLogic Server to forget the rollback without waiting for the Transaction Manager. This is not recommended, as it leads to errors and, potentially, a loss of data integrity when the Transaction Manager attempts to re-synchronize the global transaction. If forgetting the rollback is necessary, use the [forget] link in the transaction list on the Forest Status page.