This chapter describes how to configure your MarkLogic Servers for replication. The topics in this chapter assume you are familiar with the replication principles described in Understanding Flexible Replication and the basic configuration procedures described in the Flexible Replication Quick Start.
The flexrep-admin role is required to configure replication. The user who will access the Replica App Server when pushing, or access the Master App Server when pulling, requires the flexrep-user role. Though you will typically configure replication as the Admin user, you should create a unique replication user to be associated with the replication tasks. The replication user must be given the flexrep-user role and have the privileges necessary to update the domain content on both the Master and Replica App Servers.
You can configure SSL on your Master and Replica App Servers to encrypt the replicated data passed between them. For details on configuring SSL on an App Server, see the Configuring SSL on App Servers chapter in the Administrator's Guide.
If SSL on the Replica App Server is configured to require a client certificate from the Master, the Target URL must start with HTTPS. Paste the PEM-encoded client certificate and client key in the fields located at the bottom of the Database Replication Target Administration page. If the client key is encrypted, you must also specify a pass phrase.
The same security configuration applies when configuring the Replica App Server for pull replication, if the Master App Server requires a client certificate from the Replica. For details on how to configure pull replication, see Creating a Pull Replication Task.
Before you can define replicated domains, you must enable CPF as described in Configuring a Flexible Replication Pipeline for the Master Database.
CPF domains are described in detail in the Understanding and Using Domains chapter in the Content Processing Framework Guide. The purpose of this section is to describe how to create a domain that defines the scope of the documents in the Master database to be replicated to the Replica database.
The following procedure describes how to create a new domain for defining the scope of the replicated documents. By default, CPF creates a Default Master domain to replicate from the '/' root directory of the Master database. The '/' root means that the document URIs must be preceded by a '/', such as /content/foo.xml. You can also set the Document Scope of the domain to replicate a user collection or an individual document.
For example, a domain named Baz can be scoped to the /projects/Baz/ directory on the Master database.
You may receive a warning that this domain overlaps with the Default Master domain. This means that both the Baz domain and the Default Master domain are configured to replicate the documents in the /projects/Baz/ directory. No document can be in more than one domain, so you must either delete or modify the Default Master domain to resolve this conflict.
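The overlap warning can be reasoned about with a small model. The following Python sketch is not MarkLogic code; it assumes both domains use a directory scope that covers all descendants, and the function name is hypothetical. Under that assumption, two directory scopes overlap when one directory path is a prefix of the other.

```python
def directory_scopes_overlap(dir_a: str, dir_b: str) -> bool:
    """Return True if two directory scopes could contain the same document URI.

    Assumption: each scope covers the directory and everything beneath it.
    """
    # Normalize so that "/projects/Baz" and "/projects/Baz/" are treated alike.
    a = dir_a if dir_a.endswith("/") else dir_a + "/"
    b = dir_b if dir_b.endswith("/") else dir_b + "/"
    # One scope contains the other when its path is a prefix of the other path.
    return a.startswith(b) or b.startswith(a)


# The Default Master domain ("/") contains every directory, including Baz.
print(directory_scopes_overlap("/", "/projects/Baz/"))               # True
# Sibling directories do not share documents.
print(directory_scopes_overlap("/projects/Baz/", "/projects/Foo/"))  # False
```

In this model, narrowing the Default Master domain to a directory that is neither an ancestor nor a descendant of /projects/Baz/ removes the overlap.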
If you want to create a domain to replicate the entire database, you can assign a default collection to all of your users and then set the Document Scope to Collections and the URI to the name of the default collection. For details on how to establish a default collection for a user, see Creating a User in the Administrator's Guide.
As described in Creating a Replication App Server, in a push replication configuration, the Replica database must be connected to a Replication App Server. In a pull replication configuration, the Master database must be connected to a Replication App Server. This section goes into more detail on creating and configuring Replication App Servers.
The purpose of the other fields in the HTTP Server Configuration page is described in Creating a New HTTP Server in the Administrator's Guide. If you are going to configure SSL on the App Server to require a client certificate, you must specify the certificate in the Database Replication Target Administration configuration page, as described in Replication Security.
For example, if you are configuring the Master database to push updates to a Replica App Server that requires a client certificate, you must include a client certificate in the Database Replication Target Administration configuration page for the Master database. If you are configuring a Replica App Server to pull updates from the Master database that requires a client certificate, you must include a client certificate in the Replication Pull configuration page for the Replica database. For details on providing a client certificate for either a push or pull replication configuration, see Replication Security.
Before you can configure a replication target, you must have content processing enabled for the Master database. After configuring push replication, you must create a scheduled task, as described in Creating a Push-Local-Forests Replication Task.
Under Authentication, specify the username and password for a user assigned the flexrep-user role. This user must have the same username/password as the user on the Master App Server who creates and updates the documents. This user needs to have sufficient permissions on the target system to insert and update the replicated documents.
Because Flexible Replication passes through the permissions set on the Master App Server when replicating, the user on the target must have permissions to update the document later on. Permissions can either be configured by the administrator on both the Master and Replica App Server, or you can use a filter to adjust the document permissions so the target user can later update the document, as described in Adding Document Permissions.
Regardless of whether you configure replication as push or pull, you must create a scheduled task to periodically replicate updated content on the Master to the Replicas. A scheduled replication task does the following:
The push-local-forests task setting is the preferred method for retry and zero-day replication. Once the initial scheduled interval is reached, this task pushes each batch of documents, immediately following the previous batch, regardless of the scheduled task frequency. If you had previously used the push task, switching over to the push-local-forests setting will provide better performance.
The push-local-forests task for each forest can be respawned a maximum of 500 times per replication period, in order to avoid overflowing the task limit. After 500 respawns, the task waits for the next replication period to start replicating again. Each respawn replicates one batch of documents to the Replica.
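As a rough illustration of what the respawn limit implies, the following Python sketch estimates an upper bound on the documents one forest can push in a single replication period. It is a model, not MarkLogic code. It assumes each respawn sends exactly one full batch and that the initial run is not counted separately; the function names are hypothetical.

```python
import math

MAX_RESPAWNS_PER_PERIOD = 500  # limit stated in the text


def max_documents_per_period(documents_per_batch: int) -> int:
    """Upper bound on documents pushed by one forest in one replication period.

    Assumption: every respawn replicates one full batch.
    """
    return MAX_RESPAWNS_PER_PERIOD * documents_per_batch


def periods_needed(backlog: int, documents_per_batch: int) -> int:
    """Replication periods needed to drain a backlog from a single forest."""
    return math.ceil(backlog / max_documents_per_period(documents_per_batch))


print(max_documents_per_period(100))   # 50000
print(periods_needed(120_000, 100))    # 3
```

Under these assumptions, a backlog larger than 500 batches spills into the next replication period even though batches are otherwise pushed back to back.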
The push task pushes a batch of documents from all of the forests in the database. This task pushes one batch from the entire database per scheduled interval. This form of push replication is available for users who configured replication on earlier versions of MarkLogic Server. If you are currently configured for this form of push replication, you will obtain better performance if you switch to push-local-forests replication.
The pull task is configured on a Replica database to pull data from the Master database. This is the only replication option available to a Replica, so the pull scheduled task must serve as the initial replication mechanism in addition to retry and zero-day replication.
For all scheduled replication tasks, the replication retry configuration is a combination of the scheduled task frequency settings and the target settings: Retry Seconds Min, Retry Seconds Max, and Documents Per Batch.
The task type in the Scheduled Task Configuration page indicates the interval at which each scheduled replication task is to run. In most configurations, the task type will be minutely. For details on configuring scheduled tasks, see Scheduling Tasks in the Administrator's Guide.
The Retry Seconds Min/Max settings in the Database Replication Target Administration page indicate the minimum and maximum number of seconds any documents that failed to replicate are eligible for replication retries. In a push configuration, the Documents Per Batch setting specifies how many documents that have failed to replicate are to be retried during each scheduled task interval. In a pull configuration, the Documents Per Batch setting specifies the total number of replicated documents (retries of failed documents and newly added or updated documents) that are to be pulled from the Master during each scheduled task interval.
For example, suppose the retry minimum is 30 seconds, the retry maximum is 300 seconds (five minutes), and the scheduled replication task runs every minute. If a document fails replication, it is eligible to be retried in 30 seconds. This means that MarkLogic Server will attempt to replicate the document at the next one-minute interval. The retry interval is doubled each time the document fails to replicate until the interval reaches the maximum retry setting, at which time the retry interval remains at the maximum. So, if the document in our example fails a second time, it will be eligible for retry in one minute. Should the document fail to replicate a third time, the retry interval is set to two minutes, and so on until the interval reaches the five-minute retry maximum setting, after which MarkLogic Server tries to replicate the document every five minutes.
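The doubling rule in this example can be written out as a short calculation. The Python sketch below is only a model of the behavior described above, not MarkLogic code. The function names are hypothetical, and it assumes the scheduled task runs at exact multiples of its period.

```python
import math
from typing import List


def retry_intervals(retry_min: int, retry_max: int, failures: int) -> List[int]:
    """Seconds a document waits before each retry, for consecutive failures."""
    intervals = []
    interval = retry_min
    for _ in range(failures):
        intervals.append(interval)
        # Double after every failure, but never exceed the configured maximum.
        interval = min(interval * 2, retry_max)
    return intervals


def next_task_run(eligible_after: int, task_period: int) -> int:
    """First scheduled run (in seconds) at or after the document is eligible."""
    return math.ceil(eligible_after / task_period) * task_period


# Retry Seconds Min = 30, Retry Seconds Max = 300, six consecutive failures.
print(retry_intervals(30, 300, 6))   # [30, 60, 120, 240, 300, 300]
# Eligible after 30 seconds, but the task only runs once a minute.
print(next_task_run(30, 60))         # 60
```

The second function shows why a 30-second retry minimum has no practical effect when the task runs every minute: the retry happens at the first task run after the document becomes eligible.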
The Documents Per Batch setting also plays a role with replication retries. For example, if the batch value is one and five documents have failed to replicate, then MarkLogic Server will only attempt to retry one failed document that is eligible for replication retries during each scheduled task interval. The document selected for replication retry is the earliest eligible document.
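The selection rule can be sketched as follows. This Python model is not MarkLogic code. The FailedDocument type and the select_retry_batch function are hypothetical, and it assumes each failed document carries the time at which it next becomes eligible for retry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailedDocument:
    uri: str
    eligible_at: int  # seconds since some reference point (hypothetical field)


def select_retry_batch(failed: List[FailedDocument], now: int,
                       documents_per_batch: int) -> List[FailedDocument]:
    """Pick the documents to retry in one scheduled task interval."""
    # Only documents whose retry interval has elapsed are candidates.
    eligible = [doc for doc in failed if doc.eligible_at <= now]
    # The earliest eligible documents are retried first.
    eligible.sort(key=lambda doc: doc.eligible_at)
    return eligible[:documents_per_batch]


failed = [
    FailedDocument("/content/a.xml", 90),
    FailedDocument("/content/b.xml", 30),
    FailedDocument("/content/c.xml", 60),
    FailedDocument("/content/d.xml", 300),
    FailedDocument("/content/e.xml", 45),
]
# With Documents Per Batch set to 1, only the earliest eligible document is retried.
print([doc.uri for doc in select_retry_batch(failed, now=120, documents_per_batch=1)])
# ['/content/b.xml']
```

With a batch value of 1, the four remaining documents wait for later task intervals, which is why a larger batch value is usually preferable.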
Though the Documents Per Batch setting is 1 by default, a more typical value is in the range of 10 - 100. If there is a large number of documents to be replicated in a zero-day fashion, you can maximize the load speed by closely matching the scheduled interval with the Documents Per Batch setting, so that the maximum number of documents are loaded within the time interval. For example, if your scheduled interval is one minute and you have determined that you can replicate a maximum of 250 documents per minute, then the optimum Documents Per Batch setting would be 250.
When a flexible replication domain includes multiple targets, the scheduled task uses the smallest of the configured Documents Per Batch settings. So if one target has Documents Per Batch set to 1, and another has Documents Per Batch set to 100, the scheduled task uses a batch size of 1.
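The rule amounts to taking a minimum over the targets in the domain. A minimal Python sketch, with hypothetical target names and a hypothetical function name:

```python
from typing import Mapping


def effective_batch_size(documents_per_batch: Mapping[str, int]) -> int:
    """Batch size the scheduled task uses for a domain with several targets."""
    # The smallest configured value wins, so one low setting throttles all targets.
    return min(documents_per_batch.values())


targets = {"replica-east": 1, "replica-west": 100}
print(effective_batch_size(targets))  # 1
```

Because a single target left at the default of 1 limits every target in the domain, it is worth checking the setting on each target when tuning throughput.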
This section describes how to configure a push-local-forests replication task to provide the Master with the means to replicate deletes to the Replica, retry replication on documents that have failed the initial attempt to replicate, and to replicate documents that were in the replicated domain before replication was configured.
In a pull replication configuration, the Replicas pull updates from the Master. Unlike push replication, which replicates as soon as the documents on the Master are updated, the only way to configure pull replication is by means of a scheduled task.
flexrep-user role on a Replica host. Under task host, select a Replica host or leave the field empty so the task can run on any host in the cluster. Click OK.
Pull Replication is useful when a Replica is behind a firewall that only allows its internal servers to pull from a Master server outside the firewall. This section describes how to configure your Replica database for Pull Replication.
If the Replica is to act as a Master for another Replica, then you must have CPF enabled on the Replica, as described in Configuring a Flexible Replication Pipeline for the Master Database. However, if the Replica has CPF installed and it is not to act as a Master, then you must disable the Flexible Replication pipeline on the Replica.
Combining alerting with flexible replication is often referred to as QBFR (Query-Based Flexible Replication). Query-based flexible replication enables customizable information sharing (using filters) between systems, allowing for the easy and secure distribution of portions of data even across disconnected, intermittent, and latent networks.
This type of replication is based on a query (an alert) that triggers because of an event that matches that query. A user can have more than one alert, in which case they would receive documents that match any of their alerts. In addition to queries, the permissions for a user are taken into account. The user will only receive replicated content that they have permission to view in the database. If the permissions change, the replica will be updated accordingly. Most often query-based flexible replication is a pull configuration, but it can also be set up as a push configuration.
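The two conditions described above (a match on any of the user's alerts, plus read permission) can be modeled as a simple predicate. The Python sketch below is a deliberately simplified model, not MarkLogic code. It assumes each alert is a set of words that matches when any one word occurs in the document, and that read permission is granted through shared roles. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Document:
    uri: str
    words: FrozenSet[str]       # words that occur in the document
    read_roles: FrozenSet[str]  # roles that have read permission


def matches_any_alert(doc: Document, alerts: List[FrozenSet[str]]) -> bool:
    """True if at least one alert has a word that occurs in the document."""
    return any(doc.words & alert for alert in alerts)


def is_replicated_to(doc: Document, user_roles: FrozenSet[str],
                     alerts: List[FrozenSet[str]]) -> bool:
    """A document reaches the user only if it is readable and matches an alert."""
    can_read = bool(doc.read_roles & user_roles)
    return can_read and matches_any_alert(doc, alerts)


alerts = [frozenset({"dna", "rna"}), frozenset({"protein"})]
doc = Document("/content/foo.xml", frozenset({"rna", "sequence"}), frozenset({"researcher"}))

print(is_replicated_to(doc, frozenset({"researcher"}), alerts))  # True
print(is_replicated_to(doc, frozenset({"guest"}), alerts))       # False
```

In this model, removing the researcher role from the document's read roles makes the predicate false for that user, which corresponds to the replica being updated when permissions change.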
By setting up alerts, replication takes place any time content in the database matches that query. Any new or updated content within the domain scope will cause all matching rules or alerts to perform their corresponding action. Query-based flexible replication can be used with filters to share specific documents or parts of documents. Filters can be set up in either a push or pull configuration.
To set up query-based flexible replication, you need to have the environment for flexible replication already configured. See Flexible Replication Quick Start for information on setting up flexible replication using the Admin UI. Configuring Query-based Replication using the REST API in the Scripting Administrative Tasks Guide explains how to set up query-based flexible replication including alerts using scripting and REST.
QBFR users should have the flexrep-user role. QBFR targets will only replicate documents for which the associated user has read permission, as in normal MarkLogic security. Your configuration will most likely have additional or more complex permissions on documents.
Query-based flexible replication requires an alert to trigger the replication. These steps use XQuery to access MarkLogic built-in functions to create and configure alerts. To see the complete scripted version of this process, see Configuring Query-based Replication using the REST API in the Scripting Administrative Tasks Guide.
This section includes examples of these tasks using XQuery. See Configuring Query-based Replication using the REST API in the Scripting Administrative Tasks Guide for a complete scripting example using REST.
Once you have your flexible replication environment configured, the first step in setting up query-based flexible replication is to create an alerting domain. Use alert:make-config to create an alerting domain named 'http://acme.com/alerting' containing 'alerting rules for query-based flexrep', and use alert:config-insert to add it to the acme.com alert configuration. The following code then uses alert:action-insert to add a log action to that configuration:
    xquery version "1.0-ml";
    import module namespace alert = "http://marklogic.com/xdmp/alert"
      at "/MarkLogic/alert.xqy";

    (: Add a "log" action to the http://acme.com/alerting configuration. :)
    alert:action-insert(
      "http://acme.com/alerting",
      alert:make-action(
        "log",
        "QBFR log action",
        xdmp:database("master-modules"),
        "/",
        "/log.xqy",
        <alert:options/>))
Traditional alerting in MarkLogic requires that log.xqy exist in the modules database so it can be called when the alert triggers. For QBFR, log.xqy is not called and therefore does not actually need to exist.
Next you need to associate the alerting configuration with a CPF domain. Use flexrep:configuration-set-alerting-uri to associate the alerting configuration with the CPF domain used for flexible replication (my-cpf-domain in this example).
    xquery version "1.0-ml";
    import module namespace flexrep = "http://marklogic.com/xdmp/flexible-replication"
      at "/MarkLogic/flexrep.xqy";

    (: Identify the CPF domain in the same way as the other examples in this section. :)
    let $domain := "my-cpf-domain"
    let $cfg := flexrep:configuration-get($domain, fn:true())
    return
      (: Set the alerting URI on the configuration and save it. :)
      flexrep:configuration-insert(
        flexrep:configuration-set-alerting-uri(
          $cfg, flexrep:domain-alerting-uri($domain)))
Once alerting has been set up, the next step is to associate the QBFR target with a user using flexrep:configuration-target-set-user-id.
    xquery version "1.0-ml";
    import module namespace flexrep = "http://marklogic.com/xdmp/flexible-replication"
      at "/MarkLogic/flexrep.xqy";

    let $domain := "my-cpf-domain"
    let $cfg := flexrep:configuration-get($domain, fn:true())
    let $target-id := flexrep:configuration-target-get-id($cfg, "QBFR target")
    let $user-id := xdmp:user("User1")
    return
      (: Associate the target with the user and save the updated configuration. :)
      flexrep:configuration-insert(
        flexrep:configuration-target-set-user-id($cfg, $target-id, $user-id))
The user can also be configured with flexrep:target-create when creating the target.
The next step is to create an alerting rule for the replication using alert:make-rule. This example creates a rule that says: if any new content contains the words 'dna' or 'rna', send an email alert to the address given in the alert:email-address option.
    xquery version "1.0-ml";
    import module namespace alert = "http://marklogic.com/xdmp/alert"
      at "/MarkLogic/alert.xqy";

    alert:make-rule(
      "nucleic acids email",
      "Alert me to anything concerning nucleic acids",
      0,
      cts:or-query((
        cts:word-query("dna"),
        cts:word-query("rna")
      )),
      "email",
      <alert:options>
        <alert:email-address>email@example.com</alert:email-address>
      </alert:options>
    )
See Configuring Query-based Replication using the REST API in the Scripting Administrative Tasks Guide for a complete scripting example, including how to create an alerting rule. For more details about creating an alert, see Creating Alerting Applications in the Search Developer's Guide.
Query-based flexible replication is useful in circumstances where you need to share specific information (documentation, photographs, and so on) with many others in different locations. To improve security or reliability, flexible replication allows data to be transformed and filtered before replication. This enables control of what documents or parts of documents are shared, or how data should be presented. In remote locations you might have disconnected devices or intermittent connectivity, so a pull setup will be more effective.
Query-based flexible replication enables people in the field to have access to specific, targeted information in a timely manner, be able to update or add to that information, and then replicate the results back to headquarters, with all the security and reliability provided by MarkLogic. For example, geologists for oil and gas companies can take information that they need with them to remote locations, perform field analysis, take pictures, write up reports, and share that information when they reconnect to the network.
Safety inspectors could perform a quick search online to acquire the exact information they need to inspect a certain location and replicate the data to a laptop. When the on-site inspection has been performed, analysis and information can be updated locally. Once connection to the network has been established, the results are replicated and shared back to the main office.
Researchers working in different company locations can share large data sets of information, with fast local access and the ability to independently update the data. The information can then be aggregated and shared, as appropriate, with others throughout the company.
With a push configuration, if the devices are disconnected, the server will continue to retry the operation to send the information. You could end up with a document error condition if you have too many retries in a row. The flexrep:document-reset function can be used to clear the error condition and schedule replication of the document. You can use flexrep:domain-target-status to query for documents that have had an error in replication.
The configuration for flexible replication is stored in the Master and Replica databases. Clearing a Master database has the effect of deleting the target configurations, as well as all document-level information about target replication status. When you back up a Master database, you are also saving the replication configuration information. Should you restore that database to another cluster, the flexible replication configuration will also be present for the database in that cluster. Should you restore a database that is not configured for replication to one that was previously configured for replication, the replication configuration will be lost.
Before clearing or restoring a Master database, you can save the replication configuration by calling the flexrep:configuration-get function. Once the Master is cleared or restored, the replication configuration can be restored using the flexrep:configuration-insert function. Alternatively, you can recreate the replication configuration programmatically, as described in Scripting Flexible Replication Configuration in the Scripting Administrative Tasks Guide.
Clearing a Replica database impacts the Master in that it no longer understands the status of the target. The same issues related to backing up and restoring the Master database described above also apply to backing up and restoring the Replica database. If the target status is lost on the Replica, new updates will replicate to the target, but the Master will not know that the older, unchanged documents are missing. Before clearing or restoring a Replica database, disable replication. After clearing or restoring the Replica database, delete the existing target on the Master, create a new target with the same configuration as the old one, and re-enable replication. The scheduled task will gradually populate the new target's database with all the documents in the domain.
If you have pull replication configured on the Replica, then you can save the configuration by calling the flexrep:pull-get function before clearing the database. After clearing or restoring the Replica database, you can restore the pull replication configuration by calling the
In flexible replication, a large binary is replicated in chunks to the replica. In normal circumstances, when all chunks are replicated, the replica will reassemble these chunks back into one large binary. If the master dies while replicating some of the chunks, the chunks that were already replicated will be left on the replica. When the master comes back online again, the process will resume.
In some cases the master may be permanently removed while the chunks are being replicated. To reclaim disk space, you can use the timestamp of the binary chunks to find and remove them. The flexrep:binary-chunk-uris(ts as xs:dateTime) function returns the URIs of all binary chunks that are older than the given wall clock time. This will list all of the binary chunks that are older than the time specified by
    xquery version "1.0-ml";
    import module namespace flexrep = "http://marklogic.com/xdmp/flexible-replication"
      at "/MarkLogic/flexrep.xqy";

    flexrep:binary-chunk-uris(xs:dateTime("2014-10-01T08:00:00"))

    (: Returns the URIs of binary chunks that were created before
       2014-10-01T08:00:00. :)
The flexrep:binary-chunk-uris API requires the flexrep-admin privilege, and the URI lexicon must be enabled (it is enabled by default).
    xquery version "1.0-ml";
    import module namespace flexrep = "http://marklogic.com/xdmp/flexible-replication"
      at "/MarkLogic/flexrep.xqy";

    let $delete :=
      <flexrep:delete xmlns:flexrep="http://marklogic.com/xdmp/flexible-replication">
        <doc:uri xmlns:doc="xdmp:document-load">/content/foo.xml</doc:uri>
        <flexrep:last-updated>2010-09-28T14:35:12.714-08:00</flexrep:last-updated>
      </flexrep:delete>
    return flexrep:delete($delete)

    (: Applies the specified delete element to /content/foo.xml. This effectively
       deletes the document from both the Master and Replica databases. :)