QueryBatcher (MarkLogic Java Client API 7.0.0)

All Superinterfaces:

Batcher
```
public interface QueryBatcher
extends Batcher
```
To facilitate long-running read, update, and delete use cases, coordinates threads to process batches of uris matching a query or coming from an Iterator. Each batch of uris matching a query will come from a single forest. The host for that forest is the target of the DatabaseClient provided to the listener's processEvent method. The query is performed directly on each forest associated with the database for the DatabaseClient provided to DataMovementManager. The end goal of each job is determined by the listeners registered with onUrisReady. The data set from which batches are made and on which processing is performed is determined by the query or Iterator used to construct this instance.
While the most customized use cases will be addressed by custom listeners, the common use cases are addressed by the provided listeners, including ApplyTransformListener, DeleteListener, ExportListener, and ExportToWriterListener. A provided listener is used by adding an instance via onUrisReady like so:
```
     QueryBatcher qhb = dataMovementManager.newQueryBatcher(query)
         .withConsistentSnapshot()
         .onUrisReady( new DeleteListener() )
         .onQueryFailure(exception -> exception.printStackTrace());
     JobTicket ticket = dataMovementManager.startJob(qhb);
     qhb.awaitCompletion();
     dataMovementManager.stopJob(ticket);
```
Custom listeners will generally use the [MarkLogic Java Client API][] to manipulate the documents for the uris in each batch.

QueryBatcher is designed to be highly scalable and performant. To accommodate the largest result sets, QueryBatcher paginates through matches rather than loading matches into memory. To prevent queueing too many tasks when running a query, QueryBatcher only adds another task when one completes the query and is about to send the matching uris to the onUrisReady listeners.
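
The task-queueing behavior described above can be modeled with plain JDK classes. The following is only a sketch under stated assumptions: `ChainedPagingSketch` and its methods are hypothetical names, an in-memory list stands in for the query, and offset paging stands in for whatever QueryBatcher does internally. It shows the general technique of queueing the next page's task only after the current page has been retrieved, so the executor queue never fills up with pending page requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model, not MarkLogic code: each task "queries" one page and
// queues the task for the next page only once its own page is in hand.
class ChainedPagingSketch {

    // Processes every uri in pages of pageSize and returns how many were handled.
    static int run(List<String> uris, int pageSize, int threads) {
        if (pageSize < 1 || threads < 1) {
            throw new IllegalArgumentException("pageSize and threads must be >= 1");
        }
        if (uris.isEmpty()) {
            return 0;
        }
        int pages = (uris.size() + pageSize - 1) / pageSize;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(pages);
        AtomicInteger processed = new AtomicInteger();
        try {
            submitPage(pool, uris, 0, pageSize, processed, done);
            done.await(); // returns once every page has been processed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
        return processed.get();
    }

    private static void submitPage(ExecutorService pool, List<String> uris, int offset,
                                   int pageSize, AtomicInteger processed, CountDownLatch done) {
        pool.execute(() -> {
            int end = Math.min(offset + pageSize, uris.size());
            List<String> page = uris.subList(offset, end); // stands in for running the query
            if (end < uris.size()) {
                // Only now is the next page queued, so at most one retrieval is pending.
                submitPage(pool, uris, end, pageSize, processed, done);
            }
            processed.addAndGet(page.size()); // stands in for the onUrisReady listeners
            done.countDown();
        });
    }

    public static void main(String[] args) {
        List<String> uris = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            uris.add("/doc" + i + ".json");
        }
        System.out.println(run(uris, 3, 2)); // prints 10
    }
}
```

Because the next page is queued before the current page is processed, a second thread can retrieve page n+1 while the first is still busy with the listeners for page n.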

For pagination to succeed, you must not modify the result set during pagination. This means you must
1. perform a read-only operation, or
2. make sure modifications do not modify the result set by deleting matches or modifying them to no longer match, or
3. set a merge timestamp and use withConsistentSnapshot(), or
4. use Iterator instead of a query.
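
To see why modifying the result set breaks pagination, the following plain-Java sketch may help. It is a simplified model, not QueryBatcher's implementation: the class and method names are hypothetical, and offset-based paging over an in-memory list is an assumption made for illustration. The first method deletes matches while paging through the live result set and skips items; the second pages through a fixed copy, which loosely corresponds to options 3 and 4 above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of offset paging over a result set that shrinks during the job.
class PaginationShiftDemo {

    // Pages through the live list by offset and deletes each page's items from it.
    static List<String> deleteWhilePaging(List<String> live, int pageSize) {
        List<String> processed = new ArrayList<>();
        int offset = 0;
        while (offset < live.size()) {
            int end = Math.min(offset + pageSize, live.size());
            List<String> page = new ArrayList<>(live.subList(offset, end));
            processed.addAll(page);
            live.removeAll(page); // remaining matches shift toward the front
            offset += pageSize;   // so this offset now skips over unprocessed items
        }
        return processed;
    }

    // Pages through a fixed copy taken up front, so deletions cannot shift the pages.
    static List<String> deleteFromSnapshot(List<String> live, int pageSize) {
        List<String> snapshot = new ArrayList<>(live);
        List<String> processed = new ArrayList<>();
        for (int offset = 0; offset < snapshot.size(); offset += pageSize) {
            int end = Math.min(offset + pageSize, snapshot.size());
            List<String> page = snapshot.subList(offset, end);
            processed.addAll(page);
            live.removeAll(page);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Arrays.asList("/1", "/2", "/3", "/4", "/5", "/6"));
        System.out.println(deleteWhilePaging(a, 2) + " left behind: " + a);
        // prints [/1, /2, /5, /6] left behind: [/3, /4]

        List<String> b = new ArrayList<>(Arrays.asList("/1", "/2", "/3", "/4", "/5", "/6"));
        System.out.println(deleteFromSnapshot(b, 2) + " left behind: " + b);
        // prints [/1, /2, /3, /4, /5, /6] left behind: []
    }
}
```
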
Sample usage using withConsistentSnapshot():
```
     QueryDefinition query = new StructuredQueryBuilder().collection("myCollection");
     QueryBatcher qhb = dataMovementManager.newQueryBatcher(query)
         .withBatchSize(1000)
         .withThreadCount(20)
         .withConsistentSnapshot()
         .onUrisReady(batch -> {
             for ( String uri : batch.getItems() ) {
                 if ( uri.endsWith(".txt") ) {
                     batch.getClient().newDocumentManager().delete(uri);
                 }
             }
         })
         .onQueryFailure(exception -> exception.printStackTrace());
     JobTicket ticket = dataMovementManager.startJob(qhb);
     qhb.awaitCompletion();
     dataMovementManager.stopJob(ticket);
```
Example of queueing uris in memory instead of using withConsistentSnapshot():
```
     List<String> uris = Collections.synchronizedList(new ArrayList<>());
     QueryBatcher getUris = dataMovementManager.newQueryBatcher(query)
       .withBatchSize(5000)
       .onUrisReady( batch -> uris.addAll(Arrays.asList(batch.getItems())) )
       .onQueryFailure(exception -> exception.printStackTrace());
     JobTicket getUrisTicket = dataMovementManager.startJob(getUris);
     getUris.awaitCompletion();
     dataMovementManager.stopJob(getUrisTicket);

     // now we have the uris, let's step through them
     QueryBatcher performDelete = dataMovementManager.newQueryBatcher(uris.iterator())
       .onUrisReady(new DeleteListener())
       .onQueryFailure(exception -> exception.printStackTrace());
     JobTicket ticket = dataMovementManager.startJob(performDelete);
     performDelete.awaitCompletion();
     dataMovementManager.stopJob(ticket);
```
To queue uris to disk (if not enough memory is available) see UrisToWriterListener.

Method Summary

Modifier and Type	Method and Description
`boolean`	`awaitCompletion()` Blocks until the job is complete.
`boolean`	`awaitCompletion(long timeout, java.util.concurrent.TimeUnit unit)` Blocks until the job is complete.
`int`	`getDefaultDocBatchSize()` Returns defaultDocBatchSize, which is calculated according to server status
`int`	`getDocToUriBatchRatio()` Returns the docToUriBatchRatio set on this QueryBatcher.
`JobTicket`	`getJobTicket()` After the job has been started, returns the JobTicket generated when the job was started.
`long`	`getMaxBatches()` Returns the maximum number of Batches for the current job.
`int`	`getMaxDocToUriBatchRatio()` Returns maxDocToUriBatchRatio, which is calculated according to server status
`int`	`getMaxUriBatchSize()` Returns maxUriBatchSize, which is calculated according to server status
`QueryFailureListener[]`	`getQueryFailureListeners()` Get the array of QueryFailureListener instances registered via onQueryFailure including the HostAvailabilityListener registered by default.
`QueryBatcherListener[]`	`getQueryJobCompletionListeners()` Get the array of QueryBatcherListener instances registered via onJobCompletion.
`java.lang.Long`	`getServerTimestamp()` If `withConsistentSnapshot` was used before starting the job, will return the MarkLogic server timestamp associated with the snapshot.
`QueryBatchListener[]`	`getUrisReadyListeners()` Get the array of QueryBatchListener instances registered via onUrisReady.
`boolean`	`isStopped()` true if the job is terminated (last batch was finished or `DataMovementManager.stopJob` was called), false otherwise
`QueryBatcher`	`onJobCompletion(QueryBatcherListener listener)` Add a listener to run when the Query job is completed, i.e. when all the document URIs are retrieved and the associated listeners are completed.
`QueryBatcher`	`onQueryFailure(QueryFailureListener listener)` Add a listener to run each time there is an exception retrieving a batch of uris.
`QueryBatcher`	`onUrisReady(QueryBatchListener listener)` Add a listener to run each time a batch of uris is ready.
`void`	`retry(QueryEvent queryEvent)` Retry in the same thread to query a batch that failed.
`void`	`retryListener(QueryBatch batch, QueryBatchListener queryBatchListener)` Retries processing the listener to the batch of URIs, when the batch has been successfully retrieved from the server but applying the listener on the batch failed.
`void`	`retryWithFailureListeners(QueryEvent queryEvent)` Retry in the same thread to query a batch that failed.
`void`	`setMaxBatches()` Caps the query at the current batch.
`void`	`setMaxBatches(long maxBatches)` Sets the limit for the maximum number of batches that can be collected.
`void`	`setQueryFailureListeners(QueryFailureListener... listeners)` Remove any existing QueryFailureListener instances registered via onQueryFailure including the HostAvailabilityListener registered by default and replace them with the provided listeners.
`void`	`setQueryJobCompletionListeners(QueryBatcherListener... listeners)` Remove any existing QueryBatcherListener instances registered via onJobCompletion and replace them with the provided listeners.
`void`	`setUrisReadyListeners(QueryBatchListener... listeners)` Remove any existing QueryBatchListener instances registered via onUrisReady and replace them with the provided listeners.
`QueryBatcher`	`withBatchSize(int docBatchSize)` Sets the number of documents processed in a batch.
`QueryBatcher`	`withBatchSize(int docBatchSize, int docToUriBatchRatio)` Sets the number of documents processed in a batch and the ratio of the document processing batch to the document uri collection batch.
`QueryBatcher`	`withConsistentSnapshot()` Specifies that matching uris should be retrieved as they were when this QueryBatcher job started.
`QueryBatcher`	`withForestConfig(ForestConfiguration forestConfig)` If the server forest configuration changes mid-job, it can be re-fetched with `DataMovementManager.readForestConfig()` then set via withForestConfig.
`QueryBatcher`	`withJobId(java.lang.String jobId)` Sets the unique id of the job to help with managing multiple concurrent jobs and start the job with the specified job id.
`QueryBatcher`	`withJobName(java.lang.String jobName)` Sets the job name.
`QueryBatcher`	`withThreadCount(int threadCount)` Sets the number of threads added to the internal thread pool for this instance to use for retrieving or processing batches of uris.

Methods inherited from interface com.marklogic.client.datamovement.Batcher
getBatchSize, getForestConfig, getJobEndTime, getJobId, getJobName, getJobStartTime, getPrimaryClient, getThreadCount, isStarted

- Method Detail
  - onUrisReady
```
QueryBatcher onUrisReady(QueryBatchListener listener)
```
    Add a listener to run each time a batch of uris is ready.
    
    Parameters:
    
    listener - the action which has to be done when uris are ready
    
    Returns:
    
    this instance for method chaining
  - onQueryFailure
```
QueryBatcher onQueryFailure(QueryFailureListener listener)
```
    Add a listener to run each time there is an exception retrieving a batch of uris.
    
    These listeners will not run when an exception is thrown by a listener registered with onUrisReady. To learn more, please see Handling Exceptions in Listeners.
    
    Parameters:
    
    listener - the code to run when a failure occurs
    
    Returns:
    
    this instance for method chaining
  - onJobCompletion
```
QueryBatcher onJobCompletion(QueryBatcherListener listener)
```
    Add a listener to run when the Query job is completed, i.e. when all the document URIs are retrieved and the associated listeners are completed.
    
    Parameters:
    
    listener - the code to run when the Query job is completed
    
    Returns:
    
    this instance for method chaining
  - retry
```
void retry(QueryEvent queryEvent)
```
    Retry in the same thread to query a batch that failed. This method will throw an Exception if it fails again, so it can be wrapped in a try-catch block.
    
    Parameters:
    
    queryEvent - the information about the batch that failed
  - getUrisReadyListeners
```
QueryBatchListener[] getUrisReadyListeners()
```
    Get the array of QueryBatchListener instances registered via onUrisReady.
    
    Returns:
    
    the QueryBatchListener instances this batcher is using
  - getQueryJobCompletionListeners
```
QueryBatcherListener[] getQueryJobCompletionListeners()
```
    Get the array of QueryBatcherListener instances registered via onJobCompletion.
    
    Returns:
    
    the QueryBatcherListener instances this batcher is using
  - getQueryFailureListeners
```
QueryFailureListener[] getQueryFailureListeners()
```
    Get the array of QueryFailureListener instances registered via onQueryFailure including the HostAvailabilityListener registered by default.
    
    Returns:
    
    the QueryFailureListener instances this batcher is using
  - setUrisReadyListeners
```
void setUrisReadyListeners(QueryBatchListener... listeners)
```
    Remove any existing QueryBatchListener instances registered via onUrisReady and replace them with the provided listeners.
    
    Parameters:
    
    listeners - the QueryBatchListener instances this batcher should use
  - setQueryFailureListeners
```
void setQueryFailureListeners(QueryFailureListener... listeners)
```
    Remove any existing QueryFailureListener instances registered via onQueryFailure including the HostAvailabilityListener registered by default and replace them with the provided listeners.
    
    Parameters:
    
    listeners - the QueryFailureListener instances this batcher should use
  - setQueryJobCompletionListeners
```
void setQueryJobCompletionListeners(QueryBatcherListener... listeners)
```
    Remove any existing QueryBatcherListener instances registered via onJobCompletion and replace them with the provided listeners.
    
    Parameters:
    
    listeners - the QueryBatcherListener instances this batcher should use
  - withConsistentSnapshot
```
QueryBatcher withConsistentSnapshot()
```
    Specifies that matching uris should be retrieved as they were when this QueryBatcher job started. This enables a point-in-time query so that the set of uri matches is as it was at that point in time. This requires that the server be configured to allow such queries by setting the [merge timestamp][] to a timestamp before the job starts or a sufficiently large negative value. This should only be used when the QueryBatcher is constructed with a query, not with an Iterator. This is required when performing a delete of documents matching the query or any modification (including ApplyTransformListener) of matching documents which would cause them to no longer match the query (otherwise pagination through the result set would fail because pages shift as documents are deleted or modified to no longer match the query).
    
    Returns:
    
    this instance for method chaining
  - withForestConfig
```
QueryBatcher withForestConfig(ForestConfiguration forestConfig)
```
    If the server forest configuration changes mid-job, it can be re-fetched with DataMovementManager.readForestConfig() then set via withForestConfig.
    
    Specified by:
    
    withForestConfig in interface Batcher
    
    Parameters:
    
    forestConfig - the updated ForestConfiguration
    
    Returns:
    
    this instance for method chaining
  - withJobName
```
QueryBatcher withJobName(java.lang.String jobName)
```
    Sets the job name. Eventually, this may become useful for seeing named jobs in Ops Director.
    
    Specified by:
    
    withJobName in interface Batcher
    
    Parameters:
    
    jobName - the name you would like to assign to this job
    
    Returns:
    
    this instance for method chaining
  - withJobId
```
QueryBatcher withJobId(java.lang.String jobId)
```
    Sets the unique id of the job to help with managing multiple concurrent jobs and start the job with the specified job id.
    
    Specified by:
    
    withJobId in interface Batcher
    
    Parameters:
    
    jobId - the unique id you would like to assign to this job
    
    Returns:
    
    this instance for method chaining
  - withBatchSize
```
QueryBatcher withBatchSize(int docBatchSize)
```
    Sets the number of documents processed in a batch.
    
    Specified by:
    
    withBatchSize in interface Batcher
    
    Parameters:
    
    docBatchSize - the number of documents processed in a batch
    
    Returns:
    
    this instance for method chaining
  - withBatchSize
```
QueryBatcher withBatchSize(int docBatchSize,
                           int docToUriBatchRatio)
```
    Sets the number of documents processed in a batch and the ratio of the document processing batch to the document uri collection batch. For example, if docBatchSize is 100 and docToUriBatchRatio is 5, the document processing batch size is 100 and the document URI collection batch is 500.
    
    Parameters:
    
    docBatchSize - the number of documents processed in a batch
    
    docToUriBatchRatio - the ratio of the document processing batch to the document uri collection batch. The docToUriBatchRatio should ordinarily be larger than 1 because URIs are small relative to full documents and because collecting URIs from indexes is ordinarily faster than processing documents.
    
    Returns:
    
    this instance for method chaining
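
    The arithmetic in the example above can be written out as a small plain-Java sketch. The class and method names are hypothetical, and the way a collected URI batch is cut into processing batches is an assumption made for illustration rather than a description of the library's internals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating docBatchSize and docToUriBatchRatio.
class BatchSizeRatioSketch {

    // URI collection batch size = document processing batch size * ratio.
    static int uriBatchSize(int docBatchSize, int docToUriBatchRatio) {
        if (docBatchSize < 1 || docToUriBatchRatio < 1) {
            throw new IllegalArgumentException("both values must be >= 1");
        }
        return Math.multiplyExact(docBatchSize, docToUriBatchRatio);
    }

    // Assumption: one collected URI batch is cut into consecutive processing batches.
    static <T> List<List<T>> split(List<T> uriBatch, int docBatchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < uriBatch.size(); i += docBatchSize) {
            int end = Math.min(i + docBatchSize, uriBatch.size());
            out.add(new ArrayList<>(uriBatch.subList(i, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        int uriBatch = uriBatchSize(100, 5);
        System.out.println(uriBatch); // prints 500

        List<Integer> collected = new ArrayList<>();
        for (int i = 0; i < uriBatch; i++) {
            collected.add(i);
        }
        System.out.println(split(collected, 100).size()); // prints 5
    }
}
```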
  - getDocToUriBatchRatio
```
int getDocToUriBatchRatio()
```
    Returns the docToUriBatchRatio set on this QueryBatcher.
    
    Returns:
    
    docToUriBatchRatio
  - getDefaultDocBatchSize
```
int getDefaultDocBatchSize()
```
    Returns defaultDocBatchSize, which is calculated according to server status
    
    Returns:
    
    defaultDocBatchSize
  - getMaxUriBatchSize
```
int getMaxUriBatchSize()
```
    Returns maxUriBatchSize, which is calculated according to server status
    
    Returns:
    
    maxUriBatchSize
  - getMaxDocToUriBatchRatio
```
int getMaxDocToUriBatchRatio()
```
    Returns maxDocToUriBatchRatio, which is calculated according to server status
    
    Returns:
    
    maxDocToUriBatchRatio
  - withThreadCount
```
QueryBatcher withThreadCount(int threadCount)
```
    Sets the number of threads added to the internal thread pool for this instance to use for retrieving or processing batches of uris.
    
    For queries, these threads both retrieve and process batches. One batch per forest is queued immediately, then subsequent batches per forest are only queued after each previous batch is retrieved. This means that more threads than the number of forests are likely to be beneficial only when time is spent in the listeners registered with onUrisReady, for example if ApplyTransformListener, DeleteListener, ExportListener, or ExportToWriterListener are used, since each of these makes additional requests to the server.
    
    For Iterators, the main thread (the one calling startJob) is used to queue all batches, so startJob will not return until all iteration is complete and all batches are queued. The thread count is then the number of threads used for processing the queued batches (running processEvent on the listeners registered with onUrisReady).
    
    As of the 6.2.0 release, the thread count can be adjusted after the batcher has been started. The underlying Java ThreadPoolExecutor will have both its core and max pool sizes set to the given thread count. Use caution when reducing this to a value of 1 while the batcher is running; in some cases, the underlying ThreadPoolExecutor may halt execution of any tasks. Execution can be resumed by increasing the thread count to a value of 2 or higher.
    
    Specified by:
    
    withThreadCount in interface Batcher
    
    Parameters:
    
    threadCount - the number of threads to use in this Batcher
    
    Returns:
    
    this instance for method chaining
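
    For readers unfamiliar with resizing a running ThreadPoolExecutor, the sketch below shows one way to set both core and max pool sizes to the same value using only JDK calls. It is not the library's code: `PoolResizeSketch` and `resize` are hypothetical names. The order of the two setter calls matters because the JDK rejects a maximum below the current core size (and, on Java 9 and later, a core size above the current maximum).

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: set core and max pool size to the same thread count.
class PoolResizeSketch {

    static void resize(ThreadPoolExecutor pool, int threadCount) {
        if (threadCount < 1) {
            throw new IllegalArgumentException("threadCount must be >= 1");
        }
        if (threadCount > pool.getMaximumPoolSize()) {
            // Growing: raise the ceiling first so the core size never exceeds it.
            pool.setMaximumPoolSize(threadCount);
            pool.setCorePoolSize(threadCount);
        } else {
            // Shrinking (or unchanged): lower the core size first, then the ceiling.
            pool.setCorePoolSize(threadCount);
            pool.setMaximumPoolSize(threadCount);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        resize(pool, 8);
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize()); // prints 8/8
        resize(pool, 2);
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize()); // prints 2/2
        pool.shutdown();
    }
}
```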
  - awaitCompletion
```
boolean awaitCompletion()
```
    Blocks until the job is complete.
    
    Returns:
    
    true if the job completed without InterruptedException, false if InterruptedException was thrown while waiting
  - awaitCompletion
```
boolean awaitCompletion(long timeout,
                        java.util.concurrent.TimeUnit unit)
                 throws java.lang.InterruptedException
```
    Blocks until the job is complete.
    
    Parameters:
    
    timeout - the maximum time to wait
    
    unit - the time unit of the timeout argument
    
    Returns:
    
    true if the job completed without timing out, false if we hit the time limit
    
    Throws:
    
    java.lang.InterruptedException - if interrupted while waiting
  - isStopped
```
boolean isStopped()
```
    true if the job is terminated (last batch was finished or DataMovementManager.stopJob was called), false otherwise
    
    Specified by:
    
    isStopped in interface Batcher
    
    Returns:
    
    true if the job is terminated (last batch was finished or DataMovementManager.stopJob was called), false otherwise
  - getJobTicket
```
JobTicket getJobTicket()
```
    After the job has been started, returns the JobTicket generated when the job was started.
    
    Specified by:
    
    getJobTicket in interface Batcher
    
    Returns:
    
    the JobTicket generated when this job was started
    
    Throws:
    
    java.lang.IllegalStateException - if this job has not yet been started
  - retryListener
```
void retryListener(QueryBatch batch,
                   QueryBatchListener queryBatchListener)
```
    Retries processing the listener to the batch of URIs, when the batch has been successfully retrieved from the server but applying the listener on the batch failed.
    
    Parameters:
    
    batch - the QueryBatch for which we need to process the listener
    
    queryBatchListener - the QueryBatchListener which needs to be applied
  - retryWithFailureListeners
```
void retryWithFailureListeners(QueryEvent queryEvent)
```
    Retry in the same thread to query a batch that failed. If it fails again, all the failure listeners registered with this batcher via onQueryFailure are run. Note: use this method with caution, as it can cause an infinite loop. If a failure listener calls this method and the batch fails again, the failure listeners run again and retry again, and so on until the batch succeeds.
    
    Parameters:
    
    queryEvent - the information about the batch that failed
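
    One way to guard against the infinite loop described above is to cap the number of retries per batch. The following plain-Java sketch shows such a cap; `RetryBudget`, `tryAcquire`, and `simulateAlwaysFailing` are hypothetical names that are not part of the MarkLogic API, and keying the counter by a numeric batch identifier is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: allows at most maxRetries retries per batch.
class RetryBudget {
    private final int maxRetries;
    private final Map<Long, AtomicInteger> attempts = new ConcurrentHashMap<>();

    RetryBudget(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.maxRetries = maxRetries;
    }

    // Returns true if the batch with this number may be retried one more time.
    boolean tryAcquire(long batchNumber) {
        int used = attempts
            .computeIfAbsent(batchNumber, k -> new AtomicInteger())
            .incrementAndGet();
        return used <= maxRetries;
    }

    // Models a batch that fails on every attempt: counts retries until the cap is hit.
    static int simulateAlwaysFailing(RetryBudget budget, long batchNumber) {
        int retries = 0;
        while (budget.tryAcquire(batchNumber)) {
            retries++; // a real listener would retry the batch here
        }
        return retries;
    }

    public static void main(String[] args) {
        RetryBudget budget = new RetryBudget(3);
        System.out.println(simulateAlwaysFailing(budget, 42L)); // prints 3
        System.out.println(budget.tryAcquire(42L));             // prints false
        System.out.println(budget.tryAcquire(7L));              // prints true
    }
}
```

    In a failure listener, a check like `tryAcquire` would wrap the call to retryWithFailureListeners, so that a batch which keeps failing is eventually given up on instead of being retried forever.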
  - setMaxBatches
```
void setMaxBatches(long maxBatches)
```
    Sets the limit for the maximum number of batches that can be collected.
    
    Parameters:
    
    maxBatches - the maximum number of batches to collect
  - setMaxBatches
```
void setMaxBatches()
```
    Caps the query at the current batch.
  - getMaxBatches
```
long getMaxBatches()
```
    Returns the maximum number of Batches for the current job.
    
    Returns:
    
    the maximum number of Batches that can be collected.
  - getServerTimestamp
```
java.lang.Long getServerTimestamp()
```
    If withConsistentSnapshot was used before starting the job, will return the MarkLogic server timestamp associated with the snapshot. Returns null otherwise.
    
    Returns:
    
    the timestamp or null
