Loading TOC...
Node.js Application Developer's Guide (PDF)

Node.js Application Developer's Guide — Chapter 11

Data Movement in the Node.js API

Node.js APIs can be used for data movement either with the out-of-the-box endpoints of the REST API or with a Data Service endpoint for bulk input and/or output.

Concurrency and Large Data Sets in Node.js

Node.js provides concurrency through multiple waits for IO responses, instead of multiple threads. This strategy avoids the challenges and risks of multi-threaded programming for middle-tier clients that are IO rather than compute intensive.

In the Node.js API, that general architectural principle of Node.js means that better throughput for large data sets requires multiple concurrent requests to each e-node (instead of serial requests to one e-node).

In other words, while the Node.js client is submitting a request or processing a response, many other pending requests are waiting for the server to respond. Because the round trip over the network is typically much more expensive (especially in the cloud), the client can typically submit many requests, and/or process many responses, during the time required for a single round trip to the server.

This section summarizes the high level concepts for data movement using Node.js.

Optimal Concurrency

The optimal level of concurrency would provide full utilization of both the client and server.

From the client perspective, a new response becomes available to process at the moment that submitting a new request finishes. In essence, neither the single Node.js thread nor the request submitting or response processing routines ever wait. From the server perspective, thread and memory consumption is at the sweet spot, with allowance for other requests to the server. Clients should avoid exceeding the optimum concurrency level for either client or server.

Detection of Server Factors

To determine the appropriate level of concurrency, the client must become aware of server capacity, as reflected by the number of hosts for the database, the number of threads available on those hosts, and (for query management) the number of forests. The Node.js API objects for data movement calls the internal endpoints to inspect server state during initialization.

IO With Node.js Streams

Node.js provides streams as the standard representation for large data sets. Conforming to this standard, the Node.js API data movement functions are factories that return:

  1. An object of type stream.Writable to the application for sending request input to the server.
  2. An object of type stream.Readable to the application for receiving response output from the server.

Typically, the Node.js API reads the input stream repeatedly, accumulating a batch of data in a buffer, and then making a batch request when the buffer is full or the input stream ends.

Typically, the Node.js API writes each item in a response batch separately to the output stream, and ends the output stream when the last response has been processed.

Where the data movement requires both input and output, the factory function returns: a duplex stream to the application, for sending request input to the server, and receiving response output from the server. By doing this, the client implementation in the Node.js API has a readable stream for receiving the request input from the application and a writable stream for sending the response output to the application.

Data Movement Functions

When using multiple data movement functions in a pipeline to handle special cases (instead of the provided conveniences), the application has the responsibility for configuring each function to share the available client and server concurrency.

Data movement functions take options such as - the batch size, the number of concurrent requests per forest or host, success and/or error callbacks for the batch. Each data movement function maintains state for its operations (similar to the use of the Operation object for single-request calls).

When processing data in memory, clients need to work with request or response data as JavaScript in-memory objects. Alternatively when, dispatching request or response data from other sources or sinks (such as other databases), clients can achieve better throughput by working with request or response data as JavaScript strings or buffers.

In particular, most of the existing request functions return a ResultProvider, to let the application choose whether to get the response data as a Node.js Promise or as a Node.js Stream.

By contrast, a data movement function must:

  1. Write response data to the output stream
  2. Execute an application callback to determine the disposition of any error on a batch request

Node-client-api - 2.8.0

These features are introduced with the 2.8.0 release of the Node client APIs.

Ingesting Documents using - writeAll API

The Node.js API documents object adds a writeAll function equivalent to the DMSDK WriteBatcher with the following signature:

writeAll (options)

The parameters:

Parameter Data Type Purpose
options JavaScript literal object Configures the write operation

The return value:

A stream. Writable in object mode that receives document descriptor input from the application for writing to the database. The properties of the options object:

Properties Data Type Purpose
onBatchSuccess function(progress, documents)

Notifying other systems about batches written successfully.

Takes parameters of:

  • A progess object with properties for the running totals of documents written and the elapsed time
  • An array with the document descriptors for the written batch

Any return value is ignored.

onBatchError function(progress, documents, error)

Responding to any error while writing a batch of documents.

Takes parameters of:

  • A progress object with properties for the running totals of documents written and failed, the elapsed time, and the number of failed attempts for the current batch
  • An array with the document description descriptors for the batch that failed to write
  • The errror that occurred.

Controls the resolution of the error by:

  • Returning a non-empty array of document descriptors to try to write as this batch
  • Returning null or an empty array to skip this batch
  • Throwing an error to stop reading the input stream and writing documents; when the last request finishes, finalize the job
onCompletion function(progress, documents)

Providing a summary of the results.

Takes parameters of:

  • A summary object with properties for the total documents written and failed, the elapsed time, and whether processing was cancelled by an error
batchSize int

The number of documents in each batch.

Defaults to the number recommended by Performance and Engineering (possibly 200).

concurrentRequests JavaScript object literal

Controls the maximum number of pending concurrent requests.

Specified as a multiple of either forests or hosts with the following properties:

  • multipleOf - either forests or hosts
  • multiplier - an int

Defaults to the setting recommended by Performance Engineering (possibly 4 per forest).

defaultMetadata JavaScript object literal

An object with any or all of the metatdata properties from the document descriptor including collections, permissions, properties, quality, and metadataValues.

If supplied, the default metadata is passed as the first item in every batch.

transform transformSpecification The same as the transform parameter of the documents.write()

Examples and JavaScript Docs

An example for using the writeAll API has been added to the GitHub examples folder on node-client-api - https://github.com/marklogic/node-client-api/blob/develop/examples/writeAll-documents.js

JavaScript docs - https://docs.marklogic.com/jsdoc/documents.html#writeAll

Node-client-api - 2.9.0

These APIs are introduced with the 2.9.0 release of the Node client APIs.

Collecting Document uris - queryAll API

The Node.js API documents object adds a queryAll() function, equivalent to the DMSDK QueryBatcher, with the following signature:

queryAll(query, options)

The parameters:

Parameter Data Type Purpose
query ctsQuery object a query built by the ctsQuery Builder
options JavaScript literal object configures the query operation

The return value: a stream.Readable that sends document URI output read from the database to the application in string mode or (for arrays of strings) object mode.

The properties of the options object:

Properties Data Type Purpose
onCompletion function(summary)

Providing a summary of the results.

Takes parameters of:

  • A summary object with properties for the total document URIs read, the total number of failed requests, the elapsed time, the timestamp of the snapshot (if set), and whether processing was cancelled by an error.
batchSize int

Controls whether the output stream is

  • a string streamof document URIs; specified with a batchSize of 1 (the default)
  • an object stream of document URIs arrays of the batchSize

Streaming arrays of strings will likely be more performant when piping the output to another data movement function, in part because each request will read from a single forest.

Throws an error if larger than the maximum batch size (possibly 100,000) as determined by Performance Engineering

queryBatchMultiple int

The number of output stream batches to collect in each query.

Defaults based on the configured batchSize and the optimal query batch size (possibly 1000) as recommended by Performance Engineering:

  • if the batchSize is larger than the optimal size, uses the batchSize
  • otherwise, performs integer division of the optimal size by the batchSize
onInitialTimestamp function(timestamp)

Receives the timestamp from the first request.

If the consistentSnapshot option is not set to true, throws an error.

Takes parameters of:

  • The Timestamp object for the server timestamp for the result set.
consistentSnapshot boolean|Timestamp

Controls whether to get an immutable view of the result set:

  • false (the default) - each request gets the latest data at the time of the request
  • true - uses the timestamp of the first request for all subsequent requests
  • a Timestamp object - uses the supplied timestamp for all requests

The existing DatabaseClient.createTimestamp function creates a Timestamp object.

Examples and JavaScript Docs

An example for using the queryAll API has been added to the Github examples folder for node-client-api - https://github.com/marklogic/node-client-api/blob/develop/examples/queryAll-documents.js

MarkLogic Node.js Client API docs - https://docs.marklogic.com/jsdoc/documents.html

Exporting Documents - readAll API

The Node.js API documents object adds a readAll() function equivalent to the DMSDK ExportListener with the following signature:

readAll(options)

The parameters:

Parameter Data Type Purpose
options JavaScript literal object configures the read operation

The return value: a stream.Duplex that receives document URI input from the application in string mode or (for arrays of strings) object mode and sends document descriptors with content and/or the document URI as output to the application in object mode.

The properties of the options object:

Properties Data Type Purpose
onBatchError function(progress, uris, error)

Responding to any error while reading a batch of documents.

Takes parameters of:

  • A progress object with properties for the running totals of documents read and failed, the elapsed time, and the number of failed attempts for the current batch
  • An array with the document uris for the batch that failed to read
  • The error that occurred

Controls the resolution of the error by:

  • Returning a non-empty array of document URIs to try to read as this batch
  • Returning null or an empty array to skip this batch
  • Throwing an error to terminate the stream; when the last request finishes, finalize the job.
onCompletion function(summary)

Providing a summary of the results.

Takes parameters of:

  • A summary object with properties for the total documents read, the total number of failed requests, the elapsed time, the timestamp of the snapshot (if set), and whether processing was cancelled by an error.
inputKind 'string'|'array' Indicates whether the input stream should accept:
  • individual strings in string mode (the default)
  • arrays of strings in object mode
batchSize int

The number of documents in each request where the inputKind is 'string'

Defaults to the number recommended by Performance Engineering (possibly 250).

Throws an error if the inputKind is 'array'

concurrentRequests JavaScript object literal

Controls the maximum number of concurrent requests that can be pending at the same time.

Specified as a multiple of either forests or hosts with the following properties:

  • multipleOf = either "forests" or "hosts"
  • multiplier = an int

Defaults to the setting recommended by Performance Engineering (possibly 8 per forest).

categories categoriesSpecification

Specifies what data to retrieve from the server and write to the output stream.

The option has the same enumeration as the categories parameter of documents.read() - see https://docs.marklogic.com/jsdoc/documents.html#.categories.

transform transformSpecification The same as the transform parameter of documents.read()
consistentSnapshot boolean|Timestamp

Controls whether to get an immutable view of the result set:

  • false (the default) - each request gets the latest data at the time of the request
  • true - uses the timestamp of the first request for all subsequent requests
  • a Timestamp object - uses the supplied timestamp for all requests

The existing DatabaseClient.createTimestamp function creates a Timestamp object.

onInitialTimestamp function(timestamp)

Receives the timestamp from the first request.

If the consistentSnapshot option is not set to true, throws an error.

Takes parameters of:

  • The Timestamp object for the server timestamp for the result set.
outputStreamType 'chunked'| 'object' Same as the parameter of the existing ResultProvider.stream() function.

Examples and JavaScript Docs

An example for using the readAll API has been added to the Github examples folder for node-client-api: https://github.com/marklogic/node-client-api/blob/develop/examples/readAll-documents.js

MarkLogic Node.js Client API docs: https://docs.marklogic.com/jsdoc/documents.html

queryToReadAll

queryToReadAll - a convenience function that combines the query and read operations

The Node.js API documents object adds a queryToReadAll() convenience function that combines the query and read operations.

This function has the following signature:

queryToReadAll(query, options)

The parameters: the same as the queryAll() and readAll() functions.

The return value: a stream.Readable in object mode that returns document descriptors with the content and/or document uri as output to the application in object mode.

The properties of the options object are the same as those of the queryAll() function except:

  • the batchSize defaults to the readAll() default instead of 1

Example and JS Docs

An example for using the readAll API has been added to the examples folder on node-client-api: https://github.com/marklogic/node-client-api/blob/master/examples/queryToReadAll-documents.js

JS docs: https://docs.marklogic.com/jsdoc/documents.html#queryToReadAll

Node-client-api - 3.0.0

These APIs are introduced with the 3.0.0 release of the Node client APIs:

Reprocessing Documents - transformAll API

The Node.js API documents object adds a transformAll() function equivalent to the DMSDK ApplyTransformListener with the following signature:

transformAll(stream,options)

The parameters:

Parameter Data Type Purpose
*stream Stream Provides the input to the transformAll API.
options JavaScript literal object Configures the transform operation.

The return value: a stream.Writable that receives document URI input from the application in string mode or (for arrays of strings) object mode.

The properties of the options object:

Properties Data Type Purpose
onBatchSuccess function(progress, documents)

Notifying other systems about batches transformed successfully.

Takes parameters of:

  • A progress object with properties for the running totals of documents transformed and failed and the elapsed time
  • An array with the document URIs for the transformed batch

Any return value is ignored.

onBatchError function(progress, uris, error)

Responding to any error while transforming a batch of documents.

Takes parameters of:

  • A progress object with properties for the running totals of documents transformed and failed, the elapsed time, and the number of failed attempts for the current batch
  • An array with the document uris for the batch that failed to transform
  • The error that occurred

Controls the resolution of the error by:

  • Returning a non-empty array of document uris to try to transform as this batch
  • Returning null or an empty array to skip this batch
  • Throwing an error to stop reading the input stream and transforming documents; when the last request finishes, finalize the job
onCompletion function(summary)

Providing a summary of the results.

Takes parameters of:

  • A summary object with properties for the total documents transformed, the total number of failed requests, the elapsed time, and whether processing was cancelled by an error.
inputKind 'string'|'array' Indicates whether the input stream should accept
  • individual strings in string mode (the default)
  • arrays of strings in object mode
batchSize int

The number of documents in each request where the inputKind is 'string'

Defaults to the number recommended by Performance Engineering (possibly 100).

Throws an error if the inputKind is 'array'

concurrentRequests JavaScript object literal

Controls the maximum number of concurrent requests that can be pending at the same time.

Specified as a multiple of either forests or hosts with the following properties:

  • multipleOf = either "forests" or "hosts"
  • multiplier = an int

Defaults to the setting recommended by Performance Engineering (possibly 8 per forest).

transformStrategy 'replace'|'ignore'

Controls whether the output from the transform replaces the content at the document URI.

The ignore strategy allows the transform to persist output in the database without changing the original document.

transform transformSpecification The same as the transform parameter of documents.write()

Example and JS Docs for transformAll API

An example for using the transformAll API has been added to the examples folder on node-client-api: https://github.com/marklogic/node-client-api/blob/develop/examples/transformAll-documents.js

JS docs: https://docs.marklogic.com/jsdoc/documents.html

queryToTransformAll

The Node.js API documents object adds a queryToTransformAll() convenience function that combines the query and transform operations.

This function has the following signature:

queryToTransformAll(query, options)

The parameters: the same as the queryAll() and transformAll() functions.

The return value: None

The properties of the options object are the same as those of the queryAll() and transformAll() functions except:

  • the batchSize defaults to the transformAll() default instead of 1

Example and JS Docs for queryToTransformAll API

An example for using the queryToTransformAll API has been added to the examples folder on node-client-api: https://github.com/marklogic/node-client-api/blob/develop/examples/queryToTransformAll-documents.js

JS docs:https://docs.marklogic.com/jsdoc/documents.html

Deleting Documents - removeAllUris API

The Node.js API documents object adds a removeAllUris() function equivalent to the DMSDK DeleteListener with the following signature:

removeAllUris(stream,options)

The parameters:

Parameter Data Type Purpose
*stream Stream Provides the input to the removeAllUris API.
options JavaScript literal object Configures the remove operation.

The return value: a stream.Writable that receives document URI input from the application in string mode or (for arrays of strings) object mode.

The properties of the options object:

Properties Data Type Purpose
onBatchSuccess function(progress, documents)

Notifying other systems about batches transformed successfully.

Takes parameters of:

  • A progress object with properties for the running totals of documents transformed and failed and the elapsed time
  • An array with the document URIs for the transformed batch

Any return value is ignored.

onBatchError function(progress, uris, error)

Responding to any error while transforming a batch of documents.

Takes parameters of:

  • A progress object with properties for the running totals of documents transformed and failed, the elapsed time, and the number of failed attempts for the current batch
  • An array with the document uris for the batch that failed to transform
  • The error that occurred

Controls the resolution of the error by:

  • Returning a non-empty array of document uris to try to transform as this batch
  • Returning null or an empty array to skip this batch
  • Throwing an error to stop reading the input stream and transforming documents; when the last request finishes, finalize the job
onCompletion function(summary)

Providing a summary of the results.

Takes parameters of:

  • A summary object with properties for the total documents transformed, the total number of failed requests, the elapsed time, and whether processing was cancelled by an error.
inputKind 'string'|'array' Indicates whether the input stream should accept
  • individual strings in string mode (the default)
  • arrays of strings in object mode
batchSize int

The number of documents in each request where the inputKind is 'string'

Defaults to the number recommended by Performance Engineering (possibly 250).

Throws an error if the inputKind is 'array'

concurrentRequests JavaScript object literal

Controls the maximum number of concurrent requests that can be pending at the same time.

Specified as a multiple of either forests or hosts with the following properties:

  • multipleOf = either "forests" or "hosts"
  • multiplier = an int

Defaults to the setting recommended by Performance Engineering (possibly 8 per forest).

Example and JS Docs for removeAllUris API

An example for using the transformAll API has been added to the examples folder on node-client-api: https://github.com/marklogic/node-client-api/blob/develop/examples/removeAllUris-documents.js

JS docs: https://docs.marklogic.com/jsdoc/documents.html

queryToRemoveAll

The Node.js API documents object adds a queryToRemoveAll() convenience function that combines the queryAll and removeAllUris operations. This function has the following signature:

queryToRemoveAll(query, options)

The parameters: the same as the queryAll() and removeAllUris() functions

The return value: None

The properties of the options object are the same as those of the queryAll() and removeAllUris() functions except:

  • the batchSize defaults to the removeAllUris() default instead of 1

Example and JS Docs for queryToRemoveAll API

An example for using the queryToRemoveAll API has been added to the examples folder on node-client-api: https://github.com/marklogic/node-client-api/blob/develop/examples/queryToRemoveAll-documents.js

JS docs - https://docs.marklogic.com/jsdoc/documents.html

Exporting Rows - queryAll API

The Node.js API rows object adds a queryAll() function equivalent to the DMSDK RowBatcher with the following signature:

queryAll(batchView, options)

The parameters:

Parameter Data Type Purpose
batchView PlanBuilder.ModifyPlan|object|string A query that exports a modified subset of rows from a view expressed as any of:
  • A modify plan without parameters built by the PlanBuilder
  • A JavaScript literal object equivalent to a JSON AST representation of the modify plan
  • A string literal with the JSON AST or Query DSL representation of the modify plan
options JavaScript literal object Configures the query operation.

The return value:

a stream.Readable that sends rows read from the database to the application in the configured mode

The properties of the options object:

Properties Data Type Purpose
onBatchError function(progress, error)

Responding to any error while while reading a batch of rows.

Takes parameters of:

  • A progress object with properties for the running totals of rows read and failed requests, the elapsed time, and the number of failed attempts for the current request
  • The error that occurred

Controls the resolution of the error by:

  • Returning true to try to read this batch
  • Returning false to skip this batch
  • Throwing an error to terminate the stream; when the last request finishes, finalize the job
onCompletion function(summary)

Providing a summary of the results.

Takes parameters of:

  • A summary object with properties for the total rows read, the total number of failed requests, the elapsed time, and the timestamp of the snapshot (if set), and whether processing was cancelled by an error.
batchSize int

Controls the number of rows retrieved in each request

Streaming arrays of strings will likely be more performant when piping the output to another other data movement function, in part because each request will read from a single forest.

Throws an error if larger than the maximum batch size (possibly 100,000) as determined by Performance Engineering

concurrentRequests JavaScript object literal

Controls the maximum number of concurrent requests that can be pending at the same time.

Specified as a multiple of either forests or hosts with the following properties:

  • multipleOf = either "forests" or "hosts"
  • multiplier = an int

Defaults to the setting recommended by Performance Engineering (possibly 4 per forest).

consistentSnapshot boolean|Timestamp Controls whether to get an immutable view of the result set:
  • false (the default) - each request gets the latest data at the time of the request
  • true - uses the timestamp of the first request for all subsequent requests
  • a Timestamp object - uses the supplied timestamp for all requests
The existing DatabaseClient.createTimestamp() function creates a Timestamp object.
onInitialTimestamp function(timestamp)

Receives the timestamp from the first request.

If the consistentSnapshot option is not set to true, throws an error

Takes parameters of:

  • The Timestamp object for the server timestamp for the result set.
queryType 'json'|'dsl'

Identifies the representation for a string batch view.

Throws an error for a ModifyPlan or object batch view.

Similar to the RowsOptions.queryType property of the existing rows.query() function.

columnTypes 'rows'|'header' Same as the RowsOptions.columnTypes property of the existing rows.query() function.
rowStructure 'object'|'array' Same as the RowsOptions.structure property of the existing rows.query() function.
rowFormat 'json'|'xml'|'csv' Same as the RowsOptions.format property of the existing rows.query() function.
outputStreamType 'chunked'| 'object'|'sequence Same as the streamType parameter of the existing rows.queryAsStream() function.

Example and JS Docs

An example for using the queryAll API has been added to the examples folder on node-client-api - https://github.com/marklogic/node-client-api/blob/develop/examples/queryAll-rows.js

JS docs - https://docs.marklogic.com/jsdoc/rows.html

Signature Changes

writeAll -

The Node.js writeAll function adds an additional parameter and the new API will be as follows:

writeAll(stream, options)

The parameters:

Parameter Data Type Purpose
*stream Stream Provides the input to the writeAll API.
options JavaScript literal object Configures the write operation.

The return value and options parameter remains the same.

readAll -

The Node.js readAll function adds an additional parameter and the new API will be as follows:

readAll(stream, options)

The parameters:

Parameter Data Type Purpose
*stream Stream Provides the input to the readAll API.
options JavaScript literal object Configures the read operation.

The return value and options parameter remains the same.

« Previous chapter