Editing a Custom Step Module
Custom Step Modules
A custom step allows you to add your own custom functionality anywhere in the flow sequence.
To perform specialized tasks in Data Hub, you can create a custom step that calls a custom step module. If created by any of the following methods, the step module contains instructions to help you customize it.
Required Inputs
Your module must accept the following as input:
content object. A JSON object that contains everything that you want to process. If the step property acceptsBatch is true, you can pass an array of Content objects, one per document or record.

Each Content object consists of the following:

Component | Value by default | Value if the step property sourceQueryIsScript is true |
---|---|---|
content.uri | The URI of the document or record to process. | The items returned by the code specified in the step property sourceQuery. |
content.value | The document associated with the URI specified in content.uri. | Nothing. |
content.context | All the metadata associated with the document or record (found in the database, but not included in the envelope). | Nothing. |

Note: If ingesting using Hub Central, QuickStart, Gradle, or the REST API, content.context is not available and, therefore, the associated metadata cannot be modified.

By default, content.context includes the following:
- originalCollections. An array of strings, which are the collection tags already associated with the document.
- collections. An array of strings, which are the collection tags to associate with the document.
- permissions. An array of objects, which are the permissions required to access the processed document after it is saved to the database. Each object includes the properties roleId and capability.
- metadata. Key-and-value pairs containing metadata to be persisted with the document.
- Additional properties, if ingesting using MLCP.

options object. A JSON object containing properties from the following sources:
- The options parameter specified, if any, when using Gradle or the Data Hub Client JAR to run the step that calls the custom module.
- The options object from the step that calls the custom module.
- The options object from the flow containing the step that calls the custom module.
- The options object from the step definition associated with the step instance that calls the custom module.

If the same properties occur in two or more sources, the source that is higher in the list takes precedence.

Note: The step configuration in Hub Central format does not include the options object. If you run a Hub Central-formatted step, the appropriate properties are automatically copied to an internal options object, which is handled the same way as the options object in the QuickStart format.
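The precedence rule for the options object can be pictured as a merge in which sources listed earlier win. The sketch below is only an illustration in plain JavaScript: mergeOptions and every property name in the sample objects are hypothetical, and it assumes a shallow merge. Data Hub assembles the options object itself before calling your module, so you would not normally write this code.

```javascript
// Hypothetical helper, not a Data Hub function: combines option sources
// that are passed in order from highest to lowest precedence.
function mergeOptions(...sourcesHighToLow) {
  // Object.assign lets later arguments overwrite earlier ones, so the
  // sources are applied in reverse (lowest precedence first).
  return Object.assign({}, ...sourcesHighToLow.slice().reverse());
}

// Sample values, assumed for this sketch only.
const runtimeOptions = { maxRetries: 1 };                        // Gradle or Client JAR parameter
const stepOptions = { maxRetries: 3, targetCollection: "customers" };
const flowOptions = { targetCollection: "default", label: "flow" };
const stepDefinitionOptions = { maxRetries: 5, label: "definition", region: "EU" };

const options = mergeOptions(
  runtimeOptions,
  stepOptions,
  flowOptions,
  stepDefinitionOptions
);

console.log(options);
```

In this sample, maxRetries comes from the runtime parameter, targetCollection from the step, label from the flow, and region from the step definition, because each is taken from the highest source that defines it.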
Required Outputs
Your module must return the following:
- For a Custom-Ingestion step, a Content object.
- For all other custom steps, a Content object or an array of Content objects.

Each Content object must contain the processed data to be saved to the database and must consist of the following:
Component | Value |
---|---|
content.uri | The URI of the document to save to the database. If a document with the same URI already exists, it will be overwritten. |
content.value | The document to save at the URI specified by content.uri. |
content.context | All the metadata to associate with the document or record. |
content.provenance | (Optional) Additional property-level provenance information to store if the provenance granularity is set to fine. |
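
The input and output requirements can be sketched as a single function. This example assumes the module's entry point is a function named main that receives content and options; the status property and the processedCollection option are hypothetical names used only for illustration. It is plain JavaScript with no Data Hub library calls, and it assumes content.value is already a plain object, which may not hold in MarkLogic, where the value may need converting first.

```javascript
// Sketch of a custom step entry point (assumed name: main).
// Accepts one Content object, or an array of them when acceptsBatch is true.
function main(content, options) {
  const isBatch = Array.isArray(content);
  const inputs = isBatch ? content : [content];

  const outputs = inputs.map((item) => {
    // Copy the incoming document and add a hypothetical property.
    // Assumes item.value is a plain JavaScript object.
    const value = Object.assign({}, item.value, { status: "processed" });

    // Copy the incoming metadata; content.context can be missing.
    const context = Object.assign({}, item.context);
    if (options && options.processedCollection) {
      context.collections = (context.collections || []).concat(
        options.processedCollection
      );
    }

    // Each returned Content object carries uri, value, and context.
    return { uri: item.uri, value: value, context: context };
  });

  // Return the same shape that was received: one object or an array.
  return isBatch ? outputs : outputs[0];
}
```

In a deployed module, the function would typically be exposed with module.exports so that the step can call it. Note that a Custom-Ingestion step must return a single Content object, so the array branch applies only to other custom steps.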
Provenance in a Custom Step
You can choose to track property-level provenance information, in addition to the default document-level provenance information. See Set Provenance Granularity Manually.
In a custom step, you can also specify which property-level provenance information is tracked. To do so:
- The Content objects returned by your custom module must have a content.provenance component.
- content.provenance must contain the properties that you want to track and their values.
- The value of content.provenance must be in the following format. Data Hub will convert it to the PROV-XML schema before storing it in the JOBS database.

    {
      "<originalURI>": {
        "<originalXPathOrPropertyName>": {
          "destination": "<XPathOrPropertyInNewDocument>",
          "value": "<newValue>"
        }
      }
    }

Example 1: If you mapped the lastName property to the surName property, you can set content.provenance to the following:

    {
      "/26451baa-fe14-471f-bd77-364ac3f64c82.json": {
        "lastName": {
          "destination": "surName",
          "value": "Smith"
        }
      }
    }

Example 2: If your custom module pulled information from multiple documents into the current one, you can combine the provenance information of the source documents into a single content.provenance.

    {
      "/26451baa-fe14-471f-bd77-364ac3f64c82.json": {
        "lastName": {
          "destination": "surName",
          "value": "Smith"
        }
      },
      "/5455fd37-6d96-4883-9349-8e79fa700145.json": {
        "firstName": {
          "destination": "givenName",
          "value": "John"
        }
      }
    }

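
A module that maps several properties could assemble this structure from a list of mappings instead of writing it by hand. The following is a sketch only: buildProvenance and the fields of its input records (sourceUri, sourceProperty, destination, value) are hypothetical and are not part of Data Hub. The sample data reproduces Example 2.

```javascript
// Hypothetical helper: groups property mappings by source URI, producing
// an object in the content.provenance format.
function buildProvenance(mappings) {
  const provenance = {};
  for (const m of mappings) {
    // Create the per-document entry the first time a URI is seen.
    if (!provenance[m.sourceUri]) {
      provenance[m.sourceUri] = {};
    }
    provenance[m.sourceUri][m.sourceProperty] = {
      destination: m.destination,
      value: m.value
    };
  }
  return provenance;
}

const provenance = buildProvenance([
  {
    sourceUri: "/26451baa-fe14-471f-bd77-364ac3f64c82.json",
    sourceProperty: "lastName",
    destination: "surName",
    value: "Smith"
  },
  {
    sourceUri: "/5455fd37-6d96-4883-9349-8e79fa700145.json",
    sourceProperty: "firstName",
    destination: "givenName",
    value: "John"
  }
]);

console.log(JSON.stringify(provenance, null, 2));
```

The result would then be assigned to the provenance component of the Content object that the module returns.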
If content.provenance is not in the Content objects returned by your custom module and granularity is set to fine for the step, only the default document-level provenance information will be tracked (same as coarse). No error is thrown.
Best Practices
- Although you can code your custom module in XQuery, MarkLogic recommends using JavaScript.
- Use the DataHub object, which gives you access to the Data Hub libraries. For example, the DataHub object can generate an envelope around an XML or a JSON document.
- You can handle errors in two ways:
  - If you are using an orchestration application (e.g., NiFi),
    - You can throw an error inside the module. Every thrown error is reported back to the orchestrator, where it is logged with the URI of the document that failed.
    - In another step (which can also be in another flow), you can search the orchestrator log for documents with a specific error and fix them accordingly.
  - Instead of throwing an error,
    - You can add a special collection tag to the document that failed.
    - In another step, you can search for documents with that collection tag and fix them accordingly.
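
The second approach, tagging a failed document instead of throwing, might look like the sketch below. The names processSafely and requireLastName and the collection name processing-failed are hypothetical, and the code is plain JavaScript that assumes content.value is a plain object. It also relies on content.context being writable, which is not the case for every ingestion method.

```javascript
// Hypothetical wrapper: runs a transform on one Content object and, on
// failure, tags the document with a collection instead of throwing.
function processSafely(item, transform, errorCollection) {
  try {
    return transform(item);
  } catch (err) {
    // Copy the metadata and append the error collection tag.
    const context = Object.assign({}, item.context);
    context.collections = (context.collections || []).concat(errorCollection);
    // Return the document unchanged so that a later step can find it by
    // its collection tag and repair it.
    return { uri: item.uri, value: item.value, context: context };
  }
}

// Example transform, assumed for this sketch: requires a lastName property.
function requireLastName(item) {
  if (!item.value || !item.value.lastName) {
    throw new Error("Missing lastName in " + item.uri);
  }
  return item;
}

const ok = processSafely(
  { uri: "/ok.json", value: { lastName: "Smith" }, context: { collections: ["customers"] } },
  requireLastName,
  "processing-failed"
);
const failed = processSafely(
  { uri: "/bad.json", value: {}, context: { collections: ["customers"] } },
  requireLastName,
  "processing-failed"
);

console.log(ok.context.collections, failed.context.collections);
```

For the first approach, the same transform would be called without the wrapper, so that the thrown error reaches the orchestrator.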