Editing a Custom Step Module
Custom Step Modules
A custom step allows you to add your own custom functionality anywhere in the flow sequence.
To perform specialized tasks in Data Hub, you can create a custom step that calls a custom step module. If created by any of the following methods, the step module contains instructions to help you customize it.
Required Inputs
Your module must accept the following as input:
`content` object. A JSON object that contains everything that you want to process. If the step property `acceptsBatch` is `true`, you can pass an array of `Content` objects, one per document or record.

Each `Content` object consists of the following:

| Component | Value by default | Value if the step property `sourceQueryIsScript` is `true` |
|---|---|---|
| `content.uri` | The URI of the document or record to process. | The items returned by the code specified in the step property `sourceQuery`. |
| `content.value` | The document associated with the URI specified in `content.uri`. | Nothing. |
| `content.context` | All the metadata associated with the document or record (found in the database, but not included in the envelope). | Nothing. |

By default, `content.context` includes the following:
- `originalCollections`. An array of strings, which are the collection tags already associated with the document.
- `collections`. An array of strings, which are the collection tags to associate with the document.
- `permissions`. An array of objects, which are the permissions required to access the processed document after it is saved to the database. Each object includes the properties `roleId` and `capability`.
- `metadata`. Key-and-value pairs containing metadata to be persisted with the document.
- Additional properties, if ingesting using MLCP.

Note: If ingesting using Hub Central, Gradle, or the REST API, `content.context` is not available and, therefore, the associated metadata cannot be modified.

`options` object. A JSON object containing properties from the following sources:
- The `options` parameter specified, if any, when using Gradle or the Data Hub Client JAR to run the step that calls the custom module.
- The `options` object from the step that calls the custom module.
- The `options` object from the flow containing the step that calls the custom module.
- The `options` object from the step definition associated with the step instance that calls the custom module.

If the same properties occur in two or more sources, the source that is higher in the list takes precedence.
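The precedence rule for `options` can be illustrated with a small sketch. Data Hub resolves the options itself before calling your module, so this code is not something you need to write; it only shows how a higher source overrides a lower one. The function name `resolveOptions`, the shallow merge, and all sample property names and values are assumptions made for illustration.

```javascript
// Illustration only: the sources are passed from highest to lowest precedence,
// in the same order as the list of sources above.
function resolveOptions(runtimeOptions, stepOptions, flowOptions, stepDefinitionOptions) {
  // Object.assign lets later arguments overwrite earlier ones, so the
  // lowest-precedence source goes first and the highest-precedence source last.
  return Object.assign({}, stepDefinitionOptions, flowOptions, stepOptions, runtimeOptions);
}

// Hypothetical sample values.
const resolved = resolveOptions(
  { batchSize: 50 },                                 // options parameter at run time
  { batchSize: 100, targetDatabase: "final" },       // step
  { targetDatabase: "staging", logLevel: "debug" },  // flow
  { logLevel: "info", threadCount: 4 }               // step definition
);

console.log(resolved);
```

In this sketch, `batchSize` comes from the run-time parameter, `targetDatabase` from the step, `logLevel` from the flow, and `threadCount` from the step definition.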
Required Outputs
Your module must return the following:
- For a Custom-Ingestion step, a `Content` object.
- For all other custom steps, a `Content` object or an array of `Content` objects.
Each Content object must contain the processed data to be saved to the database and must consist of the following:
| Component | Value |
|---|---|
| `content.uri` | The URI of the document to save to the database. If a document with the same URI already exists, it will be overwritten. |
| `content.value` | The document to save at the URI specified by `content.uri`. |
| `content.context` | All the metadata to associate with the document or record. |
| `content.provenance` | (Optional) Additional property-level provenance information to store if the provenance granularity is set to fine. |
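The input and output shapes described above can be sketched as a single function. This is a minimal sketch, not the module that Data Hub generates: the function name `main`, the hypothetical `sourceName` option, and the added `source` property are assumptions for illustration, and `value` is treated as a plain JavaScript object rather than a database document node.

```javascript
// Sketch of a custom step entry point that accepts one Content object or an
// array of them and returns the same shape it received.
function main(content, options) {
  // If the step property acceptsBatch is true, content may be an array.
  const isBatch = Array.isArray(content);
  const items = isBatch ? content : [content];

  const processed = items.map((item) => ({
    // The URI under which the processed document will be saved.
    uri: item.uri,
    // Hypothetical transformation: copy the document and record where it came from.
    value: Object.assign({}, item.value, { source: options.sourceName || "unknown" }),
    // Pass the metadata through unchanged; it may be absent for some ingestion methods.
    context: item.context
  }));

  return isBatch ? processed : processed[0];
}
```

Returning a new object for each document, rather than mutating the input, keeps the sketch easy to test outside the database.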
Provenance in a Custom Step
You can choose to track property-level provenance information, in addition to the default document-level provenance information. See Set Provenance Granularity Manually.
In a custom step, you can also specify which property-level provenance information is tracked. To do so:
- The `Content` objects returned by your custom module must have a `content.provenance` component.
- `content.provenance` must contain the properties that you want to track and their values.
- The value of `content.provenance` must be in the following format. Data Hub will convert it to the PROV-XML schema before storing it in the JOBS database.

      {
        "<originalURI>": {
          "<originalXPathOrPropertyName>": {
            "destination": "<XPathOrPropertyInNewDocument>",
            "value": "<newValue>"
          }
        }
      }

Example 1: If you mapped the lastName property to the surName property, you can set content.provenance to the following:
    {
      "/26451baa-fe14-471f-bd77-364ac3f64c82.json": {
        "lastName": {
          "destination": "surName",
          "value": "Smith"
        }
      }
    }
Example 2: If your custom module pulled information from multiple documents into the current one, you can combine the provenance information of the source documents into a single content.provenance.
    {
      "/26451baa-fe14-471f-bd77-364ac3f64c82.json": {
        "lastName": {
          "destination": "surName",
          "value": "Smith"
        }
      },
      "/5455fd37-6d96-4883-9349-8e79fa700145.json": {
        "firstName": {
          "destination": "givenName",
          "value": "John"
        }
      }
    }
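
One way to assemble a `content.provenance` value in the required format is to build it from a list of property mappings. The following is a sketch under stated assumptions: the helper `buildProvenance` and the shape of its `mappings` argument (`sourceUri`, `from`, `to`, `value`) are hypothetical and not part of Data Hub.

```javascript
// Builds an object in the content.provenance format from mapping records.
// Each record names the source document, the source property, the destination
// property in the new document, and the new value.
function buildProvenance(mappings) {
  const provenance = {};
  for (const { sourceUri, from, to, value } of mappings) {
    // Group the entries by the URI of the document the value came from.
    if (!provenance[sourceUri]) {
      provenance[sourceUri] = {};
    }
    provenance[sourceUri][from] = { destination: to, value: value };
  }
  return provenance;
}

// Same data as Example 2: values pulled from two source documents.
const provenance = buildProvenance([
  { sourceUri: "/26451baa-fe14-471f-bd77-364ac3f64c82.json", from: "lastName", to: "surName", value: "Smith" },
  { sourceUri: "/5455fd37-6d96-4883-9349-8e79fa700145.json", from: "firstName", to: "givenName", value: "John" }
]);

console.log(JSON.stringify(provenance, null, 2));
```

The resulting object has the same structure as Example 2 and would be assigned to the `provenance` component of the returned `Content` object.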
Note: If `content.provenance` is not in the `Content` objects returned by your custom module and granularity is set to fine for the step, only the default document-level provenance information will be tracked (same as coarse). No error is thrown.
Best Practices
- Although you can code your custom module in XQuery, MarkLogic recommends using JavaScript.
- Use the `DataHub` object, which gives you access to the Data Hub libraries. For example, the `DataHub` object can generate an envelope around an XML or a JSON document.
- You can handle errors in two ways:
  - If you are using an orchestration application (e.g., NiFi),
    - You can throw an error inside the module. Every thrown error is reported back to the orchestrator, where it is logged with the URI of the document that failed.
    - In another step (which can also be in another flow), you can search the orchestrator log for documents with a specific error and fix them accordingly.
  - Instead of throwing an error,
    - You can add a special collection tag to the document that failed.
    - In another step, you can search for documents with that collection tag and fix them accordingly.
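
The second error-handling approach, tagging a failed document instead of throwing, might look like the following sketch. The collection name `custom-step-failed`, the `transform` placeholder, and the helper `processOne` are all hypothetical. The sketch also assumes that `content.context.collections` is available, which is not the case for every ingestion method.

```javascript
// Hypothetical collection tag used to mark documents that failed processing.
const FAILED_COLLECTION = "custom-step-failed";

// Placeholder for the real processing logic; throws when a required property is missing.
function transform(doc) {
  if (!doc.lastName) {
    throw new Error("lastName is required");
  }
  return { surName: doc.lastName };
}

// Processes one Content object. On failure, the original document is kept and
// the failure tag is added so that a later step can find and fix it.
function processOne(item) {
  const context = Object.assign({}, item.context);
  const collections = (context.collections || []).slice();
  let value = item.value;
  try {
    value = transform(item.value);
  } catch (err) {
    // Instead of rethrowing, tag the document as failed.
    collections.push(FAILED_COLLECTION);
  }
  context.collections = collections;
  return { uri: item.uri, value: value, context: context };
}
```

A follow-up step could then select documents in the failure collection, repair them, and remove the tag.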