Custom Step Modules

A custom step allows you to add your own custom functionality anywhere in the flow sequence.

To perform specialized tasks in Data Hub, you can create a custom step that calls a custom step module. The custom step module created by QuickStart when you configure a custom step, or by the Gradle task hubCreateStepDefinition, contains instructions to help you develop your module.

Required Inputs

  • Content object. An object that contains everything that you want to process. This can also be an array of Content objects, one per document or record. Each Content object consists of the following:
    • content.uri. The URI of the document or record to process.
    • content.context. All the metadata associated with the document or record (found in the database, but not included in the envelope). Examples: permissions, collection tags, temporal settings.
    • content.value. The information to process in the custom module.
  • Options object. Custom options, such as parameters passed to the step as JSON key-value pairs (see the sketch after this list).
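
As a minimal sketch (not the generated template), a JavaScript custom step module receives these inputs through its main function, assuming the main(content, options) convention described above. The option name dateFormat below is a hypothetical example of a custom parameter.

    // Minimal sketch of a custom step module entry point.
    function main(content, options) {
      const uri = content.uri;          // URI of the document or record to process
      const metadata = content.context; // permissions, collection tags, temporal settings, ...
      const doc = content.value;        // the information to process

      // "dateFormat" is a hypothetical custom option passed to the step.
      const dateFormat = options.dateFormat || "ISO-8601";

      // ... perform the specialized processing here ...

      return content;
    }

    module.exports = { main };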

Required Outputs

The required outputs depend on the step type:

  • Custom-Ingestion: A Content object.
  • Custom-Mapping, Custom-Mastering, or Custom-Other: A Content object or an array of Content objects.

Each Content object contains the processed data to be written to the database and consists of the following (see the sketch after this list):
  • content.uri. The URI of the document or record to create or overwrite in the database. To create a new document, the URI must be unique; if a document with the same URI already exists, its data is overwritten and the change is logged in the provenance data.
  • content.context. All the metadata to associate with the document or record.
  • content.value. The information to store in the database.
  • content.provenance. (Optional) Additional property-level provenance information to store, if the provenance granularity is set to fine.
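
Continuing the sketch above, the module's return value supplies these outputs. The property names match the list; the transformation itself (copying lastName to surName, to match the provenance example below) is only illustrative.

    function main(content, options) {
      // content.value may be a document node; toObject() (when available) yields
      // a mutable JavaScript object to work with.
      const doc = content.value.toObject ? content.value.toObject() : content.value;

      // Illustrative transformation: copy lastName to surName.
      doc.surName = doc.lastName;

      // Return a Content object describing what to write to the database.
      return {
        uri: content.uri,         // reuse the URI to overwrite; use a unique URI to create a new document
        value: doc,               // the information to store
        context: content.context  // the metadata to associate with the document
      };
    }

    module.exports = { main };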

Provenance in a Custom Step

You can choose to track property-level provenance information, in addition to the default document-level provenance information. See Set Provenance Granularity Manually.

In a custom step, you can also specify which property-level provenance information is tracked. To do so:

  • The Content objects returned by your custom module must have a content.provenance component.
  • content.provenance must contain the properties that you want to track and their values.
  • The value of content.provenance must be in the following format. Data Hub will convert it to the PROV-XML schema before storing it in the JOBS database.
      {
        "<originalURI>": {
          "<originalXPathOrPropertyName>": {
            "destination": "<XPathOrPropertyInNewDocument>",
            "value": "<newValue>"
          }
        }
      }

Example 1: If you mapped the lastName property to the surName property, you can set content.provenance to the following:

   {
     "/26451baa-fe14-471f-bd77-364ac3f64c82.json": {
       "lastName": {
         "destination": "surName",
         "value": "Smith"
       }
     }
   }

Example 2: If your custom module pulled information from multiple documents into the current one, you can combine the provenance information of the source documents into a single content.provenance.

   {
     "/26451baa-fe14-471f-bd77-364ac3f64c82.json": {
       "lastName": {
         "destination": "surName",
         "value": "Smith"
       }
     },
     "/5455fd37-6d96-4883-9349-8e79fa700145.json": {
       "firstName": {
         "destination": "givenName",
         "value": "John"
       }
     }
   }

Note: If the Content objects returned by your custom module do not include content.provenance and the step's provenance granularity is set to fine, only the default document-level provenance information is tracked (the same as with coarse). No error is thrown.
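
As a hedged sketch, a module could attach the Example 1 provenance to its output as follows; the structure follows the format above, and the property-level information is only stored when the step's granularity is fine.

    function main(content, options) {
      const doc = content.value.toObject ? content.value.toObject() : content.value;

      // Map lastName to surName and record property-level provenance for the change.
      doc.surName = doc.lastName;

      content.provenance = {
        [content.uri]: {
          "lastName": {
            "destination": "surName",
            "value": doc.surName
          }
        }
      };

      content.value = doc;
      return content;
    }

    module.exports = { main };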

Best Practices

  • Although you can code your custom module in XQuery, MarkLogic recommends using JavaScript.
  • Use the DataHub object, which gives you access to the Data Hub libraries. For example, the DataHub object can generate an envelope around an XML or a JSON document.
  • You can handle errors in two ways:
    • If you are using an orchestration application (e.g., NiFi),
      1. You can throw an error inside the module. Every thrown error is reported back to the orchestrator, where it is logged with the URI of the document that failed.
      2. In another step (which can also be in another flow), you can search the orchestrator log for documents with a specific error and fix them accordingly.
    • Instead of throwing an error,
      1. You can add a special collection tag to the document that failed (see the sketch after this list).
      2. In another step, you can search for documents with that collection tag and fix them accordingly.
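
For the second approach, the sketch below tags a failed document with a collection instead of throwing. The collection name custom-step-failed is arbitrary, and the exact shape of content.context.collections is an assumption; adjust it to the metadata structure your Data Hub version uses.

    function main(content, options) {
      try {
        // ... specialized processing that may fail ...
        return content;
      } catch (err) {
        // Instead of throwing, tag the failed document with a special collection
        // so a later step (possibly in another flow) can find and repair it.
        content.context = content.context || {};
        content.context.collections =
          (content.context.collections || []).concat("custom-step-failed");
        return content;
      }
    }

    module.exports = { main };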