Editing a Custom Step Module

Custom Step Modules

A custom step lets you add your own functionality anywhere in the flow sequence.

To perform specialized tasks in Data Hub, you can create a custom step that calls a custom step module. If created by any of the following methods, the step module contains instructions to help you customize it.

Required Inputs

Your module must accept the following as input:

  • Content object. A JSON object that contains everything that you want to process. If the step property acceptsBatch is true, the input can be an array of Content objects, one per document or record.

    Each Content object consists of the following:

    content.uri
      Value by default: The URI of the document or record to process.
      Value if the step property sourceQueryIsScript is true: The items returned by the code specified in the step property sourceQuery.

    content.value
      Value by default: The document associated with the URI specified in content.uri.
      Value if the step property sourceQueryIsScript is true: Nothing.

    content.context
      Value by default: All the metadata associated with the document or record (found in the database, but not included in the envelope).
        Note: If ingesting using Hub Central, QuickStart, Gradle, or the REST API, content.context is not available and, therefore, the associated metadata cannot be modified.
        • originalCollections. An array of strings, which are the collection tags already associated with the document.
        • collections. An array of strings, which are the collection tags to associate with the document.
        • permissions. An array of objects, which are the permissions required to access the processed document after it is saved to the database. Each object includes the properties roleId and capability.
        • metadata. Key-and-value pairs containing metadata to be persisted with the document.
        • Additional properties, if ingesting using MLCP.
      Value if the step property sourceQueryIsScript is true: Nothing.
  • options object. A JSON object containing properties from the following sources:
    • The options parameter specified, if any, when using Gradle or the Data Hub Client JAR to run the step that calls the custom module.
    • The options object from the step that calls the custom module.
    • The options object from the flow containing the step that calls the custom module.
    • The options object from the step definition associated with the step instance that calls the custom module.

    If the same properties occur in two or more sources, the source that is higher in the list takes precedence.

    Note: The step configuration in Hub Central format does not include the options object. If you run a Hub Central-formatted step, the appropriate properties are automatically copied to an internal options object, which is handled the same way as the options object in the QuickStart format.
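
The precedence rule for the options object can be pictured with a minimal sketch. This is only an illustration of "the source higher in the list wins", not Data Hub's actual implementation: the function name combineOptions, its parameter names, and the sample property names and values are all hypothetical, and the sketch assumes that only top-level properties are compared.

```javascript
// Illustrative only: combineOptions is a hypothetical helper, not a Data Hub API.
// Sources are listed in the same order as in the documentation (highest first).
function combineOptions(runtimeOptions, stepOptions, flowOptions, stepDefinitionOptions) {
  // Object.assign lets later arguments overwrite earlier ones, so the
  // lowest-precedence source is applied first and the highest-precedence last.
  // Assumption: a shallow, top-level merge.
  return Object.assign(
    {},
    stepDefinitionOptions || {},
    flowOptions || {},
    stepOptions || {},
    runtimeOptions || {}
  );
}

// Sample values only; the property names are placeholders.
const combined = combineOptions(
  { batchSize: 50 },                                      // passed when running the step
  { batchSize: 100, targetDatabase: "data-hub-FINAL" },   // step
  { sourceDatabase: "data-hub-STAGING" },                 // flow
  { batchSize: 200, sourceDatabase: "some-other-db" }     // step definition
);

console.log(combined);
```

In this sketch, batchSize comes from the run-time options because that source is highest in the list, and sourceDatabase comes from the flow because the flow outranks the step definition.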

Required Outputs

Your module must return the following:

  • For a Custom-Ingestion step, a Content object.
  • For all other custom steps, a Content object or an array of Content objects.

Each Content object must contain the processed data to be saved to the database and must consist of the following:

content.uri
  The URI of the document to save to the database. If a document with the same URI already exists, it will be overwritten.

content.value
  The document to save at the URI specified by content.uri.

content.context
  All the metadata to associate with the document or record.
  • collections. An array of strings, which are the collection tags to associate with the document.
  • permissions. An array of objects, which are the permissions required to access the processed document after it is saved to the database. Each object includes the properties roleId and capability.
  • metadata. Key-and-value pairs containing metadata to be persisted with the document.
  • Additional properties, if ingesting using MLCP.

content.provenance
  (Optional) Additional property-level provenance information to store if the provenance granularity is set to fine.
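
As a rough illustration of the input and output contract described above, the following sketch accepts either one Content object or an array of them and returns Content objects with uri, value, and context. Several things here are assumptions rather than facts from this page: the entry point name main, the CommonJS export, the hypothetical option tag, the illustrative property processedBy, and the simplification that content.value is a plain JSON object.

```javascript
// Hypothetical custom step module sketch. The names `main`, `processOne`,
// `options.tag`, and `processedBy` are illustrative assumptions.
function processOne(content, options) {
  const context = content.context || {};
  return {
    uri: content.uri,
    // Copy the incoming document and add one illustrative property.
    // Assumption: content.value is a plain object (or missing).
    value: Object.assign({}, content.value, {
      processedBy: options.tag || "custom-step"
    }),
    context: {
      // Keep existing collection tags and append any supplied in the options.
      collections: (context.collections || []).concat(options.collections || []),
      permissions: context.permissions || [],
      metadata: context.metadata || {}
    }
  };
}

function main(content, options) {
  // If acceptsBatch is true, the input can be an array of Content objects.
  return Array.isArray(content)
    ? content.map((item) => processOne(item, options))
    : processOne(content, options);
}

module.exports = { main };
```

Called with a single Content object, this sketch returns a single Content object, which is the shape required of a Custom-Ingestion step; called with an array, it returns an array of the same length.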

Provenance in a Custom Step

You can choose to track property-level provenance information, in addition to the default document-level provenance information. See Set Provenance Granularity Manually.

In a custom step, you can also specify which property-level provenance information is tracked. To do so, meet the following requirements:

  • The Content objects returned by your custom module must have a content.provenance component.
  • content.provenance must contain the properties that you want to track and their values.
  • The value of content.provenance must be in the following format. Data Hub will convert it to the PROV-XML schema before storing it in the JOBS database.
      {
        "<originalURI>": {
          "<originalXPathOrPropertyName>": {
            "destination": "<XPathOrPropertyInNewDocument>",
            "value": "<newValue>"
          }
        }
      }

Example 1: If you mapped the lastName property to the surName property, you can set content.provenance to the following:

  {
    "/26451baa-fe14-471f-bd77-364ac3f64c82.json": {
      "lastName": {
        "destination": "surName",
        "value": "Smith"
      }
    }
  }

Example 2: If your custom module pulled information from multiple documents into the current one, you can combine the provenance information of the source documents into a single content.provenance.

  {
    "/26451baa-fe14-471f-bd77-364ac3f64c82.json": {
      "lastName": {
        "destination": "surName",
        "value": "Smith"
      }
    },
    "/5455fd37-6d96-4883-9349-8e79fa700145.json": {
      "firstName": {
        "destination": "givenName",
        "value": "John"
      }
    }
  }

Note: If content.provenance is not in the Content objects returned by your custom module and granularity is set to fine for the step, only the default document-level provenance information will be tracked (same as coarse). No error is thrown.
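
One possible way to assemble a content.provenance value in the format shown above is to collect the mappings your module performs and group them by source URI. The helper buildProvenance and the shape of its input records are hypothetical, introduced only for this sketch; the output reproduces Example 2.

```javascript
// Hypothetical helper: builds a content.provenance value from a list of
// property mappings. Neither the function nor the record shape is a Data Hub API.
function buildProvenance(mappings) {
  const provenance = {};
  for (const { sourceUri, sourceProperty, destination, value } of mappings) {
    // Group entries by the URI of the document the value came from.
    provenance[sourceUri] = provenance[sourceUri] || {};
    provenance[sourceUri][sourceProperty] = { destination, value };
  }
  return provenance;
}

const provenance = buildProvenance([
  {
    sourceUri: "/26451baa-fe14-471f-bd77-364ac3f64c82.json",
    sourceProperty: "lastName",
    destination: "surName",
    value: "Smith"
  },
  {
    sourceUri: "/5455fd37-6d96-4883-9349-8e79fa700145.json",
    sourceProperty: "firstName",
    destination: "givenName",
    value: "John"
  }
]);

console.log(JSON.stringify(provenance, null, 2));
```

The resulting object would then be assigned to the provenance component of the Content object that the module returns.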

Best Practices

  • Although you can code your custom module in XQuery, MarkLogic recommends using JavaScript.
  • Use the DataHub object, which gives you access to the Data Hub libraries. For example, the DataHub object can generate an envelope around an XML or a JSON document.
  • You can handle errors in two ways:
    • If you are using an orchestration application (e.g., NiFi),
      1. You can throw an error inside the module. Every thrown error is reported back to the orchestrator, where it is logged with the URI of the document that failed.
      2. In another step (which can also be in another flow), you can search the orchestrator log for documents with a specific error and fix them accordingly.
    • Instead of throwing an error,
      1. You can add a special collection tag to the document that failed.
      2. In another step, you can search for documents with that collection tag and fix them accordingly.
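
The second approach, tagging a failed document instead of throwing, might look like the following sketch. The wrapper processWithErrorTag, the transform callback, and the collection tag "custom-step-error" are hypothetical names chosen for illustration; the sketch assumes that the unprocessed document should still be returned so that a later step can find it by its tag.

```javascript
// Hypothetical wrapper: runs `transform` on a Content object and, if it throws,
// returns the original content with an extra collection tag instead of failing.
function processWithErrorTag(content, transform, errorCollection = "custom-step-error") {
  try {
    return transform(content);
  } catch (err) {
    const context = content.context || {};
    // Copy the existing tags so the input object is not modified.
    const collections = (context.collections || []).slice();
    if (!collections.includes(errorCollection)) {
      collections.push(errorCollection);
    }
    return {
      uri: content.uri,
      value: content.value,
      context: Object.assign({}, context, { collections })
    };
  }
}

// Example: a transform that fails when a required property is missing.
const result = processWithErrorTag(
  { uri: "/x.json", value: { firstName: "John" }, context: { collections: ["raw"] } },
  (c) => {
    if (!c.value.lastName) {
      throw new Error("lastName is required");
    }
    return c;
  }
);

console.log(result.context.collections);
```

A follow-up step could then select documents carrying the error tag and repair them, as described in the list above.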