Create Steps Using Gradle - HC Format

Overview

A typical Data Hub data flow involves the following operations:
  1. Load (ingest) your raw data into MarkLogic Server.
  2. Create an entity model to standardize your data fields.
  3. Map the fields in your raw data to the fields of the entity model.
  4. (Optional) Match and merge duplicates.

You can use Gradle tasks to create a flow configuration file in HC format and deploy it to DHS. Then you can run the steps within Hub Central.

You can customize any of these steps by adding interceptors. You can also replace the default steps entirely by creating a custom step that uses your own custom module.

After deployment, the steps appear in Hub Central as follows:

Step type                     Hub Central area
Loading, Custom-Ingestion     Load
Mapping, Matching, Merging    Curate
Custom                        Curate

To find a step in the Curate area:
  1. In the list of entity types,
    • If the step is associated with an entity, expand the associated entity type.
    • If the step is not associated with any entity, expand No Entity Type.
  2. Open the appropriate tab according to the step type.

Note: Other custom steps (created as Custom-Mapping, Custom-Mastering, or Custom-Other in QuickStart) are typed simply as Custom in Hub Central.

Before you begin

You need:

You must be assigned the following security roles:

  • In your local environment:
    • To create flows and steps in Gradle: data-hub-developer
  • In your DHS environment:
    • To view, create, edit, or delete a step: Hub Central Developer or Hub Central Curator
    • To view an existing Custom step (converted from Custom-Mapping, Custom-Mastering, or Custom-Other): Hub Central Modeler, Hub Central Developer, Hub Central Operator, or Hub Central Curator
    • To add a step to a flow: Hub Central Developer or Hub Central Curator
    • To run a step: Hub Central Operator or Hub Central Curator

Alternatively, you can be assigned any role that inherits the required role. See Users and Roles.

About this task

Before creating a step using Gradle, you need:

Procedure

  1. Using Gradle, create the step.

    At your project root, run the Gradle task hubCreateStep.

    On Linux or macOS:
    ./gradlew hubCreateStep -PstepName=yourstepname -PstepType=[ingestion|mapping|matching|merging|custom] -PstepDefName=yourstepdefinitionname -PentityType=myEntityTypeName -i

    On Windows:
    gradlew.bat hubCreateStep -PstepName=yourstepname -PstepType=[ingestion|mapping|matching|merging|custom] -PstepDefName=yourstepdefinitionname -PentityType=myEntityTypeName -i

    stepName
    (Required) The name of the step to create based on a step definition.
    stepType
    (Required) The type of step to create: ingestion, mapping, matching, merging, or custom. For Custom-Ingestion, use ingestion and specify stepDefName.
    stepDefName
    The name of the step definition to create. Allowed only if stepType is ingestion or custom. The specified step definition and its associated module are created and used.
    entityType
    (Required if stepType is mapping) The name of the entity type to associate with the step.

    If you run this task while connected to Data Hub in DHS, the resulting artifacts are automatically deployed. If not connected, a connection exception is thrown.

  2. Configure the settings in the new step.
  3. For Custom-Ingestion and Custom steps, edit the created module.
  4. Deploy your step and other artifacts to DHS.

    At your project root, run the Gradle task hubDeploy or hubDeployAsDeveloper.

    Learn more about hubDeploy and hubDeployAsDeveloper.

  5. Add the step to your flow.

    At your project root, run the Gradle task hubAddStepToFlow.

    On Linux or macOS:
    ./gradlew hubAddStepToFlow -PflowName=yourflowname -PstepName=yourstepname -PstepType=[ingestion|mapping|matching|merging|custom] -i

    On Windows:
    gradlew.bat hubAddStepToFlow -PflowName=yourflowname -PstepName=yourstepname -PstepType=[ingestion|mapping|matching|merging|custom] -i

    flowName
    (Required) The name of the flow to add the step to.
    stepName
    (Required) The name of the step to add to the flow.
    stepType
    (Required) The type of step to add to the flow: ingestion, mapping, matching, merging, or custom.

    If you run this task while connected to Data Hub in DHS, the resulting artifacts are automatically deployed. If not connected, a connection exception is thrown.

    Important: The Matching step must be executed before the Merging step. The Gradle task hubAddStepToFlow adds each new step reference to the end of the step sequence. If your steps are not added in the correct order, you can edit the sequence numbers of the steps in the flow configuration. Each sequence number must be unique within the flow configuration.
  6. Redeploy to DHS as needed.
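
The parameter rules for hubCreateStep in step 1 can be checked before the task is run. The following is a minimal sketch, not part of Data Hub: build_create_step_cmd is a hypothetical helper, and the step and entity names (mapCustomers, Customer) are placeholders. It only prints the command line, so nothing is sent to DHS.

```shell
#!/bin/sh
# Sketch: assemble a hubCreateStep command line and enforce the parameter
# rules listed in step 1. build_create_step_cmd is a hypothetical helper.
# Arguments: stepName stepType entityType stepDefName (use "" to omit).

build_create_step_cmd() {
  step_name=$1
  step_type=$2
  entity_type=$3
  step_def_name=$4

  # stepName and stepType are required.
  if [ -z "$step_name" ] || [ -z "$step_type" ]; then
    echo "stepName and stepType are required" >&2
    return 1
  fi

  # stepType must be one of the documented values.
  case $step_type in
    ingestion|mapping|matching|merging|custom) ;;
    *) echo "unknown stepType: $step_type" >&2; return 1 ;;
  esac

  # entityType is required when stepType is mapping.
  if [ "$step_type" = mapping ] && [ -z "$entity_type" ]; then
    echo "entityType is required when stepType is mapping" >&2
    return 1
  fi

  # stepDefName is allowed only when stepType is ingestion or custom.
  if [ -n "$step_def_name" ]; then
    case $step_type in
      ingestion|custom) ;;
      *) echo "stepDefName is allowed only for ingestion or custom" >&2
         return 1 ;;
    esac
  fi

  cmd="./gradlew hubCreateStep -PstepName=$step_name -PstepType=$step_type"
  if [ -n "$entity_type" ]; then
    cmd="$cmd -PentityType=$entity_type"
  fi
  if [ -n "$step_def_name" ]; then
    cmd="$cmd -PstepDefName=$step_def_name"
  fi
  printf '%s -i\n' "$cmd"
}

# Print the command for a mapping step (placeholder names).
build_create_step_cmd mapCustomers mapping Customer ""
```

Run as written, the sketch prints ./gradlew hubCreateStep -PstepName=mapCustomers -PstepType=mapping -PentityType=Customer -i. On Windows, the same parameters apply with gradlew.bat in place of ./gradlew.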

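Because hubAddStepToFlow appends each step to the end of the sequence, one way to avoid editing sequence numbers afterward is to add the steps in their intended order. The following is a minimal sketch under that assumption: print_add_step_cmds is a hypothetical helper, and the flow and step names are placeholders. It rejects a list in which a merging step precedes a matching step, then prints one hubAddStepToFlow command per step without running it.

```shell
#!/bin/sh
# Sketch: print hubAddStepToFlow commands in the order given, after checking
# that no merging step comes before a matching step.
# print_add_step_cmds is a hypothetical helper, not part of Data Hub.
# Usage: print_add_step_cmds FLOW_NAME stepName:stepType [stepName:stepType ...]

print_add_step_cmds() {
  flow_name=$1
  shift

  # First pass: validate the name:type format and the matching/merging order.
  seen_merging=no
  for step in "$@"; do
    case $step in
      *:*) ;;
      *) echo "expected stepName:stepType, got: $step" >&2; return 1 ;;
    esac
    step_type=${step#*:}
    case $step_type in
      merging) seen_merging=yes ;;
      matching)
        if [ "$seen_merging" = yes ]; then
          echo "the matching step must come before the merging step" >&2
          return 1
        fi ;;
    esac
  done

  # Second pass: emit one command per step, preserving the given order.
  for step in "$@"; do
    step_name=${step%%:*}
    step_type=${step#*:}
    printf './gradlew hubAddStepToFlow -PflowName=%s -PstepName=%s -PstepType=%s -i\n' \
      "$flow_name" "$step_name" "$step_type"
  done
}

# Placeholder flow with four steps, matching before merging.
print_add_step_cmds customerFlow \
  loadCustomers:ingestion mapCustomers:mapping \
  matchCustomers:matching mergeCustomers:merging
```

The printed commands can be reviewed and then run one at a time from the project root while connected to Data Hub in DHS.
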
What to do next

  1. (Optional) To perform other tasks outside the Data Hub space, you can add interceptors to the step.
  2. Run the flow.