Data Hub Gradle Tasks

The Gradle tasks available in Data Hub Gradle Plugin (ml-data-hub).

Using Gradle in Data Hub

To use the Data Hub Gradle Plugin with Data Hub flows, see Data Hub Gradle Plugin.

To pass parameters to Gradle tasks, use the -P option.

./gradlew taskname ... -PparameterName=parameterValue ... -i
gradlew.bat taskname ... -PparameterName=parameterValue ... -i
Important: If the value of a Gradle parameter contains a blank space, you must enclose the value in double quotation marks. If the value does not contain a blank space, you must not enclose the value in quotation marks.
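The effect of the quoting rule can be checked without running Gradle. The sketch below uses a hypothetical helper, count_args, which is not part of Data Hub or Gradle and only reports how many arguments the shell passes along. The parameter value "My Flow" is an illustrative placeholder.

```shell
# count_args is a hypothetical helper (not part of Data Hub or Gradle).
# It prints how many separate arguments the shell passed to it.
count_args() { echo "$#"; }

# Quoted: the parameter and its whole value arrive as one argument.
quoted=$(count_args -PflowName="My Flow")

# Unquoted: the shell splits at the blank space, so the value is cut
# short and "Flow" arrives as a separate argument.
unquoted=$(count_args -PflowName=My Flow)

echo "quoted: $quoted, unquoted: $unquoted"
```

With the quotation marks, Gradle receives a single -P option; without them, the second word would be treated as a separate argument rather than as part of the value.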

You can use Gradle's -i option to enable info-level logging.

This page provides the list of Gradle tasks available in Data Hub Gradle Plugin (ml-data-hub).
  • Tasks with names starting with ml are customized for Data Hub from the ml-gradle implementation.
  • Tasks with names starting with hub are created specifically for Data Hub.
Tip: You can view the complete list of available Gradle tasks and their descriptions by running gradle tasks.

MarkLogic Data Hub Setup Tasks

These tasks are used to configure MarkLogic Data Hub and manage the data hub.

mlDeploy

Uses hubPreinstallCheck to deploy your Data Hub project.

./gradlew mlDeploy -i
gradlew.bat mlDeploy -i
mlWatch

Extends ml-gradle's WatchTask by ensuring that modules in Data Hub-specific folders (plugins and entity-config) are monitored.

./gradlew mlWatch -i
gradlew.bat mlWatch -i
mlUpdateIndexes

Updates the properties of every database without creating or updating forests. Many properties of a database are related to indexing.

./gradlew mlUpdateIndexes -i
gradlew.bat mlUpdateIndexes -i
hubUpdate

Updates your Data Hub instance to a newer version.

./gradlew hubUpdate -i
gradlew.bat hubUpdate -i

Before you run the hubUpdate task, edit the build.gradle file. Under plugins, change the value of 'com.marklogic.ml-data-hub' version to the new Data Hub version.

For example, if you are updating to Data Hub 5.1.0:
plugins {
  id 'com.marklogic.ml-data-hub' version '5.1.0'
}

For complete instructions on upgrading to a newer Data Hub version, see Upgrading Data Hub.

Running the hubUpdate task with the -i option (info mode) displays specifically what the task does, including configuration settings that changed.

hubInfo

Prints out basic info about the Data Hub configuration.

./gradlew hubInfo -i
gradlew.bat hubInfo -i
hubDeployUserArtifacts

Installs user artifacts, such as entities and mappings, to the MarkLogic server. (Data Hub 4.2 or later)

./gradlew hubDeployUserArtifacts -i
gradlew.bat hubDeployUserArtifacts -i

MarkLogic Data Hub Scaffolding Tasks

These tasks allow you to scaffold projects, entities, flows, and steps.

hubInit

Initializes the current directory as a Data Hub project.

./gradlew hubInit -i
gradlew.bat hubInit -i
hubCreateEntity

Creates a boilerplate entity.

./gradlew hubCreateEntity -PentityName=YourEntityName -i
gradlew.bat hubCreateEntity -PentityName=YourEntityName -i
entityName
(Required) The name of the entity to create.
hubCreateFlow

Creates a boilerplate flow definition file.

./gradlew hubCreateFlow -PflowName=YourFlowName -i
gradlew.bat hubCreateFlow -PflowName=YourFlowName -i
flowName
(Required) The name of the flow to create.
hubCreateStepDefinition

Creates a custom step definition that can be added to a flow as a step.

./gradlew hubCreateStepDefinition -PstepDefName=yourstepname -PstepDefType=yoursteptype -Pformat=[sjs|xqy] -i
gradlew.bat hubCreateStepDefinition -PstepDefName=yourstepname -PstepDefType=yoursteptype -Pformat=[sjs|xqy] -i
stepDefName
(Required) The name of the custom step definition to create.
stepDefType
The type of the step definition to create: ingestion, mapping, mastering, or custom. Default is custom.
format
The format of the module to associate with the new step definition: xqy for XQuery or sjs for JavaScript. Default is sjs.

A JavaScript module (main.sjs) is created and associated with the step definition to perform the processes required for the step.

  • If -Pformat=sjs or if the option is not specified, only the main.sjs file is created, and it contains the processes required for the step.
  • If -Pformat=xqy, two files are created:
    • lib.xqy, which is the XQuery module that you must customize. It contains the processes required for the step; for example, custom code to create an envelope.
    • main.sjs, which acts as a wrapper around lib.xqy.

These modules can be found under your-project-root/src/main/ml-modules.

Tip: If your needs can be met by making minor changes to a step of a default type (ingestion, mapping, or mastering), simply modify the appropriate example step in the flow created by hubCreateFlow. The example steps use the predefined default-ingestion, default-mapping, and default-mastering step definitions, so you won't need to create a new one.
hubGeneratePii

Generates security configuration files for protecting entity properties designated as Personally Identifiable Information (PII). For details, see Managing Personally Identifiable Information.

./gradlew hubGeneratePii -i
gradlew.bat hubGeneratePii -i

MarkLogic Data Hub Flow Management Tasks

These tasks allow you to run flows and clean up.

hubRunFlow

Runs a flow.

./gradlew hubRunFlow -PflowName=YourFlowName -PentityName=YourEntityName -PbatchSize=100 -PthreadCount=4 -PshowOptions=[true|false] -PfailHard=[true|false] -Psteps="1,2" -PjobId="abc123" [ -Poptions="{ customkey: customvalue, ... }" | -PoptionsFile=/path/to.json ] -i
gradlew.bat hubRunFlow -PflowName=YourFlowName -PentityName=YourEntityName -PbatchSize=100 -PthreadCount=4 -PshowOptions=[true|false] -PfailHard=[true|false] -Psteps="1,2" -PjobId="abc123" [ -Poptions="{ customkey: customvalue, ... }" | -PoptionsFile=/path/to.json ] -i
flowName
(Required) The name of the flow to run.
entityName
(Required if the flow includes a mapping step) The name of the entity used with the mapping step.
batchSize
The number of items to include in a batch. Default is 100.
threadCount
The number of threads to run. Default is 4.
showOptions
If true, options that were passed to the command are printed out. Default is false.
failHard
If true, the flow's execution is ended immediately if a step fails. Default is false.
steps
The comma-separated numbers of the steps to run. If not provided, the entire flow is run.
jobId
A unique job ID to associate with the flow run. This option can be used if the flow run is part of a larger process (e.g., a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID will be assigned.
options
A JSON structure containing key-value pairs to be passed as custom parameters to your step modules.
optionsFile
The path to a JSON file containing key-value pairs to be passed as custom parameters to your step modules.

The custom key-value parameters passed to your step module are available through the $options (xqy) or options (sjs) variables inside your step module.
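As a rough illustration of the optionsFile parameter, the sketch below writes custom key-value pairs to a JSON file and then prints the matching hubRunFlow command. The run wrapper, the file name flow-options.json, and the keys and values are all placeholders chosen for this example; only the task name and the -P parameter names come from the list above.

```shell
# Dry-run wrapper (hypothetical): prints the command instead of running it.
# Replace the body with "$@" to execute the command for real.
run() { echo "$@"; }

# Placeholder file name for the custom options.
options_file="flow-options.json"

# Write the custom key-value pairs as JSON (keys and values are examples).
cat > "$options_file" <<'EOF'
{
  "customkey": "customvalue",
  "anotherkey": "anothervalue"
}
EOF

# Run steps 1 and 2 of the flow, passing the options file.
run ./gradlew hubRunFlow -PflowName=YourFlowName -Psteps=1,2 -PoptionsFile="$options_file" -i
```

Keeping the options in a file avoids having to quote a JSON string on the command line, which is the main reason to prefer optionsFile over options for anything beyond one or two keys.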

hubExportJobs

Exports job records. This task does not affect the contents of the staging or final databases.

./gradlew hubExportJobs -PjobIds=ID1,ID2,IDn -Pfilename=export.zip -i
gradlew.bat hubExportJobs -PjobIds=ID1,ID2,IDn -Pfilename=export.zip -i
jobIds
A comma-separated list of job IDs to export.
filename
The name of the zip file to generate, including the file extension. Default is jobexport.zip.
hubDeleteJobs

Deletes job records. This task does not affect the contents of the staging or final databases.

./gradlew hubDeleteJobs -PjobIds=ID1,ID2,IDn -i
gradlew.bat hubDeleteJobs -PjobIds=ID1,ID2,IDn -i
jobIds
(Required) A comma-separated list of job IDs to delete.
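Both job tasks take the same comma-separated jobIds value, so one possible workflow is to export a set of job records before deleting them. The sketch below is a dry run under that assumption: the run wrapper is hypothetical, the job IDs are placeholders, and the export-then-delete order is a suggestion rather than a documented requirement.

```shell
# Dry-run wrapper (hypothetical): prints each command instead of running it.
# Replace the body with "$@" to execute the commands for real.
run() { echo "$@"; }

# Placeholder job IDs; substitute IDs from your own job records.
set -- ID1 ID2 ID3

# "$*" joins the positional parameters with the first character of IFS,
# which yields the comma-separated list that -PjobIds expects.
job_ids=$(IFS=,; echo "$*")

# Export the job records first, then delete the same set.
run ./gradlew hubExportJobs -PjobIds="$job_ids" -Pfilename=export.zip -i
run ./gradlew hubDeleteJobs -PjobIds="$job_ids" -i
```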

MarkLogic Data Hub Record Management Tasks

These tasks allow you to perform actions on specific records outside a flow.

hubMergeEntities

Merges the specified records according to the settings of the specified mastering step.

./gradlew hubMergeEntities -PmergeURIs=URI1,URI2,URIn -PflowName=YourFlowName -Pstep=1 -Ppreview=[true|false] -Poptions={YourStepOptionOverrides} -i
gradlew.bat hubMergeEntities -PmergeURIs=URI1,URI2,URIn -PflowName=YourFlowName -Pstep=1 -Ppreview=[true|false] -Poptions={YourStepOptionOverrides} -i
mergeURIs
(Required) The comma-separated list of the URIs of the records to merge.
flowName
(Required) The name of a flow that includes a mastering step.
step
The step number of the mastering step in the specified flow. This task uses the settings in the mastering step. Default is 1, which assumes that the first step in the flow is a mastering step.
preview
If true, no changes are made to the database and a simulated merged record is returned; otherwise, the merged record is saved to the database. Default is false.
options
A JSON-formatted string that contains the mastering settings to override the settings in the specified mastering step. Default is {}.
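Because preview=true leaves the database unchanged, one cautious way to use this task is to preview the merge first and save it only after inspecting the simulated record. The sketch below prints both commands as a dry run; the run wrapper is hypothetical, and the URIs, flow name, and step number are placeholders.

```shell
# Dry-run wrapper (hypothetical): prints each command instead of running it.
# Replace the body with "$@" to execute the commands for real.
run() { echo "$@"; }

# Placeholder URIs of the records to merge.
merge_uris="URI1,URI2"

# First pass: return a simulated merged record without changing the database.
run ./gradlew hubMergeEntities -PmergeURIs="$merge_uris" -PflowName=YourFlowName -Pstep=1 -Ppreview=true -i

# Second pass: save the merged record to the database.
run ./gradlew hubMergeEntities -PmergeURIs="$merge_uris" -PflowName=YourFlowName -Pstep=1 -Ppreview=false -i
```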
hubUnmergeEntities

Reverses the set of merges that created the specified merged record.

./gradlew hubUnmergeEntities -PmergeURI=URIofMergedRecord -PretainAuditTrail=[true|false] -PblockFutureMerges=[true|false] -i
gradlew.bat hubUnmergeEntities -PmergeURI=URIofMergedRecord -PretainAuditTrail=[true|false] -PblockFutureMerges=[true|false] -i
mergeURI
(Required) The URI of the record to unmerge.
retainAuditTrail
If true, the merged record will be moved to an archive collection; otherwise, it will be deleted. Default is true.
blockFutureMerges
If true, the component records will be blocked from being merged together again. Default is true.
Note: This task archives or deletes the specified merged record and unarchives the component records that were combined to create it. If one of the component records is itself a merged record, that component record remains merged.

MarkLogic Data Hub Uninstall Tasks

mlUndeploy

Removes all components of your data hub from the MarkLogic server, including databases, application servers, forests, and users.

./gradlew mlUndeploy -Pconfirm=true -i
gradlew.bat mlUndeploy -Pconfirm=true -i

Legacy (DHF 4.x) Tasks

hubCreateInputFlow

Creates a legacy (DHF 4.x) input flow. The resulting DHF 4.x flow must be executed using hubRunLegacyFlow.

./gradlew hubCreateInputFlow -PentityName=YourEntityName -PflowName=YourFlowName -PdataFormat=[xml|json] -PpluginFormat=[xqy|sjs] -i
gradlew.bat hubCreateInputFlow -PentityName=YourEntityName -PflowName=YourFlowName -PdataFormat=[xml|json] -PpluginFormat=[xqy|sjs] -i
entityName
(Required) The name of the entity that owns the flow.
flowName
(Required) The name of the input flow to create.
dataFormat
xml or json. Default is json.
pluginFormat
xqy or sjs. The plugin programming language.
hubCreateHarmonizeFlow

Creates a legacy (DHF 4.x) harmonization flow. The resulting DHF 4.x flow must be executed using hubRunLegacyFlow.

./gradlew hubCreateHarmonizeFlow -PentityName=YourEntityName -PflowName=YourFlowName -PdataFormat=[xml|json] -PpluginFormat=[xqy|sjs] -PmappingName=yourmappingname -i
gradlew.bat hubCreateHarmonizeFlow -PentityName=YourEntityName -PflowName=YourFlowName -PdataFormat=[xml|json] -PpluginFormat=[xqy|sjs] -PmappingName=yourmappingname -i
entityName
(Required) The name of the entity that owns the flow.
flowName
(Required) The name of the harmonize flow to create.
dataFormat
xml or json. Default is json.
pluginFormat
xqy or sjs. The plugin programming language.
mappingName
The name of a model-to-model mapping to use during code generation.
hubRunLegacyFlow

Runs a (legacy) DHF 4.x harmonization flow.

./gradlew hubRunLegacyFlow -PentityName=YourEntityName -PflowName=YourFlowName -PbatchSize=100 -PthreadCount=4 -PsourceDB=data-hub-STAGING -PdestDB=data-hub-FINAL -PshowOptions=[true|false] -Pdhf.YourKey=YourValue -i
gradlew.bat hubRunLegacyFlow -PentityName=YourEntityName -PflowName=YourFlowName -PbatchSize=100 -PthreadCount=4 -PsourceDB=data-hub-STAGING -PdestDB=data-hub-FINAL -PshowOptions=[true|false] -Pdhf.YourKey=YourValue -i
entityName
(Required) The name of the entity containing the harmonize flow.
flowName
(Required) The name of the harmonize flow to run.
batchSize
The number of items to include in a batch. Default is 100.
threadCount
The number of threads to run. Default is 4.
sourceDB
The name of the database to run against. Default is the name of your staging database.
destDB
The name of the database to put harmonized results into. Default is the name of your final database.
showOptions
Whether to print out options that were passed in to the command. Default is false.
dhf.YourKey
The value to associate with your key. These key-value pairs are passed as custom parameters to your flow. You can pass additional key-value pairs as separate options:
./gradlew hubRunLegacyFlow ... -Pdhf.YourKeyA=YourValueA -Pdhf.YourKeyB=YourValueB ...

The custom key-value parameters passed to your step module are available through the $options (xqy) or options (sjs) variables inside your step module.
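When a legacy flow needs several custom parameters, the -Pdhf.YourKey options can be assembled in a loop instead of being typed one by one. The following is a minimal sketch that prints the resulting hubRunLegacyFlow command as a dry run. The run wrapper is hypothetical, the keys and values are placeholders, and the sketch assumes that no value contains a blank space.

```shell
# Dry-run wrapper (hypothetical): prints the command instead of running it.
# Replace the body with "$@" to execute the command for real.
run() { echo "$@"; }

# Build one -Pdhf.<key>=<value> option per placeholder pair.
dhf_args=""
for pair in YourKeyA=YourValueA YourKeyB=YourValueB; do
  dhf_args="$dhf_args -Pdhf.$pair"
done

# $dhf_args is left unquoted on purpose so that the shell splits it into
# one argument per option; this only works for values without blank spaces.
run ./gradlew hubRunLegacyFlow -PentityName=YourEntityName -PflowName=YourFlowName $dhf_args -i
```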