Flow and Step Configuration Structures (Hub Central Format)

This page describes the contents of a flow configuration file and of step configuration files in Hub Central format.

Configuration Files

Flow configuration files are found in your-project-root/flows.
Step configuration files are found in your-project-root/steps.

Flow Configuration Structure

  {
    "name" : "MyFlow",
    "description" : "This flow contains examples of steps plus additional settings.",
    "batchSize" : 100,
    "threadCount" : 4,
    "stopOnError" : false,
    "options" : {
      "sourceQuery" : null,
      "provenanceGranularityLevel" : "fine"
    },
    "disableJobOutput" : "false",
    "permissions" : "data-hub-common,read,data-hub-common,update",
    "steps" : {
      "1" : {
        "stepId" : "MyIngestionStep-ingestion"
      },
      "2" : {
        "stepId" : "MyMappingStep-mapping"
      },
      "3" : {
        "stepId" : "MyMatchingStep-matching"
      },
      "4" : {
        "stepId" : "MyMergingStep-merging"
      }
      ...
    }
  }
Field Description
name The human-friendly name of the flow.
description A description of the flow.
batchSize The number of documents to process per batch. A smaller batch size provides finer granularity in job reporting. However, a smaller batch size also costs more because of the processing overhead. The recommended batch size for merging is 1.
threadCount The number of threads to use when running a flow. The default is 4.
stopOnError If true and an error is encountered, the flow run ends, the rest of the source data is ignored, and the remaining steps are not performed. Information about the failure is logged in the job document. The default is false.
options Key-value pairs to pass as parameters to custom modules in every step in the flow.
options » sourceQuery The collection, CTS query, or custom script that selects the source data to process.
  • A collection or a CTS query returns a set of URIs.
  • A custom script can return a set of items of any type, including URIs.
Important: If using a custom script, add:
"sourceQueryIsScript" : true
options » provenanceGranularityLevel The granularity of the provenance tracking information: coarse (default) to store document-level provenance information only, fine to store document-level and property-level provenance information, or off to disable provenance tracking in future job runs. Applies only to mapping, matching, merging, mastering, and custom steps. Learn more: About Provenance and Lineage.
disableJobOutput If true, the job document is not created. The default is false.
permissions The permissions required to access the documents created by the flow.

The string must be in the format role,capability,role,capability,..., where capability can be read, insert, update, or execute.
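
The permissions format described above can be checked with a short script before deployment. The following is a minimal sketch, not part of Data Hub: the function name parse_permissions and the error messages are hypothetical, and only the four capabilities listed above are accepted.

```python
# Hypothetical helper: validates a permissions string of the form
# role,capability,role,capability,... as described in the field table.
VALID_CAPABILITIES = {"read", "insert", "update", "execute"}


def parse_permissions(permissions):
    """Return a list of (role, capability) pairs or raise ValueError."""
    tokens = [token.strip() for token in permissions.split(",")]
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of comma-separated values")
    # Even positions hold roles, odd positions hold capabilities.
    pairs = list(zip(tokens[0::2], tokens[1::2]))
    for role, capability in pairs:
        if not role:
            raise ValueError("empty role name")
        if capability not in VALID_CAPABILITIES:
            raise ValueError("unknown capability: " + repr(capability))
    return pairs


if __name__ == "__main__":
    print(parse_permissions("data-hub-common,read,data-hub-common,update"))
```

A string with an odd number of values or an unlisted capability raises ValueError, which makes typos visible before the flow runs.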

steps The list of steps to run within the flow. Each step in the flow has a sequence number and a reference to the step definition file. The step definition differs based on the step type (ingestion or mapping).
step sequence The number which represents the order of the step in the sequence.
stepId The identifier of the step to run within the flow. Must be in the format stepName-stepType.
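
As an illustration of how the steps object is organized, the sketch below reads a flow configuration, orders the steps by sequence number, and splits each stepId into its stepName and stepType parts. It is a minimal sketch under stated assumptions: the function name ordered_steps is hypothetical, the inline JSON is a shortened copy of the example above, and splitting at the last hyphen is an assumption made so that step names containing hyphens stay intact.

```python
import json

# Shortened copy of the flow example; keys are deliberately out of order.
FLOW_JSON = """
{
  "name": "MyFlow",
  "steps": {
    "2": {"stepId": "MyMappingStep-mapping"},
    "1": {"stepId": "MyIngestionStep-ingestion"}
  }
}
"""


def ordered_steps(flow):
    """Return (sequence, step_name, step_type) tuples in run order."""
    result = []
    # Sequence numbers are JSON object keys (strings), so sort numerically.
    for sequence, step in sorted(flow["steps"].items(), key=lambda kv: int(kv[0])):
        # Assumption: the step type is whatever follows the last hyphen.
        step_name, _, step_type = step["stepId"].rpartition("-")
        result.append((int(sequence), step_name, step_type))
    return result


if __name__ == "__main__":
    for entry in ordered_steps(json.loads(FLOW_JSON)):
        print(entry)
```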

Common Step Settings

The following settings are common among all step types.

   {
    "name" : "...",
    "description" : "This is my step.",
    "stepDefinitionName" : "default-...",
    "stepDefinitionType" : "...",
    "stepId" : "...-...",
    "batchSize" : 100,
    "threadCount" : 4,
    "interceptors" : [
      {
        "path": "/uri/of/custom/module/in/modules/database/a.sjs",
        "vars": {
          "myParameter": "myParameterValue"
        },
        "when": "beforeContentPersisted"
      }
    ],
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "data-hub-operator",
      "runBefore" : false
    },
    "stepUpdate" : false,
    "acceptsBatch" : false,
    ...
    "collections" : [ "default-..." ],
    "additionalCollections" : [],
    ...
    "lastUpdated": "2020-06-04T23:01:35.540012Z",
    ...
  }
Field Description
name The name of the step instance.
description A description of the step.
stepDefinitionName The name of a step definition, which could be one of the default step definitions or a custom one. Custom step definitions can be created using QuickStart or using the Gradle task hubCreateStepDefinition.
Tip: If you are customizing a default step type (ingestion, mapping, or mastering), leave the value as default-ingestion, default-mapping, or default-mastering.
stepDefinitionType The type of the step definition: INGESTION, MAPPING, MATCHING, MERGING, MASTERING, or CUSTOM.
stepId The identifier of the step to run within the flow. Must be in the format stepName-stepType.
batchSize The number of documents to process per batch. A smaller batch size provides finer granularity in job reporting. However, a smaller batch size also costs more because of the processing overhead. The recommended batch size for merging is 1.

If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.

threadCount The number of threads to use when running a flow. The default is 4.

If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
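
The fallback rule for batchSize and threadCount can be sketched as follows. This is an illustration of the rule as stated above, not Data Hub code; the function name effective_setting is hypothetical, and the step and flow configurations are assumed to be already parsed into dictionaries.

```python
def effective_setting(step, flow, key):
    """Return the step-level value, or the flow-level value when the
    step-level value is missing, null (None), or 0."""
    value = step.get(key)
    if value is None or value == 0:
        return flow.get(key)
    return value


if __name__ == "__main__":
    flow = {"batchSize": 100, "threadCount": 4}
    step = {"batchSize": 0}  # 0 means "use the flow setting"
    print(effective_setting(step, flow, "batchSize"))
    print(effective_setting(step, flow, "threadCount"))
```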

interceptors An array of JSON objects specifying the custom modules that perform additional processes on a batch after the core step processes are completed and before the results are saved in the database.
customHook A step add-on that performs additional processes in its own transaction before or after the core step transaction. Results are saved within a transaction.

Learn more: Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.

customHook » module The URI of your custom hook module in the MODULES database.
customHook » parameters A JSON object containing parameters to pass to your custom hook module.
customHook » user The user account to use to run the module. Typically, a user with the security role data-hub-operator. Learn more: Users and Roles
customHook » runBefore For a pre-step hook, set to true. For a post-step hook, set to false.
stepUpdate If true, custom modules can make changes directly to records in the database (inserting, deleting, or locking); otherwise, custom modules can make changes indirectly by passing content objects to Data Hub APIs. Direct changes to the database are rarely needed. For a combined mastering step, the default is true. For all other types of steps, the default is false.
acceptsBatch If true, all the records in the batch are processed within a single step run; otherwise, the step is restarted and run for each record in the batch. The default is true.
collections The list of collection tags to assign to the resulting records. The default list is comprised of a collection with the same name as the step.
additionalCollections The list of collection tags to assign to the resulting records, in addition to the default collections.
lastUpdated The timestamp when the artifact was last deployed. The value is automatically updated during deployment.
sourceRecordScope The scope of the source record you want to map to. Choose Instance Only to map to the source instance in your envelope. Choose Entire Record to map to any of the source fields in your envelope. The default is Instance Only. Changes to the source record scope affect existing mapping expressions. Adjust existing mapping expressions to reflect the new paths to your source fields. Options: instanceOnly, entireRecord
attachSourceDocument Specifies whether the source document should be copied into the mapped entity instance. Options: true, false
headers A JSON structure describing metadata to be stored in the envelope header of a document. Special values are substituted at execution time:
  • currentUser: The user running the step.
  • currentDateTime: The date and time at which the batch runs as part of the step.

Ingestion Step Configuration Structure

   {
    "name" : "MyIngestionStep",
    "description" : "This is my ingestion step.",
    "stepDefinitionName" : "default-ingestion",
    "stepDefinitionType" : "ingestion",
    "stepId" : "MyIngestionStep-ingestion",
    "batchSize" : 100,
    "threadCount" : 4,
    "interceptors" : [
      {
        "path": "/uri/of/custom/module/in/modules/database/a.sjs",
        "vars": {
          "myParameter": "myParameterValue"
        },
        "when": "beforeContentPersisted"
      }
    ],
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "data-hub-operator",
      "runBefore" : false
    },
    "stepUpdate" : false,
    "acceptsBatch" : false,
    "inputFilePath" : "path/to/folder",
    "sourceFormat" : "json",
    "targetDatabase" : "data-hub-STAGING",
    "targetFormat" : "json",
    "collections" : [ "default-ingestion" ],
    "additionalCollections" : [],
    "outputURIPrefix" : "output/URI,'substitute/URI'",
    "lastUpdated": "2020-06-04T23:01:35.540012Z"
  }
Field Description
inputFilePath The location of your source files.
sourceFormat The format of your source files: Text, JSON, XML, Binary, or Delimited Text. The default is JSON.
Note: If you select Text, the source file's content must be in either JSON or XML format.
targetDatabase The STAGING database where you want to store the ingested data. The default is data-hub-STAGING.
targetFormat The format of the processed record: Text, JSON, XML, or Binary. The default is JSON.
outputURIPrefix A comma-separated list of replacements used to customize the URIs of the ingested records.

The list is comprised of regular expression patterns and their replacement strings in the format pattern,'string',pattern,'string',.... The replacement strings must be enclosed in single quotes.

For example, if the original URI is in the form "/foo/bar/filename", you can customize it to be "/mydir/filename" using the following comma-separated list:

/foo/bar,'/mydir'

Java's regular expression language is supported.

If Source File Type is set to CSV, the substitution is based on the absolute path of the parent folder; otherwise, the absolute path of the input file. For example, if the Windows path is c:\path\filename, the substitution is based on /c/path/filename.

Learn more: Transforming the Default URI.
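
The pattern,'string' list format can be illustrated with a small script that parses the list and applies each replacement in order. This is a rough sketch only: it uses Python's re module, whose syntax differs from Java's regular expressions in some details (for example, group references in replacement strings), it assumes patterns contain no commas, and it replaces every occurrence of each pattern, which may not match Data Hub's behavior. The function names are hypothetical.

```python
import re

# One pair: a pattern without commas, then a single-quoted replacement.
PAIR = re.compile(r"([^,]+),'([^']*)'")


def parse_replacements(spec):
    """Split "pattern,'string',pattern,'string',..." into (pattern, string) pairs."""
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(spec)]


def apply_replacements(uri, spec):
    """Apply each (pattern, replacement) pair to the URI, in list order."""
    for pattern, replacement in parse_replacements(spec):
        uri = re.sub(pattern, replacement, uri)
    return uri


if __name__ == "__main__":
    # The example from the text: /foo/bar/filename becomes /mydir/filename.
    print(apply_replacements("/foo/bar/filename", "/foo/bar,'/mydir'"))
```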
For the settings shared by all step types, see Common Step Settings.

Mapping Step Configuration Structure

   {
    "name" : "MyMappingStep",
    "description" : "This is my mapping step.",
    "stepDefinitionName" : "default-mapping",
    "stepDefinitionType" : "mapping",
    "stepId" : "MyMappingStep-mapping",
    "batchSize" : 100,
    "threadCount" : 4,
    "interceptors" : [
      {
        "path": "/uri/of/custom/module/in/modules/database/a.sjs",
        "vars": {
          "myParameter": "myParameterValue"
        },
        "when": "beforeContentPersisted"
      }
    ],
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "data-hub-operator",
      "runBefore" : false
    },
    "mappingParametersModulePath": "/custom-modules/custom/user-params.sjs",
    "stepUpdate" : false,
    "acceptsBatch" : false,
    "sourceDatabase" : "data-hub-STAGING",
    "selectedSource": "query",
    "sourceCollection" : "MyIngestionStep",
    "sourceQuery" : "cts.collectionQuery('my-custom-query')",
    "constrainSourceQueryToJob" : false,
    "targetEntityType" : "MyEntity",
    "validateEntity" : false,
    "targetDatabase" : "data-hub-FINAL",
    "targetFormat" : "json",
    "collections" : [ "default-mapping" ],
    "additionalCollections" : [],
    "provenanceGranularityLevel" : "fine",
    "lastUpdated": "2020-06-04T23:01:35.540012Z",
    "properties": {
      "customerId": {
        "sourcedFrom": "CustomerID"
      },
      "name": {
        "sourcedFrom": "concat(Name/FirstName, ' ', Name/LastName)"
      }
    }
  }
Field Description
sourceDatabase The database from which to take the input data. Choose the STAGING database where you stored ingested data. The default is data-hub-STAGING.
selectedSource The way the data set is selected and retrieved: by collection or by query.
sourceCollection The collection tag to use to search for the records to process in this step.
sourceQuery The collection, CTS query, or custom script that selects the source data to process.
  • A collection or a CTS query returns a set of URIs.
  • A custom script can return a set of items of any type, including URIs.
Important: If using a custom script, add:
"sourceQueryIsScript" : true
constrainSourceQueryToJob If true, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if sourceQuery is cts.collectionQuery('example') and constrainSourceQueryToJob is true, the query searches for documents that are in the example collection and were created or modified in the current job. The default is false.
targetEntityType The entity to map against the source data.
validateEntity Indicates whether to validate the mapped entity instance against the schema document based on the entity model, and what action to take. Set to false to skip validation. Set to accept to write the mapped entity instance to the database regardless of the validation result. Set to reject to skip writing the mapped entity instance to the database if the validation failed. The default is false. Learn more: Validation of Mapped Expressions in About Mapping.
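
The three validateEntity values reduce to a simple decision about whether the mapped instance is written. The sketch below restates that rule in code; it is an illustration only, the function name should_write is hypothetical, and treating a missing value (None) like false is an assumption based on the stated default.

```python
def should_write(validate_entity, is_valid):
    """Decide whether a mapped entity instance is written to the database.

    validate_entity: False / "false" (skip validation), "accept", or "reject".
    is_valid: result of schema validation (ignored when validation is skipped).
    """
    if validate_entity in (False, "false", None):
        return True  # validation skipped, instance is always written
    if validate_entity == "accept":
        return True  # written regardless of the validation result
    if validate_entity == "reject":
        return is_valid  # written only if validation succeeded
    raise ValueError("unsupported validateEntity value: " + repr(validate_entity))


if __name__ == "__main__":
    print(should_write("reject", is_valid=False))
    print(should_write("accept", is_valid=False))
```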
targetDatabase The FINAL database where you want to store mapped data. The default is data-hub-FINAL.
targetFormat The format of the processed record: Text, JSON, XML, or Binary. The default is JSON.
provenanceGranularityLevel The granularity of the provenance tracking information: coarse (default) to store document-level provenance information only, fine to store document-level and property-level provenance information, or off to disable provenance tracking in future job runs. Applies only to mapping, matching, merging, mastering, and custom steps. Learn more: About Provenance and Lineage.
properties The mapping of the raw data model against the entity model. Each key is a property of the entity type, and sourcedFrom specifies how to determine the value of the property.
mappingParametersModulePath A path to a JavaScript library in the modules database that defines two functions. The first function, getParameterDefinitions(mappingStep), accepts the mapping step as input and returns an array of objects with name and description properties defining the parameters. The second function, getParameterValues(contentSequence), accepts a sequence of content objects for a mapping batch and returns a JSON object with key/value pairs that align with parameter names and their values.
For the settings shared by all step types, see Common Step Settings.

Matching Step Configuration Structure

  {
    "name" : "MyMatchingStep",
    "description" : "This is my matching step.",
    "stepDefinitionName" : "default-matching",
    "stepDefinitionType" : "matching",
    "stepId" : "MyMatchingStep-matching",
    "batchSize" : 100,
    "threadCount" : 4,
    "matchRulesets" : [
      {
        "name" : "name - Exact",
        "weight" : 3.5,
        "matchRules" : [
          {
            "entityPropertyPath" : "name",
            "matchType" : "exact",
            "options" : {}
          }
        ]
      }
    ],
    "interceptors" : [
      ...
    ],
    ...
  }
Field Description
sourceDatabase The database from which to take the input data. Choose the FINAL database where you stored mapped data. The default is data-hub-FINAL.
selectedSource The way the data set is selected and retrieved: by collection or by query.
sourceCollection The collection tag to use to search for the records to process in this step.
sourceQuery The collection, CTS query, or custom script that selects the source data to process.
  • A collection or a CTS query returns a set of URIs.
  • A custom script can return a set of items of any type, including URIs.
Important: If using a custom script, add:
"sourceQueryIsScript" : true
constrainSourceQueryToJob If true, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if sourceQuery is cts.collectionQuery('example') and constrainSourceQueryToJob is true, the query searches for documents that are in the example collection and were created or modified in the current job. The default is false.
targetEntityType The entity to map against the source data.
targetDatabase The same database you selected in Source Database. The default is data-hub-FINAL.
Important: For split mastering (matching step and merging step), both the source database and the target database for both steps must be the same.
provenanceGranularityLevel The granularity of the provenance tracking information: coarse (default) to store document-level provenance information only, fine to store document-level and property-level provenance information, or off to disable provenance tracking in future job runs. Applies only to mapping, matching, merging, mastering, and custom steps. Learn more: About Provenance and Lineage.
matchRulesets The weighted criteria for what is considered a match.
mappingParametersModulePath A path to a JavaScript library in the modules database that defines two functions. The first function, getParameterDefinitions(mappingStep), accepts the mapping step as input and returns an array of objects with name and description properties defining the parameters. The second function, getParameterValues(contentSequence), accepts a sequence of content objects for a mapping batch and returns a JSON object with key/value pairs that align with parameter names and their values.
For the settings shared by all step types, see Common Step Settings.

Matching Options

   "matchOptions" : {
    "dataFormat" : "json",
    "propertyDefs" : {
      "property" : [
        {
          "name" : "ssn",
          "namespace" : "",
          "localname" : "IdentificationID"
        }
      ]
    },
    "algorithms" : {
      "algorithm" : [
        {
          "name" : "std-reduce",
          "function" : "standard-reduction",
          "namespace" : "",
          "at" : ""
        }
      ]
    },
    "collections" : {
      "content" : [ "my-content-collection" ]
    },
    "scoring" : {
      "add" : [
        {
          "propertyName" : "ssn",
          "weight" : "50"
        }
      ],
      "expand" : [
        {
          "propertyName" : "first-name",
          "algorithmRef" : "thesaurus",
          "weight" : "6",
          "thesaurus" : "/mdm/config/thesauri/first-name-synonyms.xml"
        },
        {
          "propertyName" : "last-name",
          "algorithmRef" : "dbl-metaphone",
          "weight" : "8",
          "dictionary" : "name-dictionary.xml",
          "distanceThreshold" : "50"
        }
      ],
      "reduce" : [
        {
          "algorithmRef" : "std-reduce",
          "weight" : "4",
          "allMatch" : { "property" : ["last-name", "addr1"] }
        }
      ]
    },
    "actions" : {
      "action" : [
        {
          "name" : "my-custom-action",
          "function" : "custom-action",
          "namespace" : "http://marklogic.com/smart-mastering/action",
          "at" : "/custom-action.xqy"
        }
      ]
    },
    "thresholds" : {
      "threshold" : [
        { "above" : "30", "label" : "Possible Match" },
        { "above" : "50", "label" : "Likely Match", "action" : "notify" },
        { "above" : "68", "label" : "Definitive Match", "action" : "merge" },
        { "above" : "75", "label" : "Custom Match", "action" : "my-custom-action" }
      ]
    },
    "tuning" : {
      "maxScan" : 200
    }
  },
Field Description
matchOptions The settings used to find potential matches. Learn more: Smart Mastering Core - Matching Options
dataFormat The format of your source records: Text, JSON, XML, or Binary. The default is JSON.
propertyDefs Definitions of properties to compare.
propertyDefs » property » name The alias for this property definition.
propertyDefs » property » namespace (Optional) The namespace that encompasses the XML element or JSON property (record field) to compare.
propertyDefs » property » localname The name of the XML element or JSON property (record field) to compare.
algorithms Definitions of algorithms that compare values. Each algorithm corresponds to a match type (Exact, Synonym, Double Metaphone, Reduce, Zip, and Custom). The default algorithm is that of the Exact match type, which determines whether two values are equal.
algorithms » algorithm » name The alias for this algorithm definition.
algorithms » algorithm » function The function to run if this algorithm definition is selected.
algorithms » algorithm » namespace (Optional) The namespace of the module that contains the function.
algorithms » algorithm » at The path to the module that contains the function.
collections A set of collections that overrides the default collection used to determine the scope of the dataset being compared. If multiple content elements are specified, the dataset is restricted to an intersection of those collections.
collections » content One or more collections used to determine the dataset to be compared.
scoring Rules (add, expand, reduce) that define how the comparison is scored based on assigned weights. The maximum possible score is the sum of all assigned weights. The match process uses the simple scoring option, with each property's weight controlling how much influence that property has. Learn more: Relevance Scores.
scoring » add Properties whose values are simply compared between the records and, if the values match exactly, the assigned weight is added to the score.
scoring » add » propertyName The alias of a property definition under the matchOptions/propertyDefs node of this step.
scoring » add » weight The weight added to the score if the property values of two records match exactly.
scoring » expand Properties whose values are compared using a different algorithm to determine a match. For example, the property values can be considered a positive match if one is a synonym of the other or if both values phonetically sound alike. If so, the assigned weight is added to the score.
scoring » expand » propertyName The alias of a property definition under the matchOptions/propertyDefs node of this step.
scoring » expand » algorithmRef The alias of an algorithm definition under the matchOptions/algorithms node of this step.
scoring » expand » weight The weight added to the score if the property values of two records are considered a match based on the selected algorithm.
scoring » expand » thesaurus The location of the thesaurus that is stored in a MarkLogic Server database and used to determine synonyms. Learn more: Managing Thesaurus Documents
scoring » expand » dictionary The location of the phonetic dictionary that is stored in a database and used when comparing words phonetically. Learn more: Custom Dictionaries
scoring » expand » distanceThreshold The threshold below which the phonetic difference (distance) between two strings is considered insignificant; i.e., the strings are similar to each other. Learn more: spell functions
scoring » reduce Combinations of properties whose matching values might be a false match. For example, two members of the same family with the same last names and addresses might be misinterpreted as being the same person. In this case, the score is reduced by the assigned weight to give the match less importance.
scoring » reduce » algorithmRef The alias of an algorithm definition under the matchOptions/algorithms node of this step.
scoring » reduce » weight A positive integer that denotes how much to reduce the weight of a match.
scoring » reduce » allMatch The combination of properties that might falsely indicate a match if the values of these properties are equal between two records.
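
To make the add and reduce rules concrete, the sketch below computes a score for two records. It is a simplified illustration, not the Smart Mastering implementation: expand rules (synonyms, phonetic matching) are left out, records are assumed to be dictionaries keyed directly by the property aliases, and the function name match_score and the sample records are hypothetical. Weights are converted from strings because the example configuration stores them as strings.

```python
def match_score(record_a, record_b, scoring):
    """Sum the weights of exactly matching 'add' properties, then subtract
    the weight of each 'reduce' rule whose allMatch properties are all equal."""

    def same(prop):
        return prop in record_a and prop in record_b and record_a[prop] == record_b[prop]

    score = 0
    for rule in scoring.get("add", []):
        if same(rule["propertyName"]):
            score += int(rule["weight"])
    for rule in scoring.get("reduce", []):
        if all(same(prop) for prop in rule["allMatch"]["property"]):
            score -= int(rule["weight"])
    return score


if __name__ == "__main__":
    scoring = {
        "add": [
            {"propertyName": "ssn", "weight": "50"},
            {"propertyName": "last-name", "weight": "8"},
        ],
        "reduce": [
            {"algorithmRef": "std-reduce", "weight": "4",
             "allMatch": {"property": ["last-name", "addr1"]}},
        ],
    }
    # Two family members: same last name and address, different SSN.
    a = {"ssn": "111", "last-name": "Smith", "addr1": "1 Main St"}
    b = {"ssn": "222", "last-name": "Smith", "addr1": "1 Main St"}
    print(match_score(a, b, scoring))  # 8 - 4 = 4
```

In this example the shared last name adds 8, and the reduce rule then subtracts 4 because last name and address match together, which is the family-member case described above.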
actions Custom actions that can be performed when a threshold is reached. Learn more: Custom Match Actions
actions » action The custom action to perform when a threshold is reached.
actions » action » name The alias for this action definition.
actions » action » function The function to run if this action definition is selected.
actions » action » namespace (Optional) The namespace of the module that contains the function.
actions » action » at The path to the module that contains the function.
thresholds Score thresholds that trigger an action.
thresholds » threshold A score threshold definition, including the action to perform if the threshold is exceeded.
thresholds » threshold » above The score threshold. If the match score exceeds this value, the action is performed.
thresholds » threshold » label The alias for this threshold definition.
thresholds » threshold » action The action to perform if the score is above the threshold. Possible values:
  • notify creates a notification record in the FINAL database with information about the match.
  • merge creates a new record with the combined properties of the original records that match, then archives the old records.
  • The alias of an action definition under the matchOptions/actions node of this step.
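
The threshold list can be read as a lookup from score to label and action. The sketch below uses the thresholds from the example configuration and assumes that only the highest exceeded threshold applies; that selection rule is an assumption for illustration, and the function name classify is hypothetical.

```python
# Thresholds copied from the matchOptions example; values are strings there.
THRESHOLDS = [
    {"above": "30", "label": "Possible Match"},
    {"above": "50", "label": "Likely Match", "action": "notify"},
    {"above": "68", "label": "Definitive Match", "action": "merge"},
    {"above": "75", "label": "Custom Match", "action": "my-custom-action"},
]


def classify(score, thresholds):
    """Return the highest threshold whose 'above' value the score exceeds,
    or None if the score exceeds none of them."""
    exceeded = [t for t in thresholds if score > float(t["above"])]
    if not exceeded:
        return None
    return max(exceeded, key=lambda t: float(t["above"]))


if __name__ == "__main__":
    hit = classify(70, THRESHOLDS)
    print(hit["label"], hit.get("action"))  # Definitive Match merge
```

A score equal to a threshold value does not exceed it, so a score of exactly 30 returns None under this reading.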
tuning » maxScan The maximum number of highest scoring potential matches that will be considered for merging.

Merging Step Configuration Structure

   {
    "name" : "MyMergingStep",
    "description" : "This is my merging step.",
    "stepDefinitionName" : "default-merging",
    "stepDefinitionType" : "merging",
    "stepId" : "MyMergingStep-merging",
    "batchSize" : 100,
    "threadCount" : 4,
    "interceptors" : [
      {
        "path": "/uri/of/custom/module/in/modules/database/a.sjs",
        "vars": {
          "myParameter": "myParameterValue"
        },
        "when": "beforeContentPersisted"
      }
    ],
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "data-hub-operator",
      "runBefore" : false
    },
    "stepUpdate" : false,
    "acceptsBatch" : false,
    "sourceDatabase" : "data-hub-FINAL",
    "selectedSource": "query",
    "sourceCollection" : "MyMatchingStep",
    "sourceQuery" : "cts.collectionQuery('my-custom-query')",
    "constrainSourceQueryToJob" : false,
    "targetEntityType" : "MyEntity",
    "targetDatabase" : "data-hub-FINAL",
    "targetFormat" : "json",
    "collections" : [ "MyMergingStep", "MyPersonEntity" ],
    "additionalCollections" : [],
    "provenanceGranularityLevel" : "fine",
    "lastUpdated": "2020-06-04T23:01:35.540012Z",
    "mergeOptions" : { ... }
  }
Field Description
sourceDatabase The database from which to take the input data. Choose the same source database that you selected in the matching step. The default is data-hub-FINAL.
selectedSource The way the data set is selected and retrieved: by collection or by query.
sourceCollection The collection tag to use to search for the records to process in this step.
sourceQuery The collection, CTS query, or custom script that selects the source data to process.
  • A collection or a CTS query returns a set of URIs.
  • A custom script can return a set of items of any type, including URIs.
Important: If using a custom script, add:
"sourceQueryIsScript" : true
constrainSourceQueryToJob If true, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if sourceQuery is cts.collectionQuery('example') and constrainSourceQueryToJob is true, the query searches for documents that are in the example collection and were created or modified in the current job. The default is false.
targetEntityType The entity to map against the source data.
targetDatabase The same database you selected in Source Database. The default is data-hub-FINAL.
Important: For split mastering (matching step and merging step), both the source database and the target database for both steps must be the same.
targetFormat The format of the processed record: Text, JSON, XML, or Binary. The default is JSON.
provenanceGranularityLevel The granularity of the provenance tracking information: coarse (default) to store document-level provenance information only, fine to store document-level and property-level provenance information, or off to disable provenance tracking in future job runs. Applies only to mapping, matching, merging, mastering, and custom steps. Learn more: About Provenance and Lineage.
For the settings shared by all step types, see Common Step Settings.

Merging Options

   "mergeOptions" : {
    "matchOptions" : "mlw-match",
    "propertyDefs" : {
      "properties" : [
        {
          "name" : "ssn",
          "localname" : "IdentificationID",
          "namespace" : ""
        },
        {
          "name" : "shallow",
          "path" : "/es:envelope/es:headers/shallow"
        }
      ],
      "namespaces" : {
        "has" : "has",
        "m" : "http://marklogic.com/smart-mastering/merging",
        "es" : "http://marklogic.com/entity-services"
      }
    },
    "algorithms" : {
      "stdAlgorithm" : {
        "timestamp" : { "path" : "/es:envelope/es:headers/sm:sources/sm:source/sm:dateTime" },
        "namespaces" : {
          "sm" : "http://marklogic.com/smart-mastering",
          "es" : "http://marklogic.com/entity-services"
        }
      },
      "custom" : [
        {
          "name" : "customMerge",
          "function" : "doCustomMerge",
          "namespace" : "http://marklogic.com/smart-mastering/merging",
          "at" : "/custom-merge-xqy.xqy"
        }
      ],
      "collections" : {
        "onMerge" : {
          "function" : "collections",
          "namespace" : "test/merge-collection-algorithm",
          "at" : "/test/suites/customizing-collections/lib/merged-collections.xqy"
        },
        "onArchive" : {
          "remove" : { "collection" : ["Entity"] },
          "add" : { "collection" : ["custom-archived"] }
        },
        "onNoMatch" : {
          "function" : "noMatchCollections",
          "namespace" : "",
          "at" : "/test/suites/customizing-collections/lib/noMatchCollections.sjs"
        },
        "onNotification" : {
          "set" : { "collection" : ["notification"] }
        }
      }
    },
    "mergeStrategies" : [
      {
        "name" : "crm-source-weight",
        "algorithmRef" : "standard",
        "sourceWeights" : [
          {
            "source" : {
              "name" : "CRM",
              "weight" : "10"
            }
          }
        ]
      },
      {
        "name" : "length-weight",
        "algorithmRef" : "standard",
        "maxValues" : "1",
        "length" : { "weight" : "10" }
      }
    ],
    "merging" : [
      {
        "propertyName" : "ssn",
        "maxValues" : "1",
        "maxSources" : "1",
        "strategy" : "crm-source-weight"
      },
      {
        "propertyName" : "name",
        "maxValues" : "1",
        "doubleMetaphone" : {
          "distanceThreshold" : "50"
        },
        "synonymsSupport" : "true",
        "thesaurus" : "/mdm/config/thesauri/first-name-synonyms.xml",
        "length" : { "weight" : "8" }
      },
      {
        "propertyName" : "dob",
        "maxValues" : "1",
        "algorithmRef" : "standard",
        "sourceWeights" : {
          "source" : {
            "name" : "better-source",
            "weight" : "4"
          }
        }
      },
      {
        "default" : "true",
        "strategy" : "crm-source-weight"
      }
    ],
    "tripleMerge" : {
      "function" : "custom-trips",
      "namespace" : "http://marklogic.com/smart-mastering/merging",
      "at" : "/custom-triple-merge.xqy",
      "some-param" : 3
    }
  }
Field Description
mergeOptions The settings used to merge records that match. Learn more: Smart Mastering Core - Merging Options.
matchOptions The name of a set of match options that were previously stored in the server. Learn more: Saving Options.
propertyDefs Definitions of properties to merge.
propertyDefs » properties » name The alias for this property definition.
propertyDefs » properties » localname The name of the XML element or JSON property (record field) to merge.
propertyDefs » properties » namespace (Optional) The namespace that encompasses the XML element or JSON property (record field) to merge.

propertyDefs » properties » path Path leading to the headers or instance sections of records, where the merge properties are defined.
  • XML: /es:envelope/es:headers
  • JSON: /envelope/headers
  • XML: /es:envelope/es:instance
  • JSON: /envelope/instance
Note: Namespaces in the path must be defined in the propertyDefs/namespaces node.
propertyDefs » namespaces Key-value pairs that assign aliases to namespaces. The key is the alias and the value is the full namespace.
algorithms Definitions of algorithms that merge values.
algorithms » stdAlgorithm The standard algorithm that implements the default merge behavior.
algorithms » stdAlgorithm » timestamp The path to a timestamp field within the record. This field is used to determine which values to include in the merged property, based on their recency, up to the maximum number specified in the Max Values field in Merge Options (Standard) or in Merge Strategies. Namespaces used in the path must be defined within the record.
algorithms » stdAlgorithm » namespaces (Optional) Key-value pairs that assign aliases to namespaces. The key is the alias and the value is the full namespace.
algorithms » custom Definitions of custom algorithms that merge values.
algorithms » custom » name The alias for this custom algorithm definition.
algorithms » custom » function The custom merge function to run.
algorithms » custom » namespace (Optional) The namespace of the module that contains the function.
algorithms » custom » at The path to the module that contains the function.
algorithms » collections Rules that specify how collection tags are managed when an event occurs.
algorithms » collections » onMerge How collection tags are applied to the new record that was created when matching records are merged. The default set of collection tags is comprised of:
  • the union of collection tags from the original records,
  • plus mdm-content,
  • plus mdm-merged.
algorithms » collections » onMerge » function The function that manages collection tags if the event occurs.
algorithms » collections » onMerge » namespace (Optional) The namespace of the module that contains the function.
algorithms » collections » onMerge » at The path to the module that contains the function.
algorithms » collections » onArchive How collection tags are applied to the original records after their content has been merged into a new record. The default set of collection tags is comprised of:
  • the collection tags from the original record,
  • minus mdm-content,
  • plus mdm-archived.
algorithms » collections » onArchive » remove One or more collection tags to remove from the default union of tags.
algorithms » collections » onArchive » add One or more collection tags to add to the default union of tags.
algorithms » collections » onNoMatch How collection tags are applied to records that were not merged because no matches were found or because the total matching scores did not exceed the defined thresholds. The default set of collection tags is comprised of:
  • the collection tags from the original record,
  • plus mdm-content.
algorithms » collections » onNoMatch » function The function that manages collection tags if the event occurs.
algorithms » collections » onNoMatch » namespace (Optional) The namespace of the module that contains the function.
algorithms » collections » onNoMatch » at The path to the module that contains the function.
algorithms » collections » onNotification How collection tags are applied to notification records. The default set of collection tags is comprised of mdm-notification only.
algorithms » collections » onNotification » set One or more collection tags to replace the default union of tags.
mergeStrategies Predefined configurations for merging.
mergeStrategies » name The name for the strategy.
mergeStrategies » algorithmRef The alias of an algorithm definition under the mergeOptions/algorithms node of this step.
mergeStrategies » sourceWeights The list of data sources and the weights assigned to them. If the set of matching records comes from more sources than maxSources, the source weights are used to determine which records are included in the merge. Example: If maxSources is set to 1, only records from the highest weighted source are included in the merge.
mergeStrategies » sourceWeights » source » name The name of the source, exactly as shown in the envelopes of the records under headers » sources.
mergeStrategies » sourceWeights » source » weight The weight used to decide the priority of a source when merging.
mergeStrategies » maxValues The maximum number of values to allow in the merged property. The default is 99.
mergeStrategies » length » weight The weight assigned to the length of a string.
merging Rules that specify how to merge records that match.
merging » propertyName The alias of a property definition under the mergeOptions/propertyDefs node of this step.
merging » maxValues The maximum number of values to allow in the merged property. The default is 99.
merging » maxSources The maximum number of data sources from which to get values to merge. For example, to copy values from a single source, set maxSources to 1.
merging » strategy The alias of a strategy definition under the mergeOptions/mergeStrategies node of this step.
merging » doubleMetaphone If this setting is present, the Double Metaphone algorithm is used to determine the values to merge.
merging » doubleMetaphone » distanceThreshold The threshold below which the phonetic difference (distance) between two strings is considered insignificant; i.e., the strings are similar to each other. Learn more: spell functions
merging » synonymsSupport If true, synonyms are included in the list of values to merge. Synonyms are determined using the specified thesaurus.
merging » thesaurus The location of the thesaurus that is stored in a MarkLogic Server database and used to determine synonyms. Learn more: Managing Thesaurus Documents
merging » length The weight assigned to the length of a string.
merging » algorithmRef The alias of an algorithm definition under the mergeOptions/algorithms node of this step.
merging » sourceWeights The list of data sources and the weights assigned to them. If the set of matching records comes from more sources than maxSources, the source weights are used to determine which records are included in the merge. Example: If maxSources is set to 1, only records from the highest weighted source are included in the merge.
merging » sourceWeights » source » name The name of the source, exactly as shown in the envelopes of the records under headers » sources.
merging » sourceWeights » source » weight The weight used to decide the priority of a source when merging.
merging » default If true, the specified strategy is the default. Important: If this setting is present, do not include a propertyName setting.
tripleMerge Definition of an algorithm that merges triples.
tripleMerge » function The function that merges triples.
tripleMerge » namespace The namespace of the module that contains the function.
tripleMerge » at The path to the module that contains the function.
tripleMerge » some-param Parameters, as key-value pairs, to pass to your triple merge function.
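
The interaction between maxSources and sourceWeights can be illustrated with the following fragment. It is a minimal sketch, not a complete mergeOptions block. The strategy name prefer-crm and the source name ERP are hypothetical; the CRM source and the ssn property are taken from the example above. Under the rules described in the table, if the matching records come from both CRM and ERP, setting maxSources to 1 should limit the merged ssn value to records from CRM, the higher weighted source.

```json
{
  "mergeStrategies" : [
    {
      "name" : "prefer-crm",
      "algorithmRef" : "standard",
      "sourceWeights" : [
        { "source" : { "name" : "CRM", "weight" : "10" } },
        { "source" : { "name" : "ERP", "weight" : "4" } }
      ]
    }
  ],
  "merging" : [
    {
      "propertyName" : "ssn",
      "maxValues" : "1",
      "maxSources" : "1",
      "strategy" : "prefer-crm"
    },
    {
      "default" : "true",
      "strategy" : "prefer-crm"
    }
  ]
}
```

The last merging rule has no propertyName because it sets the default strategy, as required by the merging » default setting.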

Custom Step Configuration Structure

   {
    "name" : "MyCustomOtherStep",
    "description" : "This is my custom-other step.",
    "stepDefinitionName" : "custom-step-def",
    "stepDefinitionType" : "custom",
    "stepId" : "MyCustomOtherStep-custom",
    "batchSize" : 100,
    "threadCount" : 4,
    "interceptors" : [
      {
        "path": "/uri/of/custom/module/in/modules/database/a.sjs",
        "vars": {
          "myParameter": "myParameterValue"
        },
        "when": "beforeContentPersisted"
      }
    ],
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "data-hub-operator",
      "runBefore" : false
    },
    "stepUpdate" : false,
    "acceptsBatch" : false,
    "sourceDatabase" : "data-hub-STAGING",
    "selectedSource": "query",
    "sourceCollection" : "my-collection-tag",
    "sourceQuery" : "cts.collectionQuery('my-custom-query')",
    "constrainSourceQueryToJob" : false,
    "targetEntityType" : "MyEntity",
    "targetDatabase" : "data-hub-FINAL",
    "targetFormat" : "json",
    "collections" : [ "my-collection-tag" ],
    "additionalCollections" : [],
    "lastUpdated": "2020-06-04T23:01:35.540012Z"
  }
Field Description
sourceDatabase The database from which to take the input data.
  • Mapping: Choose the STAGING database where you stored ingested data. The default is data-hub-STAGING.
  • Mastering: Choose the FINAL database where you stored mapped data. The default is data-hub-FINAL.
selectedSource The way the data set is selected and retrieved: by collection or by query.
sourceCollection The collection tag to use to search for the records to process in this step.
sourceQuery The collection, CTS query, or custom script that selects the source data to process.
  • A collection or a CTS query returns a set of URIs.
  • A custom script can return a set of items of any type, including URIs.
Important: If using a custom script, add:
"sourceQueryIsScript" : true
constrainSourceQueryToJob If true, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if sourceQuery is cts.collectionQuery('example') and constrainSourceQueryToJob is true, the query searches for documents that are in the example collection and were created or modified in the current job. The default is false.
targetEntityType The entity to map against the source data.
targetDatabase The database where you want to store the processed data.
  • Custom-Ingestion: The STAGING database where you want to store the ingested data. The default is data-hub-STAGING.
  • Custom-Other: The database where you want to store the processed data.
targetFormat The format of the processed record: Text, JSON, XML, or Binary. The default is JSON.
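
The sourceQuery, sourceQueryIsScript, and constrainSourceQueryToJob settings might be combined as in the following sketch of a custom step that selects its input with a script. It assumes that the script is written inline as a server-side JavaScript expression. The cts.uris expression is only an illustrative placeholder, not a value from the example above. The remaining values are copied from that example, and settings that are not relevant here are omitted.

```json
{
  "name" : "MyCustomOtherStep",
  "stepDefinitionName" : "custom-step-def",
  "stepDefinitionType" : "custom",
  "stepId" : "MyCustomOtherStep-custom",
  "sourceDatabase" : "data-hub-STAGING",
  "sourceQuery" : "cts.uris(null, null, cts.collectionQuery('my-collection-tag'))",
  "sourceQueryIsScript" : true,
  "constrainSourceQueryToJob" : false,
  "targetDatabase" : "data-hub-FINAL",
  "targetFormat" : "json"
}
```

With constrainSourceQueryToJob left at false, the selection disregards the job ID. Setting it to true would restrict the selection to documents created or modified in the job that executes the step.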
Common Step Settings