Flow Definition File

Describes the settings in a flow definition file, including the settings for each step.

Overview

The default flow definition file generated by the Gradle task hubCreateFlow includes flow settings, as well as example steps.

You must customize the example steps before running the flow. You can delete the steps you don't need, and you can duplicate the steps if you need multiple steps of the same type. However, you must assign a unique sequence number for each step.
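The uniqueness requirement can be checked before you run a flow. The following is a minimal sketch in Python and is not part of Data Hub; the function name find_step_key_problems is hypothetical. It assumes the flow definition is available as a JSON string and uses the object_pairs_hook parameter of the standard json module to see duplicate keys, which a plain json.loads call would silently collapse into one entry.

```python
import json


def find_step_key_problems(flow_json: str) -> list:
    """Return a list of problems found in the keys of a flow definition.

    Hypothetical helper for illustration only; not part of Data Hub.
    """
    problems = []

    def check_pairs(pairs):
        # Called for every JSON object; `pairs` still contains the
        # duplicates that a dict would drop.
        seen = set()
        for key, _ in pairs:
            if key in seen:
                problems.append(f"duplicate key: {key!r}")
            seen.add(key)
        return dict(pairs)

    flow = json.loads(flow_json, object_pairs_hook=check_pairs)

    # Each step key should be a string holding an integer greater than 0.
    for key in flow.get("steps", {}):
        if not (key.isascii() and key.isdigit()) or int(key) < 1:
            problems.append(f"step key is not a positive integer: {key!r}")
    return problems
```

An empty list means that no duplicate keys were found and that every step key is a positive integer. The duplicate check covers every object in the file, not only the steps node.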

To learn about flows, see About Flows.

To learn about step types, see About Steps.

Flow Settings

   {
    "name" : "your-flow-name",
    "description" : "This is the default flow containing all of the default steps, generated by hubCreateFlow, plus additional settings and examples.",
    "batchSize" : 100,
    "threadCount" : 4,
    "options" : {
      "sourceQuery" : null
    },
    "steps" : {
      "1" : { ... },
      "2" : { ... },
      "3" : { ... },
      "4" : { ... }
    }
  }
Field Description
name The human-friendly name of the flow.
description (Optional) A description of the flow.
batchSize The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts.
threadCount The number of threads to use when running a flow.
options Key-value pairs to pass as parameters to custom modules in every step in the flow.
options » sourceQuery The CTS query to use to select the source data to process. See CTS Query.
steps The steps to be run within the flow. Each step in the flow has a sequence number and a customized copy of the step definition. The step definition differs based on the step type (ingestion, mapping, mastering, or custom).

Ingestion Step Settings

   "1" : {
    "name" : "ingestion-step",
    "description" : "This is my ingestion step.",
    "stepDefinitionName" : "default-ingestion",
    "stepDefinitionType" : "INGESTION",
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "targetDatabase" : "data-hub-STAGING",
      "sourceQuery" : null,
      "outputFormat" : "json",
      "collections" : [ "default-ingestion" ]
    },
    "fileLocations" : {
      "inputFilePath" : "path/to/folder",
      "inputFileType" : "json",
      "outputURIReplacement" : "output/URI,'substitute/URI'"
    }
  },
Field Description
name The name of the step instance.
description A description of the step.
stepDefinitionName The name of the step definition.
stepDefinitionType The type of the step definition: INGESTION, MAPPING, MASTERING, or CUSTOM.
batchSize The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
threadCount The number of threads to use when running a flow. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
customHook Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module The path to your custom hook module.
customHook » parameters Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user The user account to use to run the module. Default is the user running the flow; e.g., flow-operator.
customHook » runBefore For a pre-step hook, set to true. For a post-step hook, set to false.
options » targetDatabase For ingestion, choose the STAGING database where you want to store the ingested data. Default is data-hub-STAGING.
options » sourceQuery The CTS query to use to select the source data to process. See CTS Query.
options » outputFormat The format in which to store your data.
options » collections The collection tags to assign to the resulting records.
fileLocations » inputFilePath The location of your source files.
fileLocations » inputFileType The format of your source files.
fileLocations » outputURIReplacement A comma-separated list of replacements used to customize the URIs of the ingested records. The list is comprised of regular expression patterns and their replacement strings in the format pattern,'string',pattern,'string',.... The replacement strings must be enclosed in single quotes.
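The pattern,'string' format of outputURIReplacement can be illustrated with a short Python sketch. This is not how Data Hub implements the setting: the function names are hypothetical, Python's regular expression dialect may differ from the one the server uses, applying the pairs in order is an assumption, and patterns that themselves contain commas or single quotes are not handled.

```python
import re

# Matches one "pattern,'replacement'" pair; the replacement is single-quoted.
_PAIR = re.compile(r"([^,']+),'([^']*)'")


def parse_uri_replacements(spec: str) -> list:
    """Split "pattern,'string',pattern,'string'" into (pattern, string) pairs."""
    return [(pattern.strip(), replacement)
            for pattern, replacement in _PAIR.findall(spec)]


def apply_uri_replacements(uri: str, spec: str) -> str:
    """Apply each pattern/replacement pair to the URI, in the order listed.

    Sequential application is an assumption made for this sketch.
    """
    for pattern, replacement in parse_uri_replacements(spec):
        uri = re.sub(pattern, replacement, uri)
    return uri


if __name__ == "__main__":
    spec = "output/URI,'substitute/URI'"
    print(apply_uri_replacements("/output/URI/doc1.json", spec))
```

With the value from the example step, a hypothetical URI such as /output/URI/doc1.json would become /substitute/URI/doc1.json under these assumptions.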

Mapping Step Settings

   "2" : {
    "name" : "mapping-step",
    "description" : "This is my mapping step.",
    "stepDefinitionName" : "default-mapping",
    "stepDefinitionType" : "MAPPING",
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "sourceDatabase" : "data-hub-STAGING",
      "targetDatabase" : "data-hub-FINAL",
      "sourceQuery" : "cts.collectionQuery('default-ingestion')",
      "collections" : [ "default-mapping" ],
      "targetEntity" : "entity-name",
      "mapping" : {
        "name" : "mapping-name",
        "version" : 1
      }
    }
  },
Field Description
name The name of the step instance.
description A description of the step.
stepDefinitionName The name of the step definition.
stepDefinitionType The type of the step definition: INGESTION, MAPPING, MASTERING, or CUSTOM.
batchSize The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
threadCount The number of threads to use when running a flow. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
customHook Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module The path to your custom hook module.
customHook » parameters Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user The user account to use to run the module. Default is the user running the flow; e.g., flow-operator.
customHook » runBefore For a pre-step hook, set to true. For a post-step hook, set to false.
options » sourceDatabase For mapping, choose the STAGING database where you obtained the data. Default is data-hub-STAGING.
options » targetDatabase For mapping, choose the FINAL database where you want to store mapped data. Default is data-hub-FINAL.
options » sourceQuery The CTS query to use to select the source data to process. See CTS Query.
options » collections The collection tags to assign to the resulting records.
options » targetEntity The entity to map against the source data.
options » mapping How to map the properties of the targetEntity to the fields of the source data.
options » mapping » name The name of your mapping that is defined in your-project-root/mappings/your-mapping-name/mapping.version.json.
options » mapping » version The version of the mapping to use. Your mapping must be defined in your-project-root/mappings/your-mapping-name/mapping.version.json.
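The fallback rule for batchSize and threadCount (a step value that is missing, 0, or null defers to the flow value) can be sketched in Python as follows. The function name effective_setting is hypothetical, and the sketch assumes the flow definition has already been parsed into a dictionary; it only illustrates the rule and is not Data Hub code.

```python
def effective_setting(flow: dict, step_key: str, name: str):
    """Return the step-level value of `name`, or the flow-level value as a fallback."""
    step_value = flow["steps"][step_key].get(name)
    # A missing key, null (None in Python), and 0 all mean "use the flow setting".
    if step_value in (None, 0):
        return flow.get(name)
    return step_value


if __name__ == "__main__":
    flow = {
        "batchSize": 100,
        "threadCount": 4,
        "steps": {
            "1": {"batchSize": 50, "threadCount": 0},
            "2": {"batchSize": None},
        },
    }
    print(effective_setting(flow, "1", "batchSize"))    # step value is used
    print(effective_setting(flow, "1", "threadCount"))  # 0 falls back to the flow
    print(effective_setting(flow, "2", "batchSize"))    # null falls back to the flow
```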

Mastering Step Settings

   "3" : {
    "name" : "mastering-step",
    "description" : "This is my mastering step.",
    "stepDefinitionName" : "default-mastering",
    "stepDefinitionType" : "MASTERING",
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "sourceDatabase" : "data-hub-FINAL",
      "targetDatabase" : "data-hub-FINAL",
      "sourceQuery" : "cts.andQuery([cts.collectionQuery('default-mapping'),cts.collectionQuery('mdm-content')])",
      "collections" : [ "default-mastering", "mastered" ],
      "targetEntity" : "entity-name",
      "matchOptions" : { ... },
      "mergeOptions" : { ... }
    }
  },
Field Description
name The name of the step instance.
description A description of the step.
stepDefinitionName The name of the step definition.
stepDefinitionType The type of the step definition: INGESTION, MAPPING, MASTERING, or CUSTOM.
batchSize The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
threadCount The number of threads to use when running a flow. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
customHook Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module The path to your custom hook module.
customHook » parameters Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user The user account to use to run the module. Default is the user running the flow; e.g., flow-operator.
customHook » runBefore For a pre-step hook, set to true. For a post-step hook, set to false.
options » sourceDatabase For mastering, choose the FINAL database where you stored processed data. Default is data-hub-FINAL.
options » targetDatabase For mastering, choose the FINAL database where you want to store mastered data. Default is data-hub-FINAL. Note: For a mastering step, the source database and the target database must be the same. If duplicates are found, the original records are archived and the merged version is added to the same database. If you want the target database to be different, you can create a custom step with a custom module to override the default behavior of the mastering step.
options » sourceQuery The CTS query to use to select the source data to process. See CTS Query.
options » collections The collection tags to assign to the resulting records.
options » targetEntity The entity to map against the source data.
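Because a mastering step requires the source and target databases to be the same, a flow definition can be checked for this before it runs. The following Python sketch is one way to do so; the function name mastering_database_errors is hypothetical, and the sketch assumes the flow has been parsed into a dictionary and that an omitted database falls back to the documented default, data-hub-FINAL.

```python
def mastering_database_errors(flow: dict) -> list:
    """List the mastering steps whose source and target databases differ."""
    errors = []
    for key, step in flow.get("steps", {}).items():
        if step.get("stepDefinitionType") != "MASTERING":
            continue
        options = step.get("options", {})
        # Both settings default to data-hub-FINAL for a mastering step.
        source = options.get("sourceDatabase", "data-hub-FINAL")
        target = options.get("targetDatabase", "data-hub-FINAL")
        if source != target:
            errors.append(
                f"step {key}: sourceDatabase {source!r} "
                f"differs from targetDatabase {target!r}"
            )
    return errors
```

An empty list means every mastering step in the flow satisfies the rule. Steps of other types are ignored.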

Mastering Step - Matching Options

   "matchOptions" : {
    "dataFormat" : "json",
    "propertyDefs" : {
      "property" : [
        {
          "name" : "ssn",
          "namespace" : "",
          "localname" : "IdentificationID"
        }
      ]
    },
    "algorithms" : {
      "algorithm" : [
        {
          "name" : "std-reduce",
          "function" : "standard-reduction",
          "namespace" : "",
          "at" : ""
        }
      ]
    },
    "collections" : {
      "content" : [ "my-content-collection" ]
    },
    "scoring" : {
      "add" : [
        {
          "propertyName" : "ssn",
          "weight" : "50"
        }
      ],
      "expand" : [
        {
          "propertyName" : "first-name",
          "algorithmRef" : "thesaurus",
          "weight" : "6",
          "thesaurus" : "/mdm/config/thesauri/first-name-synonyms.xml"
        },
        {
          "propertyName" : "last-name",
          "algorithmRef" : "dbl-metaphone",
          "weight" : "8",
          "dictionary" : "name-dictionary.xml",
          "distanceThreshold" : "50"
        }
      ],
      "reduce" : [
        {
          "algorithmRef" : "std-reduce",
          "weight" : "4",
          "allMatch" : { "property" : ["last-name", "addr1"] }
        }
      ]
    },
    "actions" : {
      "action" : [
        {
          "name" : "my-custom-action",
          "function" : "custom-action",
          "namespace" : "http://marklogic.com/smart-mastering/action",
          "at" : "/custom-action.xqy"
        }
      ]
    },
    "thresholds" : {
      "threshold" : [
        { "above" : "30", "label" : "Possible Match" },
        { "above" : "50", "label" : "Likely Match", "action" : "notify" },
        { "above" : "68", "label" : "Definitive Match", "action" : "merge" },
        { "above" : "75", "label" : "Custom Match", "action" : "my-custom-action" }
      ]
    },
    "tuning" : {
      "maxScan" : 200
    }
  },
Field Description
matchOptions The settings used to find potential matches. See Smart Mastering Core - Matching Options.
dataFormat The format of your source files.
propertyDefs Definitions of properties to compare.
propertyDefs » property » name The alias for this property definition.
propertyDefs » property » namespace (Optional) The namespace that encompasses the XML element or JSON property (record field) to compare.
propertyDefs » property » localname The name of the XML element or JSON property (record field) to compare.
algorithms Definitions of algorithms that compare values. Each algorithm corresponds to a match type (Exact, Synonym, Double Metaphone, Reduce, Zip, and Custom). The default algorithm is that of the Exact match type, which determines whether two values are equal.
algorithms » algorithm » name The alias for this algorithm definition.
algorithms » algorithm » function The function to run if this algorithm definition is selected.
algorithms » algorithm » namespace (Optional) The namespace of the module that contains the function.
algorithms » algorithm » at The path to the module that contains the function.
collections A set of collections that override the default collection used to determine the scope of the dataset being compared. If multiple content elements are specified, the dataset is restricted to an intersection of those collections.
collections » content One or more collections used to determine the dataset to be compared.
scoring Rules (add, expand, reduce) that define how the comparison is scored based on assigned weights. The maximum possible score is the sum of all of the weight values. The match process uses the simple scoring option, with each property's weight controlling how much influence that property has. See Relevance Scores.
scoring » add Properties whose values are compared directly between the records. If the values match exactly, the assigned weight is added to the score.
scoring » add » propertyName The alias of a property definition under the matchOptions/propertyDefs node of this step.
scoring » add » weight The weight added to the score if the property values of two records match exactly.
scoring » expand Properties whose values are compared using a different algorithm to determine a match. For example, the property values can be considered a positive match if one is a synonym of the other or if both values phonetically sound alike. If so, the assigned weight is added to the score.
scoring » expand » propertyName The alias of a property definition under the matchOptions/propertyDefs node of this step.
scoring » expand » algorithmRef The alias of an algorithm definition under the matchOptions/algorithms node of this step.
scoring » expand » weight The weight added to the score if the property values of two records are considered a match based on the selected algorithm.
scoring » expand » thesaurus The location of the thesaurus that is stored in a MarkLogic Server database and used to determine synonyms. See also: Managing Thesaurus Documents
scoring » expand » dictionary The location of the phonetic dictionary that is stored in a database and used when comparing words phonetically. See also: Custom Dictionaries
scoring » expand » distanceThreshold The threshold below which the phonetic difference (distance) between two strings is considered insignificant; i.e., the strings are similar to each other.
scoring » reduce Combinations of properties whose matching values might be a false match. For example, two members of the same family with the same last names and addresses might be misinterpreted as being the same person. In this case, the score is reduced by the assigned weight to give the match less importance.
scoring » reduce » algorithmRef The alias of an algorithm definition under the matchOptions/algorithms node of this step.
scoring » reduce » weight A positive integer that denotes how much to reduce the weight of a match.
scoring » reduce » allMatch The combination of properties that might falsely indicate a match if the values of these properties are equal between two records.
actions Custom actions that can be performed when a threshold is reached. See Custom Match Actions.
actions » action The custom action to perform when a threshold is reached.
actions » action » name The alias for this action definition.
actions » action » function The function to run if this action definition is selected.
actions » action » namespace (Optional) The namespace of the module that contains the function.
actions » action » at The path to the module that contains the function.
thresholds Score thresholds that trigger an action.
thresholds » threshold A score threshold definition, including the action to perform if the threshold is exceeded.
thresholds » threshold » above The score threshold. If the match score exceeds this value, the action is performed.
thresholds » threshold » label The alias for this threshold definition.
thresholds » threshold » action The action to perform if the score is above the threshold. Possible values:
  • notify creates a notification record in the FINAL database with information about the match.
  • merge creates a new record with the combined properties of the original records that match, then archives the old records.
  • The alias of an action definition under the matchOptions/actions node of this step.
tuning » maxScan The maximum number of highest scoring potential matches that will be considered for merging.
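The arithmetic behind the scoring and thresholds settings can be sketched in Python. This is an illustration only, not the Smart Mastering implementation: the function names are hypothetical, the caller is assumed to already know which rules matched, and choosing the highest threshold that the score exceeds is an assumption about how overlapping thresholds are resolved.

```python
def match_score(matched_add, matched_expand, matched_reduce, scoring):
    """Add the weights of matched add/expand rules, subtract matched reduce rules.

    matched_add and matched_expand are sets of property aliases;
    matched_reduce is a set of algorithm aliases.
    """
    score = 0
    for rule in scoring.get("add", []):
        if rule["propertyName"] in matched_add:
            score += int(rule["weight"])  # weights are strings in the sample
    for rule in scoring.get("expand", []):
        if rule["propertyName"] in matched_expand:
            score += int(rule["weight"])
    for rule in scoring.get("reduce", []):
        if rule["algorithmRef"] in matched_reduce:
            score -= int(rule["weight"])
    return score


def select_threshold(score, thresholds):
    """Return the highest threshold whose "above" value the score exceeds, or None."""
    exceeded = [t for t in thresholds if score > int(t["above"])]
    return max(exceeded, key=lambda t: int(t["above"]), default=None)
```

Using the weights from the sample above, an exact match on ssn (50) plus a phonetic match on last-name (8) gives 58, which exceeds the "Likely Match" threshold of 50 but not the "Definitive Match" threshold of 68. If the std-reduce rule also applies, the score drops to 54.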

Mastering Step - Merging Options

   "mergeOptions" : {
    "matchOptions" : "mlw-match",
    "propertyDefs" : {
      "properties" : [
        {
          "name" : "ssn",
          "localname" : "IdentificationID",
          "namespace" : ""
        },
        {
          "name" : "shallow",
          "path" : "/es:envelope/es:headers/shallow"
        }
      ],
      "namespaces" : {
        "has" : "has",
        "m" : "http://marklogic.com/smart-mastering/merging",
        "es" : "http://marklogic.com/entity-services"
      }
    },
    "algorithms" : {
      "stdAlgorithm" : {
        "timestamp" : { "path" : "/es:envelope/es:headers/sm:sources/sm:source/sm:dateTime" },
        "namespaces" : {
          "sm" : "http://marklogic.com/smart-mastering",
          "es" : "http://marklogic.com/entity-services"
        }
      },
      "custom" : [
        {
          "name" : "customMerge",
          "function" : "doCustomMerge",
          "namespace" : "http://marklogic.com/smart-mastering/merging",
          "at" : "/custom-merge-xqy.xqy"
        }
      ],
      "collections" : {
        "onMerge" : {
          "function" : "collections",
          "namespace" : "test/merge-collection-algorithm",
          "at" : "/test/suites/customizing-collections/lib/merged-collections.xqy"
        },
        "onArchive" : {
          "remove" : { "collection" : ["Entity"] },
          "add" : { "collection" : ["custom-archived"] }
        },
        "onNoMatch" : {
          "function" : "noMatchCollections",
          "namespace" : "",
          "at" : "/test/suites/customizing-collections/lib/noMatchCollections.sjs"
        },
        "onNotification" : {
          "set" : { "collection" : ["notification"] }
        }
      }
    },
    "mergeStrategies" : [
      {
        "name" : "crm-source-weight",
        "algorithmRef" : "standard",
        "sourceWeights" : [
          {
            "source" : {
              "name" : "CRM",
              "weight" : "10"
            }
          }
        ]
      },
      {
        "name" : "length-weight",
        "algorithmRef" : "standard",
        "maxValues" : "1",
        "length" : { "weight" : "10" }
      }
    ],
    "merging" : [
      {
        "propertyName" : "ssn",
        "maxValues" : "1",
        "maxSources" : "1",
        "strategy" : "crm-source-weight"
      },
      {
        "propertyName" : "name",
        "maxValues" : "1",
        "doubleMetaphone" : {
          "distanceThreshold" : "50"
        },
        "synonymsSupport" : "true",
        "thesaurus" : "/mdm/config/thesauri/first-name-synonyms.xml",
        "length" : { "weight" : "8" }
      },
      {
        "propertyName" : "dob",
        "maxValues" : "1",
        "algorithmRef" : "standard",
        "sourceWeights" : {
          "source" : {
            "name" : "better-source",
            "weight" : "4"
          }
        }
      },
      {
        "default" : "true",
        "strategy" : "crm-source-weight"
      }
    ],
    "tripleMerge" : {
      "function" : "custom-trips",
      "namespace" : "http://marklogic.com/smart-mastering/merging",
      "at" : "/custom-triple-merge.xqy",
      "some-param" : 3
    }
  }
Field Description
mergeOptions The settings used to merge records that match. See Smart Mastering Core - Merging Options.
matchOptions The name of a set of match options that were previously stored in the server. See Saving Options.
propertyDefs Definitions of properties to merge.
propertyDefs » properties » name The alias for this property definition.
propertyDefs » properties » localname The name of the XML element or JSON property (record field) to merge.
propertyDefs » properties » namespace (Optional) The namespace that encompasses the XML element or JSON property (record field) to merge.
propertyDefs » properties » path Path leading to the headers or instance sections of records, where the merge properties are defined.
  • XML: /es:envelope/es:headers
  • JSON: /envelope/headers
  • XML: /es:envelope/es:instance
  • JSON: /envelope/instance
Note: Namespaces in the path must be defined in the propertyDefs/namespaces node.
propertyDefs » namespaces Key-value pairs that assign aliases to namespaces. The key is the alias and the value is the full namespace.
algorithms Definitions of algorithms that merge values.
algorithms » stdAlgorithm The standard algorithm that implements the default merge behavior.
algorithms » stdAlgorithm » timestamp The path to a timestamp field within the record. This field is used to determine which values to include in the merged property, based on their recency, up to the maximum number specified in the Max Values field in Merge Options (Standard) or in Merge Strategies. Namespaces used in the path must be defined within the record.
algorithms » stdAlgorithm » namespaces (Optional) Key-value pairs that assign aliases to namespaces. The key is the alias and the value is the full namespace.
algorithms » custom Definitions of custom algorithms that merge values.
algorithms » custom » name The alias for this custom algorithm definition.
algorithms » custom » function The custom merge function to run.
algorithms » custom » namespace (Optional) The namespace of the module that contains the function.
algorithms » custom » at The path to the module that contains the function.
algorithms » collections Rules that specify how collection tags are managed when an event occurs.
algorithms » collections » onMerge How collection tags are applied to the new record that was created when matching records are merged. The default set of collection tags is comprised of:
  • the union of collection tags from the original records,
  • plus mdm-content,
  • plus mdm-merged.
algorithms » collections » onMerge » function The function that manages collection tags if the event occurs.
algorithms » collections » onMerge » namespace (Optional) The namespace of the module that contains the function.
algorithms » collections » onMerge » at The path to the module that contains the function.
algorithms » collections » onArchive How collection tags are applied to the original records after their content has been merged into a new record. The default set of collection tags is comprised of:
  • the collection tags from the original record,
  • minus mdm-content,
  • plus mdm-archived.
algorithms » collections » onArchive » remove One or more collection tags to remove from the default union of tags.
algorithms » collections » onArchive » add One or more collection tags to add to the default union of tags.
algorithms » collections » onNoMatch How collection tags are applied to records that were not merged because no matches were found or because the total matching scores did not exceed the defined thresholds. The default set of collection tags is comprised of:
  • the collection tags from the original record,
  • plus mdm-content.
algorithms » collections » onNoMatch » function The function that manages collection tags if the event occurs.
algorithms » collections » onNoMatch » namespace (Optional) The namespace of the module that contains the function.
algorithms » collections » onNoMatch » at The path to the module that contains the function.
algorithms » collections » onNotification How collection tags are applied to notification records. The default set of collection tags is comprised of mdm-notification only.
algorithms » collections » onNotification » set One or more collection tags to replace the default union of tags.
mergeStrategies Predefined configurations for merging.
mergeStrategies » name The name for the strategy.
mergeStrategies » algorithmRef The alias of an algorithm definition under the mergeOptions/algorithms node of this step.
mergeStrategies » sourceWeights The list of data sources and the weights assigned to them.
mergeStrategies » sourceWeights » source » name The name of the source.
mergeStrategies » sourceWeights » source » weight The weight used to decide the priority of a source when merging.
mergeStrategies » maxValues The maximum number of values to allow in the merged property. Default is 99.
mergeStrategies » length » weight The weight assigned to the length of a string.
merging Rules that specify how to merge records that match.
merging » propertyName The alias of a property definition under the mergeOptions/propertyDefs node of this step.
merging » maxValues The maximum number of values to allow in the merged property. Default is 99.
merging » maxSources The maximum number of data sources from which to get values to merge. For example, to copy values from a single source, set maxSources to 1.
merging » strategy The alias of a strategy definition under the mergeOptions/mergeStrategies node of this step.
merging » doubleMetaphone If this setting is present, the Double Metaphone algorithm is used to determine the values to merge.
merging » doubleMetaphone » distanceThreshold The threshold below which the phonetic difference (distance) between two strings is considered insignificant; i.e., the strings are similar to each other.
merging » synonymsSupport If true, synonyms are included in the list of values to merge. Synonyms are determined using the specified thesaurus.
merging » thesaurus The location of the thesaurus that is stored in a MarkLogic Server database and used to determine synonyms. See also: Managing Thesaurus Documents
merging » length The weight assigned to the length of a string.
merging » algorithmRef The alias of an algorithm definition under the mergeOptions/algorithms node of this step.
merging » sourceWeights The list of data sources and the weights assigned to them. If the set of matching records comes from more sources than maxSources, the source weights are used to determine which records are included in the merge.
merging » sourceWeights » source » name The name of the source.
merging » sourceWeights » source » weight The weight used to decide the priority of a source when merging.
merging » default If true, the specified strategy is the default. Important: If this setting is present, do not include a propertyName setting.
merging » strategy The alias of a strategy definition under the mergeOptions/mergeStrategies node of this step.
tripleMerge Definition of an algorithm that merges triples.
tripleMerge » function The function that merges triples.
tripleMerge » namespace The namespace of the module that contains the function.
tripleMerge » at The path to the module that contains the function.
tripleMerge » some-param Parameters, as key-value pairs, to pass to your triple merge function.
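The default collection tags described for the onMerge, onArchive, onNoMatch, and onNotification events can be expressed as set operations. The following Python sketch restates those defaults; the function names are hypothetical, and applying remove before add in on_archive_tags is an assumption, since the order is not stated here. It does not model custom functions referenced through function, namespace, and at.

```python
def on_merge_tags(originals):
    """Default tags for a merged record: union of the originals plus two mdm tags."""
    return set().union(*originals) | {"mdm-content", "mdm-merged"}


def on_archive_tags(original, remove=(), add=()):
    """Default tags for an archived record, followed by optional remove/add settings."""
    tags = (set(original) - {"mdm-content"}) | {"mdm-archived"}
    # Assumption: tags are removed first, then added.
    return (tags - set(remove)) | set(add)


def on_no_match_tags(original):
    """Default tags for a record that was not merged."""
    return set(original) | {"mdm-content"}


def on_notification_tags(replacement=None):
    """Default tags for a notification record; "set" replaces them entirely."""
    if replacement is not None:
        return set(replacement)
    return {"mdm-notification"}
```

For example, with the onArchive settings from the sample above (remove Entity, add custom-archived), a record originally tagged Entity and mdm-content would end up tagged mdm-archived and custom-archived under these assumptions.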

Custom Step Settings

   "4" : {
    "name" : "custom-step",
    "description" : "",
    "stepDefinitionName" : "custom-step-def",
    "stepDefinitionType" : "CUSTOM",
    "retryLimit" : null,
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "sourceDatabase" : "data-hub-STAGING",
      "targetDatabase" : "data-hub-FINAL",
      "sourceQuery" : "cts.collectionQuery('my-custom-query')",
      "outputFormat" : "json",
      "collections" : [ "my-collection-tag" ]
    }
  }
Field Description
name The name of the step instance.
description A description of the step.
stepDefinitionName The name of the step definition.
stepDefinitionType The type of the step definition: INGESTION, MAPPING, MASTERING, or CUSTOM.
retryLimit The maximum number of times to retry running the step if the previous run failed.
batchSize The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
threadCount The number of threads to use when running a flow. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
customHook Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module The path to your custom hook module.
customHook » parameters Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user The user account to use to run the module. Default is the user running the flow; e.g., flow-operator.
customHook » runBefore For a pre-step hook, set to true. For a post-step hook, set to false.
options » sourceDatabase For a custom step, choose the database that contains the source data to process. Default is data-hub-STAGING.
options » targetDatabase For a custom step, choose the database where you want to store the processed data. Default is data-hub-FINAL.
options » sourceQuery The CTS query to use to select the source data to process. See CTS Query.
options » outputFormat The format in which to store your data.
options » collections The collection tags to assign to the resulting records.

Full Flow Definition

The default flow definition file created by Gradle contains the most common settings. The following flow definition includes additional settings that are also allowed in the flow definition.

Hover over a line in the code below to view information about that property, if available. Click to make the info box sticky; click again to unstick.

   {
    "name" : "your-flow-name",
The human-friendly name of the flow.
"description" : "This is the default flow containing all of the default steps, generated by hubCreateFlow, plus additional settings and examples.",
(Optional) A description of the flow.
"batchSize" : 100,
The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts.
"threadCount" : 4,
The number of threads to use when running a flow.
"options" : {
Key-value pairs to pass as parameters to custom modules in every step in the flow.
"sourceQuery" : null
The CTS query to use to select the source data to process. See CTS Query.
}, "steps" : {
The steps to be run within the flow. Each step in the flow has a sequence number and a customized copy of the step definition. The step definition differs based on the step type (ingestion, mapping, mastering, or custom).
"1" : {
A string containing a number which represents the order of the step in the sequence. Note: The steps can be listed in any order, as long as the keys are unique within the steps node of the flow. Duplicate keys can produced unexpected results. The key number must be greater than 0.
"name" : "ingestion-step",
The name of the step instance.
"description" : "This is my ingestion step.",
A description of the step.
"stepDefinitionName" : "default-ingestion",
The name of the step definition.
"stepDefinitionType" : "INGESTION",
The type of the step definition: INGESTION, MAPPING, MASTERING, or CUSTOM.
"batchSize" : 100,
The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
"threadCount" : 4,
The number of threads to use when running a flow. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
"customHook" : {
Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
"module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
The path to your custom hook module.
"parameters" : {},
Parameters, as key-value pairs, to pass to your custom hook module.
"user" : "flow-operator",
The user account to use to run the module. Default is the user running the flow; e.g., flow-operator.
"runBefore" : false
For a pre-step hook, set to true. For a post-step hook, set to false.
}, "options" : { "targetDatabase" : "data-hub-STAGING",
For ingestion, choose the STAGING database where you want to store the ingested data. Default is data-hub-STAGING.
"sourceQuery" : null,
The CTS query to use to select the source data to process. See CTS Query.
"outputFormat" : "json",
The format that you want your data to be stored as.
"collections" : [ "default-ingestion" ]
The collection tags to assign to the resulting records.
}, "fileLocations" : { "inputFilePath" : "path/to/folder",
The location of your source files.
"inputFileType" : "json",
The format of your source files.
"outputURIReplacement" : "output/URI,'substitute/URI'"
A comma-separated list of replacements used to customize the URIs of the ingested records. The list consists of regular expression patterns and their replacement strings in the format pattern,'string',pattern,'string',.... The replacement strings must be enclosed in single quotes.
} }, "2" : {
A string containing a number that represents the order of the step in the sequence. Note: The steps can be listed in any order, as long as the keys are unique within the steps node of the flow. Duplicate keys can produce unexpected results. The key number must be greater than 0.
"name" : "mapping-step",
The name of the step instance.
"description" : "This is my mapping step.",
A description of the step.
"stepDefinitionName" : "default-mapping",
The name of the step definition.
"stepDefinitionType" : "MAPPING",
The type of the step definition: INGESTION, MAPPING, MASTERING, or CUSTOM.
"batchSize" : 100,
The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
"threadCount" : 4,
The number of threads to use when running a flow. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
"customHook" : {
Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
"module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
The path to your custom hook module.
"parameters" : {},
Parameters, as key-value pairs, to pass to your custom hook module.
"user" : "flow-operator",
The user account to use to run the module. Default is the user running the flow; e.g., flow-operator.
"runBefore" : false
For a pre-step hook, set to true. For a post-step hook, set to false.
}, "options" : { "sourceDatabase" : "data-hub-STAGING",
For mapping, choose the STAGING database where you obtained the data. Default is data-hub-STAGING.
"targetDatabase" : "data-hub-FINAL",
For mapping, choose the FINAL database where you want to store mapped data. Default is data-hub-FINAL.
"sourceQuery" : "cts.collectionQuery('default-ingestion')",
The CTS query to use to select the source data to process. See CTS Query.
"collections" : [ "default-mapping", "mdm-content" ],
The collection tags to assign to the resulting records.
"targetEntity" : "entity-name",
The entity to map against the source data.
"mapping" : {
How to map the properties of the targetEntity to the fields of the source data.
"name" : "mapping-name",
The name of your mapping that is defined in your-project-root/mappings/your-mapping-name/mapping.version.json.
"version" : 1
The version of the mapping to use. Your mapping must be defined in your-project-root/mappings/your-mapping-name/mapping.version.json.
} } }, "3" : {
A string containing a number that represents the order of the step in the sequence. Note: The steps can be listed in any order, as long as the keys are unique within the steps node of the flow. Duplicate keys can produce unexpected results. The key number must be greater than 0.
"name" : "mastering-step",
The name of the step instance.
"description" : "This is my mastering step.",
A description of the step.
"stepDefinitionName" : "default-mastering",
The name of the step definition.
"stepDefinitionType" : "MASTERING",
The type of the step definition: INGESTION, MAPPING, MASTERING, or CUSTOM.
"batchSize" : 100,
The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
"threadCount" : 4,
The number of threads to use when running a flow. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
"customHook" : {
Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
"module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
The path to your custom hook module.
"parameters" : {},
Parameters, as key-value pairs, to pass to your custom hook module.
"user" : "flow-operator",
The user account to use to run the module. Default is the user running the flow; e.g., flow-operator.
"runBefore" : false
For a pre-step hook, set to true. For a post-step hook, set to false.
}, "options" : { "sourceDatabase" : "data-hub-FINAL",
For mastering, choose the FINAL database where you stored processed data. Default is data-hub-FINAL.
"targetDatabase" : "data-hub-FINAL",
For mastering, choose the FINAL database where you want to store mastered data. Default is data-hub-FINAL. Note: For a mastering step, the source database and the target database must be the same. If duplicates are found, the original records are archived and the merged version is added to the same database. If you want the target database to be different, you can create a custom step with a custom module to override the default behavior of the mastering step.
"sourceQuery" : "cts.andQuery([cts.collectionQuery('default-mapping'),cts.collectionQuery('mdm-content')])",
The CTS query to use to select the source data to process. See CTS Query.
"collections" : [ "default-mastering", "mastered" ],
The collection tags to assign to the resulting records.
"targetEntity" : "entity-name",
The entity to map against the source data.
"matchOptions" : {
The settings used to find potential matches. See Smart Mastering Core - Matching Options.
"dataFormat" : "json",
The format of your source files.
"propertyDefs" : {
Definitions of properties to compare.
"property" : [ { "name" : "ssn",
The alias for this property definition.
"namespace" : "",
(Optional) The namespace that encompasses the XML element or JSON property (record field) to compare.
"localname" : "IdentificationID"
The name of the XML element or JSON property (record field) to compare.
} ] }, "algorithms" : {
Definitions of algorithms that compare values. Each algorithm corresponds to a match type (Exact, Synonym, Double Metaphone, Reduce, Zip, and Custom). The default algorithm is that of the Exact match type, which determines whether two values are equal.
"algorithm" : [ { "name" : "std-reduce",
The alias for this algorithm definition.
"function" : "standard-reduction",
The function to run if this algorithm definition is selected.
"namespace" : "",
(Optional) The namespace of the module that contains the function.
"at" : ""
The path to the module that contains the function.
} ] }, "collections" : {
A set of collections that override the default collection used to determine the scope of the dataset being compared. If multiple content elements are specified, the dataset is restricted to an intersection of those collections.
"content" : [ "my-content-collection" ]
One or more collections used to determine the dataset to be compared.
}, "scoring" : {
Rules (add, expand, reduce) that define how the comparison is scored based on assigned weights. The maximum possible score is the sum of the weights of all of the weight attributes. The match process uses the simple scoring option, with the property weight controlling how much influence each should have. See Relevance Scores.
"add" : [
Properties whose values are simply compared between the records and, if the values match exactly, the assigned weight is added to the score.
{ "propertyName" : "ssn",
The alias of a property definition under the matchOptions/propertyDefs node of this step.
"weight" : "50"
The weight added to the score if the property values of two records match exactly.
} ], "expand" : [
Properties whose values are compared using a different algorithm to determine a match. For example, the property values can be considered a positive match if one is a synonym of the other or if both values phonetically sound alike. If so, the assigned weight is added to the score.
{ "propertyName" : "first-name",
The alias of a property definition under the matchOptions/propertyDefs node of this step.
"algorithmRef" : "thesaurus",
The alias of an algorithm definition under the matchOptions/algorithms node of this step.
"weight" : "6",
The weight added to the score if the property values of two records are considered a match based on the selected algorithm.
"thesaurus" : "/mdm/config/thesauri/first-name-synonyms.xml"
The location of the thesaurus that is stored in a MarkLogic Server database and used to determine synonyms. See also: Managing Thesaurus Documents
}, { "propertyName" : "last-name",
The alias of a property definition under the matchOptions/propertyDefs node of this step.
"algorithmRef" : "dbl-metaphone",
The alias of an algorithm definition under the matchOptions/algorithms node of this step.
"weight" : "8",
The weight added to the score if the property values of two records are considered a match based on the selected algorithm.
"dictionary" : "name-dictionary.xml",
The location of the phonetic dictionary that is stored in a database and used when comparing words phonetically. See also: Custom Dictionaries
"distanceThreshold" : "50"
The threshold below which the phonetic difference (distance) between two strings is considered insignificant; i.e., the strings are similar to each other.
} ], "reduce" : [
Combinations of properties whose matching values might be a false match. For example, two members of the same family with the same last names and addresses might be misinterpreted as being the same person. In this case, the score is reduced by the assigned weight to give the match less importance.
{ "algorithmRef" : "std-reduce",
The alias of an algorithm definition under the matchOptions/algorithms node of this step.
"weight" : "4",
A positive integer that denotes how much to reduce the weight of a match.
"allMatch" : { "property" : ["last-name", "addr1"] }
The combination of properties that might falsely indicate a match if the values of these properties are equal between two records.
} ] }, "actions" : {
Custom actions that can be performed when a threshold is reached. See Custom Match Actions.
"action" : [
The custom action to perform when a threshold is reached.
{ "name" : "my-custom-action",
The alias for this action definition.
"function" : "custom-action",
The function to run if this action definition is selected.
"namespace" : "http://marklogic.com/smart-mastering/action",
(Optional) The namespace of the module that contains the function.
"at" : "/custom-action.xqy"
The path to the module that contains the function.
} ] }, "thresholds" : {
Score thresholds that trigger an action.
"threshold" : [
A score threshold definition, including the action to perform if the threshold is exceeded.
{ "above" : "30", "label" : "Possible Match" }, { "above" : "50", "label" : "Likely Match", "action" : "notify" }, { "above" : "68", "label" : "Definitive Match", "action" : "merge" }, { "above" : "75",
The score threshold. If the match score exceeds this value, the action is performed.
"label" : "Custom Match",
The alias for this threshold definition.
"action" : "my-custom-action"
The action to perform if the score is above the threshold. Possible values:
  • notify creates a notification record in the FINAL database with information about the match.
  • merge creates a new record with the combined properties of the original records that match, then archives the old records.
  • The alias of an action definition under the matchOptions/actions node of this step.
} ] }, "tuning" : { "maxScan" : 200
The maximum number of highest scoring potential matches that will be considered for merging.
} }, "mergeOptions" : {
The settings used to merge records that match. See Smart Mastering Core - Merging Options.
"matchOptions" : "mlw-match",
The name of a set of match options that were previously stored in the server. See Saving Options.
"propertyDefs" : {
Definitions of properties to merge.
"properties" : [ { "name" : "ssn",
The alias for this property definition.
"localname" : "IdentificationID",
The name of the XML element or JSON property (record field) to merge.
"namespace" : ""
(Optional) The namespace that encompasses the XML element or JSON property (record field) to merge.
}, { "name" : "shallow",
The alias for this property definition.
"path" : "/es:envelope/es:headers/shallow"
The path to the headers or instance section of the records, where the merge properties are defined.
  • Headers: /es:envelope/es:headers (XML) or /envelope/headers (JSON)
  • Instance: /es:envelope/es:instance (XML) or /envelope/instance (JSON)
Note: Namespaces in the path must be defined in the propertyDefs/namespaces node.
} ], "namespaces" : {
Key-value pairs that assign aliases to namespaces. The key is the alias and the value is the full namespace.
"has" : "has", "m" : "http://marklogic.com/smart-mastering/merging", "es" : "http://marklogic.com/entity-services" } }, "algorithms" : {
Definitions of algorithms that merge values.
"stdAlgorithm" : {
The standard algorithm that implements the default merge behavior.
"timestamp" : { "path" : "/es:envelope/es:headers/sm:sources/sm:source/sm:dateTime" },
The path to a timestamp field within the record. This field is used to determine which values to include in the merged property, based on their recency, up to the maximum number specified in the Max Values field in Merge Options (Standard) or in Merge Strategies. Namespaces used in the path must be defined within the record.
"namespaces" : {
(Optional) Key-value pairs that assign aliases to namespaces. The key is the alias and the value is the full namespace.
"sm" : "http://marklogic.com/smart-mastering", "es" : "http://marklogic.com/entity-services" } }, "custom" : [
Definitions of custom algorithms that merge values.
{ "name" : "customMerge",
The alias for this custom algorithm definition.
"function" : "doCustomMerge",
The custom merge function to run.
"namespace" : "http://marklogic.com/smart-mastering/merging",
(Optional) The namespace of the module that contains the function.
"at" : "/custom-merge-xqy.xqy"
The path to the module that contains the function.
} ], "collections" : {
Rules that specify how collection tags are managed when an event occurs.
"onMerge" : {
How collection tags are applied to the new record that is created when matching records are merged. The default set of collection tags consists of:
  • the union of collection tags from the original records,
  • plus mdm-content,
  • plus mdm-merged.
"function" : "collections",
The function that manages collection tags when the event occurs.
"namespace" : "test/merge-collection-algorithm",
(Optional) The namespace of the module that contains the function.
"at" : "/test/suites/customizing-collections/lib/merged-collections.xqy"
The path to the module that contains the function.
}, "onArchive" : {
How collection tags are applied to the original records after their content has been merged into a new record. The default set of collection tags consists of:
  • the collection tags from the original record,
  • minus mdm-content,
  • plus mdm-archived.
"remove" : { "collection" : ["Entity"] },
One or more collection tags to remove from the default union of tags.
"add" : { "collection" : ["custom-archived"] }
One or more collection tags to add to the default union of tags.
}, "onNoMatch" : {
How collection tags are applied to records that were not merged because no matches were found or because the total matching scores did not exceed the defined thresholds. The default set of collection tags consists of:
  • the collection tags from the original record,
  • plus mdm-content.
"function" : "noMatchCollections",
The function that manages collection tags when the event occurs.
"namespace" : "",
(Optional) The namespace of the module that contains the function.
"at" : "/test/suites/customizing-collections/lib/noMatchCollections.sjs"
The path to the module that contains the function.
}, "onNotification" : {
How collection tags are applied to notification records. The default set of collection tags consists of mdm-notification only.
"set" : { "collection" : ["notification"] }
One or more collection tags to replace the default union of tags.
} } }, "mergeStrategies" : [
Predefined configurations for merging.
{ "name" : "crm-source-weight",
The name for the strategy.
"algorithmRef" : "standard",
The alias of an algorithm definition under the mergeOptions/algorithms node of this step.
"sourceWeights" : [
The list of data sources and the weights assigned to them.
{ "source" : { "name" : "CRM",
The name of the source.
"weight" : "10"
The weight used to decide the priority of a source when merging.
} } ] }, { "name" : "length-weight", "algorithmRef" : "standard", "maxValues" : "1",
The maximum number of values to allow in the merged property. Default is 99.
"length" : { "weight" : "10" }
The weight assigned to the length of a string.
} ], "merging" : [
Rules that specify how to merge records that match.
{ "propertyName" : "ssn",
The alias of a property definition under the mergeOptions/propertyDefs node of this step.
"maxValues" : "1",
The maximum number of values to allow in the merged property. Default is 99.
"maxSources" : "1",
The maximum number of data sources from which to get values to merge. For example, to copy values from a single source, set maxSources to 1.
"strategy" : "crm-source-weight"
The alias of a strategy definition under the mergeOptions/mergeStrategies node of this step.
}, { "propertyName" : "name", "maxValues" : "1", "doubleMetaphone" : {
If this setting is present, the Double Metaphone algorithm is used to determine the values to merge.
"distanceThreshold" : "50"
The threshold below which the phonetic difference (distance) between two strings is considered insignificant; i.e., the strings are similar to each other.
}, "synonymsSupport" : "true",
If true, synonyms are included in the list of values to merge. Synonyms are determined using the specified thesaurus.
"thesaurus" : "/mdm/config/thesauri/first-name-synonyms.xml",
The location of the thesaurus that is stored in a MarkLogic Server database and used to determine synonyms. See also: Managing Thesaurus Documents
"length" : { "weight" : "8" }
The weight assigned to the length of a string.
}, { "propertyName" : "dob", "maxValues" : "1", "algorithmRef" : "standard",
The alias of an algorithm definition under the mergeOptions/algorithms node of this step.
"sourceWeights" : {
The list of data sources and the weights assigned to them. If the set of matching records come from more sources than maxSources, the source weights are used to determine which records are included in the merge.
"source" : { "name" : "better-source",
The name of the source.
"weight" : "4"
The weight used to decide the priority of a source when merging.
} } }, { "default" : "true",
If true, the specified strategy is the default. Important: If this setting is present, do not include a propertyName setting.
"strategy" : "crm-source-weight"
The alias of a strategy definition under the mergeOptions/mergeStrategies node of this step.
} ], "tripleMerge" : {
Definition of an algorithm that merges triples.
"function" : "custom-trips",
The function that merges triples.
"namespace" : "http://marklogic.com/smart-mastering/merging",
The namespace of the module that contains the function.
"at" : "/custom-triple-merge.xqy",
The path to the module that contains the function.
"some-param" : 3
Parameters, as key-value pairs, to pass to your triple merge function.
} } } }, "4" : {
A string containing a number that represents the order of the step in the sequence. Note: The steps can be listed in any order, as long as the keys are unique within the steps node of the flow. Duplicate keys can produce unexpected results. The key number must be greater than 0.
"name" : "custom-step",
The name of the step instance.
"description" : "",
A description of the step.
"stepDefinitionName" : "custom-step-def",
The name of the step definition.
"stepDefinitionType" : "CUSTOM",
The type of the step definition: INGESTION, MAPPING, MASTERING, or CUSTOM.
"retryLimit" : null,
The maximum number of times to retry running the step if the previous run failed.
"batchSize" : 100,
The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
"threadCount" : 4,
The number of threads to use when running a flow. If not defined or if set to 0 or null in the step settings, the value in the flow settings is used.
"customHook" : {
Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
"module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
The path to your custom hook module.
"parameters" : {},
Parameters, as key-value pairs, to pass to your custom hook module.
"user" : "flow-operator",
The user account to use to run the module. Default is the user running the flow; e.g., flow-operator.
"runBefore" : false
For a pre-step hook, set to true. For a post-step hook, set to false.
}, "options" : { "sourceDatabase" : "data-hub-STAGING",
For a custom step, the database that contains the source data to process. Default is data-hub-STAGING.
"targetDatabase" : "data-hub-FINAL",
For a custom step, the database where you want to store the processed data. Default is data-hub-FINAL.
"sourceQuery" : "cts.collectionQuery('my-custom-query')",
The CTS query to use to select the source data to process. See CTS Query.
"outputFormat" : "json",
The format that you want your data to be stored as.
"collections" : [ "my-collection-tag" ]
The collection tags to assign to the resulting records.
} } } }
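
The descriptions of the step keys and of the step-level batchSize and threadCount settings can be illustrated with a small Python sketch. It is not part of Data Hub; the function name effective_steps is hypothetical, and the sketch only assumes what the descriptions above state: step keys are numeric strings greater than 0, steps are ordered by key, and a step value that is missing, 0, or null falls back to the flow value.

```python
import json

def effective_steps(flow):
    """Return (key, batchSize, threadCount) for each step, in numeric key order."""
    result = []
    # Steps can be listed in any order; the numeric value of the key sets the sequence.
    for key in sorted(flow["steps"], key=int):
        if int(key) <= 0:
            raise ValueError(f"step key must be greater than 0: {key!r}")
        step = flow["steps"][key]
        # A missing, 0, or null (None) step value falls back to the flow value.
        batch = step.get("batchSize") or flow["batchSize"]
        threads = step.get("threadCount") or flow["threadCount"]
        result.append((key, batch, threads))
    return result

flow = json.loads("""
{
  "name": "your-flow-name",
  "batchSize": 100,
  "threadCount": 4,
  "steps": {
    "2": {"name": "mapping-step", "batchSize": 0, "threadCount": 8},
    "1": {"name": "ingestion-step", "batchSize": 50, "threadCount": null}
  }
}
""")

for row in effective_steps(flow):
    print(row)
# ('1', 50, 4)
# ('2', 100, 8)
```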
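
The pattern,'string' format described for outputURIReplacement can be sketched in Python as follows. This is an illustration only, not the Data Hub implementation: the helper names are hypothetical, the parser assumes that patterns contain no commas, and replacement strings are treated as literal text (no capture-group references).

```python
import re

# One pattern,'string' pair, optionally followed by a comma.
PAIR = re.compile(r"\s*([^,]+?)\s*,\s*'([^']*)'\s*(?:,|$)")

def parse_replacements(spec):
    """Split "pattern,'string',pattern,'string'" into (pattern, string) pairs."""
    pairs = []
    pos = 0
    while pos < len(spec):
        m = PAIR.match(spec, pos)
        if not m:
            raise ValueError(f"malformed replacement list at offset {pos}")
        pairs.append((m.group(1), m.group(2)))
        pos = m.end()
    return pairs

def rewrite_uri(uri, spec):
    """Apply each regular-expression replacement to the URI, in list order."""
    for pattern, replacement in parse_replacements(spec):
        # The lambda keeps the replacement literal (assumption of this sketch).
        uri = re.sub(pattern, lambda _m: replacement, uri)
    return uri

print(rewrite_uri("output/URI/doc.json", "output/URI,'substitute/URI'"))
# substitute/URI/doc.json
print(rewrite_uri("/Users/me/input/customers/1.json", "^/Users/me/input,'/ingest'"))
# /ingest/customers/1.json
```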
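
The scoring and thresholds descriptions in the mastering step can be made concrete with a rough Python sketch. It does not reproduce the Smart Mastering code: the function names are hypothetical, the comparison algorithms themselves are left out (the caller passes the set of properties already judged to match), and choosing the highest exceeded threshold is an assumption of this sketch.

```python
def match_score(scoring, matched):
    """Score a pair of records, given the property aliases judged to match."""
    score = 0
    # add and expand rules: the weight is added when the property matches.
    for rule in scoring.get("add", []) + scoring.get("expand", []):
        if rule["propertyName"] in matched:
            score += int(rule["weight"])
    # reduce rules: the weight is subtracted when all listed properties match.
    for rule in scoring.get("reduce", []):
        if all(p in matched for p in rule["allMatch"]["property"]):
            score -= int(rule["weight"])
    return score

def pick_threshold(thresholds, score):
    """Return the highest threshold that the score exceeds, or None."""
    exceeded = [t for t in thresholds if score > int(t["above"])]
    return max(exceeded, key=lambda t: int(t["above"]), default=None)

# Values taken from the mastering step in the listing above.
scoring = {
    "add": [{"propertyName": "ssn", "weight": "50"}],
    "expand": [
        {"propertyName": "first-name", "algorithmRef": "thesaurus", "weight": "6"},
        {"propertyName": "last-name", "algorithmRef": "dbl-metaphone", "weight": "8"},
    ],
    "reduce": [
        {"algorithmRef": "std-reduce", "weight": "4",
         "allMatch": {"property": ["last-name", "addr1"]}},
    ],
}
thresholds = [
    {"above": "30", "label": "Possible Match"},
    {"above": "50", "label": "Likely Match", "action": "notify"},
    {"above": "68", "label": "Definitive Match", "action": "merge"},
    {"above": "75", "label": "Custom Match", "action": "my-custom-action"},
]

score = match_score(scoring, {"ssn", "last-name", "addr1"})
print(score)                                       # 54  (50 + 8 - 4)
print(pick_threshold(thresholds, score)["label"])  # Likely Match
```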
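
The default collection tags listed for onMerge, onArchive, onNoMatch, and onNotification, together with the remove, add, and set settings, amount to simple set operations. The Python sketch below illustrates them under stated assumptions: the function names are hypothetical, custom functions are not modeled, and applying remove before add is an assumption of this sketch.

```python
def default_tags(event, original_tags):
    """Default collection tags for an event, as described in the listing above."""
    tags = set(original_tags)  # for onMerge, pass the union of the originals' tags
    if event == "onMerge":
        return tags | {"mdm-content", "mdm-merged"}
    if event == "onArchive":
        return (tags - {"mdm-content"}) | {"mdm-archived"}
    if event == "onNoMatch":
        return tags | {"mdm-content"}
    if event == "onNotification":
        return {"mdm-notification"}
    raise ValueError(f"unknown event: {event}")

def apply_rule(defaults, rule):
    """Apply a set, remove, or add rule to the default tags."""
    if "set" in rule:
        # set replaces the defaults entirely.
        return set(rule["set"]["collection"])
    tags = set(defaults)
    tags -= set(rule.get("remove", {}).get("collection", []))
    tags |= set(rule.get("add", {}).get("collection", []))
    return tags

on_archive = {"remove": {"collection": ["Entity"]},
              "add": {"collection": ["custom-archived"]}}
defaults = default_tags("onArchive", ["Entity", "customer", "mdm-content"])
print(sorted(apply_rule(defaults, on_archive)))
# ['custom-archived', 'customer', 'mdm-archived']

on_notification = {"set": {"collection": ["notification"]}}
print(sorted(apply_rule(default_tags("onNotification", []), on_notification)))
# ['notification']
```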