Flow Definition File

Overview

The default flow definition file generated by the Gradle task hubCreateFlow includes flow settings, as well as example steps.

You must customize the example steps before running the flow. You can delete the steps you don't need, and you can duplicate the steps if you need multiple steps of the same type. However, you must assign a unique sequence number for each step.
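
Because the sequence numbers are the keys of the `steps` object, a duplicated number produces a duplicate JSON key, which many parsers silently collapse into one entry. The following is a minimal sketch, not part of Data Hub, of how you might detect such duplicates before running a flow. The function name `find_duplicate_keys` and the sample flow text are hypothetical.

```python
import json


def find_duplicate_keys(json_text):
    """Return the sorted keys that occur more than once within a single JSON object."""
    duplicates = set()

    def check_pairs(pairs):
        # pairs holds every (key, value) of one JSON object, duplicates included.
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.add(key)
            seen.add(key)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=check_pairs)
    return sorted(duplicates)


# Hypothetical flow text in which two steps share the sequence number "2".
flow_text = """
{
  "name": "MyFlow",
  "steps": {
    "1": {"name": "MyIngestionStep"},
    "2": {"name": "MyMappingStep"},
    "2": {"name": "MyMatchingStep"}
  }
}
"""

print(find_duplicate_keys(flow_text))  # ['2']
```

The check applies to every object in the file, so it also reports duplicated option names, not only duplicated step numbers.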

To learn about flows, see About Flows.

To learn about step types, see About Steps.

Flow Settings

   {
    "name" : "MyFlow",
    "description" : "This flow contains examples of steps plus additional settings.",
    "batchSize" : 100,
    "threadCount" : 4,
    "stopOnError" : false,
    "options" : {
      "sourceQuery" : null,
      "provenanceGranularityLevel" : "fine"
    },
    "steps" : {
      "1" : { ... },
      "2" : { ... },
      "3" : { ... },
      "4" : { ... }
    }
  }


Field	Description
name	The human-friendly name of the flow.
description	(Optional) A description of the flow.
batchSize	The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts.
threadCount	The number of threads to use when running a flow.
stopOnError	If `true` and an error is encountered, the flow run ends, the rest of the source data is ignored, and the remaining steps are not performed. Information about the failure is logged in the job document. Default is `false`.
options	Key-value pairs to pass as parameters to custom modules in every step in the flow.
options » sourceCollection	The collection tag to use to search for the records to process in this step.
options » sourceQuery	The CTS query to use to select the source data to process. To filter by a collection tag, use `cts.collectionQuery('my-collection-name')`. Example: `"sourceQuery" : "cts.collectionQuery('default-ingestion')"`. See CTS Query.
options » provenanceGranularityLevel	The granularity of the provenance tracking information: `coarse` (default) to store document-level provenance information only, `fine` to store document-level and property-level provenance information, or `off` to disable provenance tracking in future job runs. Applies only to mapping, matching, merging, mastering, and custom steps.
steps	The steps to be run within the flow. Each step in the flow has a sequence number and a customized copy of the step definition. The step definition differs based on the step type (ingestion, mapping, matching, merging, mastering, or custom).

Ingestion Step Settings

   "1" : {
    "name" : "MyIngestionStep",
    "description" : "This is my ingestion step.",
    "stepDefinitionName" : "default-ingestion",
    "stepDefinitionType" : "INGESTION",
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "stepUpdate" : false,
      "acceptsBatch" : false,
      "targetDatabase" : "data-hub-STAGING",
      "collections" : [ "default-ingestion" ],
      "additionalCollections" : [],
      "outputFormat" : "json"
    },
    "fileLocations" : {
      "inputFilePath" : "path/to/folder",
      "inputFileType" : "json",
      "outputURIReplacement" : "output/URI,'substitute/URI'"
    }
  },


Field	Description
name	The name of the step instance.
description	A description of the step.
stepDefinitionName	The name of a step definition, which could be one of the default step definitions or a custom one. Custom step definitions can be created using QuickStart or using the Gradle task hubCreateStepDefinition. Tip: If you are customizing a default step type (ingestion, mapping, or mastering), leave the value as `default-ingestion`, `default-mapping`, or `default-mastering`.
stepDefinitionType	The type of the step definition: INGESTION, MAPPING, MATCHING, MERGING, MASTERING, or CUSTOM.
batchSize	The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
threadCount	The number of threads to use when running a flow. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
customHook	Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module	The path to your custom hook module.
customHook » parameters	Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user	The user account to use to run the module. Default is the user running the flow; e.g., `flow-operator`.
customHook » runBefore	For a pre-step hook, set to `true`. For a post-step hook, set to `false`.
options » stepUpdate	If `true`, custom modules can make changes directly to records in the database (inserting, deleting, or locking); otherwise, custom modules can make changes indirectly by passing content objects to Data Hub APIs. Direct changes to the database are rarely needed. For a combined mastering step, the default is `true`. For all other types of steps, the default is `false`.
options » acceptsBatch	If `true`, all the records in the batch are processed within a single step run; otherwise, the step is restarted and run for each record in the batch.
options » targetDatabase	The STAGING database in which to store the ingested data. Default is `data-hub-STAGING`.
options » collections	The collection tags to assign to the resulting records.
options » additionalCollections	The collection tags to assign to the resulting records, in addition to the default collections.
options » outputFormat	The format of the processed record: Text, JSON, XML, or Binary.
fileLocations » inputFilePath	The location of your source files.
fileLocations » inputFileType	The format of your source files: Text, JSON, XML, Binary, or Delimited Text.
fileLocations » outputURIReplacement	A comma-separated list of replacements used to customize the URIs of the ingested records. The list consists of regular expression patterns and their replacement strings in the format `pattern,'string',pattern,'string',...`. The replacement strings must be enclosed in single quotes.
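
The `pattern,'string',...` format of `outputURIReplacement` can be illustrated with a short sketch. This is not the Data Hub implementation. It assumes that patterns contain no commas, that replacement strings contain no single quotes, and that the pairs are applied in order. It also uses Python's regular expression dialect, which may differ from the one Data Hub uses. The function names are hypothetical.

```python
import re

# One pair: a pattern without commas, then a replacement string in single quotes.
PAIR = re.compile(r"([^,]+),'([^']*)'")


def parse_uri_replacements(spec):
    """Split a spec such as "a,'b',c,'d'" into [("a", "b"), ("c", "d")]."""
    return PAIR.findall(spec)


def apply_uri_replacements(uri, spec):
    """Apply each (pattern, replacement) pair to the URI, in order (an assumption)."""
    for pattern, replacement in parse_uri_replacements(spec):
        uri = re.sub(pattern, replacement, uri)
    return uri


print(apply_uri_replacements("/output/URI/customers.json",
                             "output/URI,'substitute/URI'"))
# /substitute/URI/customers.json
```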

Mapping Step Settings

   "2" : {
    "name" : "MyMappingStep",
    "description" : "This is my mapping step.",
    "stepDefinitionName" : "default-mapping",
    "stepDefinitionType" : "MAPPING",
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "stepUpdate" : false,
      "acceptsBatch" : false,
      "sourceDatabase" : "data-hub-STAGING",
      "sourceCollection" : "MyIngestionStep",
      "sourceQuery" : "cts.collectionQuery('my-custom-query')",
      "constrainSourceQueryToJob" : false,
      "targetEntity" : "MyEntity",
      "validateEntity" : false,
      "targetDatabase" : "data-hub-FINAL",
      "collections" : [ "default-mapping" ],
      "additionalCollections" : [],
      "outputFormat" : "json",
      "provenanceGranularityLevel" : "fine",
      "mapping" : {
        "name" : "mapping-name",
        "version" : "1"
      }
    }
  },


Field	Description
name	The name of the step instance.
description	A description of the step.
stepDefinitionName	The name of a step definition, which could be one of the default step definitions or a custom one. Custom step definitions can be created using QuickStart or using the Gradle task hubCreateStepDefinition. Tip: If you are customizing a default step type (ingestion, mapping, or mastering), leave the value as `default-ingestion`, `default-mapping`, or `default-mastering`.
stepDefinitionType	The type of the step definition: INGESTION, MAPPING, MATCHING, MERGING, MASTERING, or CUSTOM.
batchSize	The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
threadCount	The number of threads to use when running a flow. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
customHook	Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module	The path to your custom hook module.
customHook » parameters	Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user	The user account to use to run the module. Default is the user running the flow; e.g., `flow-operator`.
customHook » runBefore	For a pre-step hook, set to `true`. For a post-step hook, set to `false`.
options » stepUpdate	If `true`, custom modules can make changes directly to records in the database (inserting, deleting, or locking); otherwise, custom modules can make changes indirectly by passing content objects to Data Hub APIs. Direct changes to the database are rarely needed. For a combined mastering step, the default is `true`. For all other types of steps, the default is `false`.
options » acceptsBatch	If `true`, all the records in the batch are processed within a single step run; otherwise, the step is restarted and run for each record in the batch.
options » sourceDatabase	The STAGING database that contains the ingested data. Default is `data-hub-STAGING`.
options » sourceCollection	The collection tag to use to search for the records to process in this step.
options » sourceQuery	The CTS query to use to select the source data to process. To filter by a collection tag, use `cts.collectionQuery('my-collection-name')`. Example: `"sourceQuery" : "cts.collectionQuery('default-ingestion')"`. See CTS Query.
options » constrainSourceQueryToJob	If `true`, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if `sourceQuery` is `cts.collectionQuery('example')` and `constrainSourceQueryToJob` is `true`, the query searches for documents that are in the `example` collection and were created or modified in the current job. Default is `false`.
options » targetEntity	The entity to map against the source data.
options » validateEntity	Indicates whether to validate the mapped entity instance against the schema document based on the entity model, and what action to take. Set to `false` to skip validation. Set to `accept` to write the mapped entity instance to the database regardless of the validation result. Set to `reject` to skip writing the mapped entity instance to the database if validation fails. Default is `false`. See about-mapping.html#about-mapping__validation-of-mapped-expressions.
options » targetDatabase	The FINAL database in which to store the mapped data. Default is `data-hub-FINAL`.
options » collections	The collection tags to assign to the resulting records.
options » additionalCollections	The collection tags to assign to the resulting records, in addition to the default collections.
options » outputFormat	The format of the processed record: Text, JSON, XML, or Binary.
options » provenanceGranularityLevel	The granularity of the provenance tracking information: `coarse` (default) to store document-level provenance information only, `fine` to store document-level and property-level provenance information, or `off` to disable provenance tracking in future job runs.
options » mapping	How to map the properties of the `targetEntity` to the fields of the source data.
options » mapping » name	The name of your mapping that is defined in your-project-root/mappings/your-mapping-name/mapping.version.json.
options » mapping » version	The version of the mapping to use. Your mapping must be defined in your-project-root/mappings/your-mapping-name/mapping.version.json.
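
The fallback rule stated for `batchSize` and `threadCount` (a step-level value that is undefined, `0`, or null defers to the flow-level value) can be sketched as follows. This is an illustration of the documented rule only. The function name `effective_setting` and the sample values are hypothetical.

```python
def effective_setting(flow, step, key):
    """Return the step-level value, or the flow-level value if the step's is missing, 0, or null."""
    value = step.get(key)
    if value is None or value == 0:
        return flow.get(key)
    return value


# Hypothetical flow: step "1" sets batchSize to 0, step "2" omits threadCount.
flow = {
    "batchSize": 100,
    "threadCount": 4,
    "steps": {
        "1": {"batchSize": 0, "threadCount": 8},
        "2": {"batchSize": 50},
    },
}

for number, step in flow["steps"].items():
    print(number,
          effective_setting(flow, step, "batchSize"),
          effective_setting(flow, step, "threadCount"))
# 1 100 8
# 2 50 4
```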

Matching Step Settings

   "3" : {
    "name" : "MyMatchingStep",
    "description" : "This is my matching step.",
    "stepDefinitionName" : "default-matching",
    "stepDefinitionType" : "MATCHING",
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "stepUpdate" : false,
      "acceptsBatch" : false,
      "sourceDatabase" : "data-hub-FINAL",
      "sourceCollection" : "MyMappingStep",
      "sourceQuery" : "cts.collectionQuery('my-custom-query')",
      "constrainSourceQueryToJob" : false,
      "targetEntity" : "MyEntity",
      "targetDatabase" : "data-hub-FINAL",
      "collections" : [ "MyMatchingStep", "MyPersonEntity" ],
      "additionalCollections" : [],
      "provenanceGranularityLevel" : "fine",
      "matchOptions" : { ... }
    }
  },


Field	Description
name	The name of the step instance.
description	A description of the step.
stepDefinitionName	The name of a step definition, which could be one of the default step definitions or a custom one. Custom step definitions can be created using QuickStart or using the Gradle task hubCreateStepDefinition. Tip: If you are customizing a default step type (ingestion, mapping, or mastering), leave the value as `default-ingestion`, `default-mapping`, or `default-mastering`.
stepDefinitionType	The type of the step definition: INGESTION, MAPPING, MATCHING, MERGING, MASTERING, or CUSTOM.
batchSize	The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
threadCount	The number of threads to use when running a flow. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
customHook	Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module	The path to your custom hook module.
customHook » parameters	Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user	The user account to use to run the module. Default is the user running the flow; e.g., `flow-operator`.
customHook » runBefore	For a pre-step hook, set to `true`. For a post-step hook, set to `false`.
options » stepUpdate	If `true`, custom modules can make changes directly to records in the database (inserting, deleting, or locking); otherwise, custom modules can make changes indirectly by passing content objects to Data Hub APIs. Direct changes to the database are rarely needed. For a combined mastering step, the default is `true`. For all other types of steps, the default is `false`.
options » acceptsBatch	If `true`, all the records in the batch are processed within a single step run; otherwise, the step is restarted and run for each record in the batch.
options » sourceDatabase	The FINAL database that contains the mapped data. Default is `data-hub-FINAL`.
options » sourceCollection	The collection tag to use to search for the records to process in this step.
options » sourceQuery	The CTS query to use to select the source data to process. To filter by a collection tag, use `cts.collectionQuery('my-collection-name')`. Example: `"sourceQuery" : "cts.collectionQuery('default-ingestion')"`. See CTS Query.
options » constrainSourceQueryToJob	If `true`, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if `sourceQuery` is `cts.collectionQuery('example')` and `constrainSourceQueryToJob` is `true`, the query searches for documents that are in the `example` collection and were created or modified in the current job. Default is `false`.
options » targetEntity	The entity to map against the source data.
options » targetDatabase	The same database specified in `sourceDatabase`. Default is `data-hub-FINAL`. Note: For split mastering (a matching step and a merging step), the source database and the target database of both steps must all be the same.
options » collections	The collection tags to assign to the resulting records.
options » additionalCollections	The collection tags to assign to the resulting records, in addition to the default collections.
options » provenanceGranularityLevel	The granularity of the provenance tracking information: `coarse` (default) to store document-level provenance information only, `fine` to store document-level and property-level provenance information, or `off` to disable provenance tracking in future job runs.
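
The effect of `constrainSourceQueryToJob` on a collection-based `sourceQuery` can be illustrated with an in-memory sketch. It does not use MarkLogic APIs. The record layout, including the `jobs` field that stands in for however Data Hub tracks which jobs created or modified a document, is an assumption made for illustration.

```python
def select_source_records(records, collection, current_job_id, constrain_to_job=False):
    """Return the URIs in the collection, optionally limited to records touched by the current job."""
    selected = []
    for record in records:
        if collection not in record["collections"]:
            continue
        # Hypothetical bookkeeping: "jobs" lists the jobs that created or modified the record.
        if constrain_to_job and current_job_id not in record["jobs"]:
            continue
        selected.append(record["uri"])
    return selected


records = [
    {"uri": "/a.json", "collections": {"example"}, "jobs": {"job-1"}},
    {"uri": "/b.json", "collections": {"example"}, "jobs": {"job-2"}},
    {"uri": "/c.json", "collections": {"other"}, "jobs": {"job-2"}},
]

print(select_source_records(records, "example", "job-2"))        # ['/a.json', '/b.json']
print(select_source_records(records, "example", "job-2", True))  # ['/b.json']
```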

Matching Options

   "matchOptions" : {
    "dataFormat" : "json",
    "propertyDefs" : {
      "property" : [
        {
          "name" : "ssn",
          "namespace" : "",
          "localname" : "IdentificationID"
        }
      ]
    },
    "algorithms" : {
      "algorithm" : [
        {
          "name" : "std-reduce",
          "function" : "standard-reduction",
          "namespace" : "",
          "at" : ""
        }
      ]
    },
    "collections" : {
      "content" : [ "my-content-collection" ]
    },
    "scoring" : {
      "add" : [
        {
          "propertyName" : "ssn",
          "weight" : "50"
        }
      ],
      "expand" : [
        {
          "propertyName" : "first-name",
          "algorithmRef" : "thesaurus",
          "weight" : "6",
          "thesaurus" : "/mdm/config/thesauri/first-name-synonyms.xml"
        },
        {
          "propertyName" : "last-name",
          "algorithmRef" : "dbl-metaphone",
          "weight" : "8",
          "dictionary" : "name-dictionary.xml",
          "distanceThreshold" : "50"
        }
      ],
      "reduce" : [
        {
          "algorithmRef" : "std-reduce",
          "weight" : "4",
          "allMatch" : { "property" : ["last-name", "addr1"] }
        }
      ]
    },
    "actions" : {
      "action" : [
        {
          "name" : "my-custom-action",
          "function" : "custom-action",
          "namespace" : "http://marklogic.com/smart-mastering/action",
          "at" : "/custom-action.xqy"
        }
      ]
    },
    "thresholds" : {
      "threshold" : [
        { "above" : "30", "label" : "Possible Match" },
        { "above" : "50", "label" : "Likely Match", "action" : "notify" },
        { "above" : "68", "label" : "Definitive Match", "action" : "merge" },
        { "above" : "75", "label" : "Custom Match", "action" : "my-custom-action" }
      ]
    },
    "tuning" : {
      "maxScan" : 200
    }
  },


Field	Description
matchOptions	The settings used to find potential matches. See Smart Mastering Core - Matching Options.
dataFormat	The format of your source records: Text, JSON, XML, or Binary.
propertyDefs	Definitions of properties to compare.
propertyDefs » property » name	The alias for this property definition.
propertyDefs » property » namespace	(Optional) The namespace that encompasses the XML element or JSON property (record field) to compare.
propertyDefs » property » localname	The name of the XML element or JSON property (record field) to compare.
algorithms	Definitions of algorithms that compare values. Each algorithm corresponds to a match type (Exact, Synonym, Double Metaphone, Reduce, Zip, and Custom). The default algorithm is that of the Exact match type, which determines whether two values are equal.
algorithms » algorithm » name	The alias for this algorithm definition.
algorithms » algorithm » function	The function to run if this algorithm definition is selected.
algorithms » algorithm » namespace	(Optional) The namespace of the module that contains the function.
algorithms » algorithm » at	The path to the module that contains the function.
collections	A set of collections that overrides the default collection used to determine the scope of the dataset being compared. If multiple content elements are specified, the dataset is restricted to an intersection of those collections.
collections » content	One or more collections used to determine the dataset to be compared.
scoring	Rules (add, expand, reduce) that define how the comparison is scored based on assigned weights. The maximum possible score is the sum of all the assigned `weight` values. The match process uses the simple scoring option, with each property's weight controlling how much influence that property has. See Relevance Scores.
scoring » add	Properties whose values are compared directly between records. If the values match exactly, the assigned weight is added to the score.
scoring » add » propertyName	The alias of a property definition under the matchOptions/propertyDefs node of this step.
scoring » add » weight	The weight added to the score if the property values of two records match exactly.
scoring » expand	Properties whose values are compared using a different algorithm to determine a match. For example, the property values can be considered a positive match if one is a synonym of the other or if both values phonetically sound alike. If so, the assigned weight is added to the score.
scoring » expand » propertyName	The alias of a property definition under the matchOptions/propertyDefs node of this step.
scoring » expand » algorithmRef	The alias of an algorithm definition under the matchOptions/algorithms node of this step.
scoring » expand » weight	The weight added to the score if the property values of two records are considered a match based on the selected algorithm.
scoring » expand » thesaurus	The location of the thesaurus that is stored in a MarkLogic Server database and used to determine synonyms. See also: Managing Thesaurus Documents.
scoring » expand » dictionary	The location of the phonetic dictionary that is stored in a database and used when comparing words phonetically. See also: Custom Dictionaries.
scoring » expand » distanceThreshold	The threshold below which the phonetic difference (distance) between two strings is considered insignificant; i.e., the strings are similar to each other.
scoring » reduce	Combinations of properties whose matching values might be a false match. For example, two members of the same family with the same last name and address might be misinterpreted as being the same person. In this case, the score is reduced by the assigned weight to give the match less importance.
scoring » reduce » algorithmRef	The alias of an algorithm definition under the matchOptions/algorithms node of this step.
scoring » reduce » weight	A positive integer that specifies how much to subtract from the match score.
scoring » reduce » allMatch	The combination of properties that might falsely indicate a match if the values of these properties are equal between two records.
actions	Custom actions that can be performed when a threshold is reached. See Custom Match Actions.
actions » action	The custom action to perform when a threshold is reached.
actions » action » name	The alias for this action definition.
actions » action » function	The function to run if this action definition is selected.
actions » action » namespace	(Optional) The namespace of the module that contains the function.
actions » action » at	The path to the module that contains the function.
thresholds	Score thresholds that trigger an action.
thresholds » threshold	A score threshold definition, including the action to perform if the threshold is exceeded.
thresholds » threshold » above	The score threshold. If the match score exceeds this value, the action is performed.
thresholds » threshold » label	The alias for this threshold definition.
thresholds » threshold » action	The action to perform if the score is above the threshold. Possible values: `notify`, which creates a notification record in the FINAL database with information about the match; `merge`, which creates a new record with the combined properties of the original records that match, then archives the old records; or the alias of an action definition under the matchOptions/actions node of this step.
tuning » maxScan	The maximum number of highest-scoring potential matches that will be considered for merging.
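
The way the add, expand, and reduce rules combine into a score, and how that score relates to the thresholds, can be sketched as follows. This is not the Smart Mastering implementation. It takes the results of the property comparisons as inputs rather than running any algorithm, and it assumes that when several thresholds are exceeded, the highest one applies, which the descriptions above do not state. The function names are hypothetical. The weights and thresholds are taken from the sample `matchOptions`.

```python
def match_score(scoring, exact_matches, expanded_matches):
    """Sum the add and expand weights that apply, then subtract the reduce weights that apply."""
    score = 0
    for rule in scoring.get("add", []):
        if rule["propertyName"] in exact_matches:
            score += int(rule["weight"])
    for rule in scoring.get("expand", []):
        if rule["propertyName"] in expanded_matches:
            score += int(rule["weight"])
    for rule in scoring.get("reduce", []):
        # A reduce rule applies only if every property listed in allMatch is equal.
        if all(name in exact_matches for name in rule["allMatch"]["property"]):
            score -= int(rule["weight"])
    return score


def reached_threshold(thresholds, score):
    """Return the highest threshold whose "above" value the score exceeds, or None (assumption)."""
    exceeded = [t for t in thresholds if score > int(t["above"])]
    return max(exceeded, key=lambda t: int(t["above"]), default=None)


scoring = {
    "add": [{"propertyName": "ssn", "weight": "50"}],
    "expand": [
        {"propertyName": "first-name", "algorithmRef": "thesaurus", "weight": "6"},
        {"propertyName": "last-name", "algorithmRef": "dbl-metaphone", "weight": "8"},
    ],
    "reduce": [
        {"algorithmRef": "std-reduce", "weight": "4",
         "allMatch": {"property": ["last-name", "addr1"]}},
    ],
}
thresholds = [
    {"above": "30", "label": "Possible Match"},
    {"above": "50", "label": "Likely Match", "action": "notify"},
    {"above": "68", "label": "Definitive Match", "action": "merge"},
    {"above": "75", "label": "Custom Match", "action": "my-custom-action"},
]

# Hypothetical comparison results for one pair of records.
score = match_score(scoring,
                    exact_matches={"ssn", "last-name", "addr1"},
                    expanded_matches={"first-name", "last-name"})
print(score)                                          # 60
print(reached_threshold(thresholds, score)["label"])  # Likely Match
```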

Merging Step Settings

   "4" : {
    "name" : "MyMergingStep",
    "description" : "This is my merging step.",
    "stepDefinitionName" : "default-merging",
    "stepDefinitionType" : "MERGING",
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "stepUpdate" : false,
      "acceptsBatch" : false,
      "sourceDatabase" : "data-hub-FINAL",
      "sourceCollection" : "MyMatchingStep",
      "sourceQuery" : "cts.collectionQuery('my-custom-query')",
      "constrainSourceQueryToJob" : false,
      "targetEntity" : "MyEntity",
      "targetDatabase" : "data-hub-FINAL",
      "collections" : [ "MyMergingStep", "MyPersonEntity" ],
      "additionalCollections" : [],
      "outputFormat" : "json",
      "provenanceGranularityLevel" : "fine",
      "mergeOptions" : { ... }
    }
  },


Field	Description
name	The name of the step instance.
description	A description of the step.
stepDefinitionName	The name of a step definition, which could be one of the default step definitions or a custom one. Custom step definitions can be created using QuickStart or using the Gradle task hubCreateStepDefinition. Tip: If you are customizing a default step type (ingestion, mapping, or mastering), leave the value as `default-ingestion`, `default-mapping`, or `default-mastering`.
stepDefinitionType	The type of the step definition: INGESTION, MAPPING, MATCHING, MERGING, MASTERING, or CUSTOM.
batchSize	The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
threadCount	The number of threads to use when running a flow. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
customHook	Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module	The path to your custom hook module.
customHook » parameters	Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user	The user account to use to run the module. Default is the user running the flow; e.g., `flow-operator`.
customHook » runBefore	For a pre-step hook, set to `true`. For a post-step hook, set to `false`.
options » stepUpdate	If `true`, custom modules can make changes directly to records in the database (inserting, deleting, or locking); otherwise, custom modules can make changes indirectly by passing content objects to Data Hub APIs. Direct changes to the database are rarely needed. For a combined mastering step, the default is `true`. For all other types of steps, the default is `false`.
options » acceptsBatch	If `true`, all the records in the batch are processed within a single step run; otherwise, the step is restarted and run for each record in the batch.
options » sourceDatabase	The same source database specified in the matching step. Default is `data-hub-FINAL`.
options » sourceCollection	The collection tag to use to search for the records to process in this step.
options » sourceQuery	The CTS query to use to select the source data to process. To filter by a collection tag, use `cts.collectionQuery('my-collection-name')`. Example: `"sourceQuery" : "cts.collectionQuery('default-ingestion')"`. See CTS Query.
options » constrainSourceQueryToJob	If `true`, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if `sourceQuery` is `cts.collectionQuery('example')` and `constrainSourceQueryToJob` is `true`, the query searches for documents that are in the `example` collection and were created or modified in the current job. Default is `false`.
options » targetEntity	The entity to map against the source data.
options » targetDatabase	The same database specified in `sourceDatabase`. Default is `data-hub-FINAL`. Note: For split mastering (a matching step and a merging step), the source database and the target database of both steps must all be the same.
options » collections	The collection tags to assign to the resulting records.
options » additionalCollections	The collection tags to assign to the resulting records, in addition to the default collections.
options » outputFormat	The format of the processed record: Text, JSON, XML, or Binary.
options » provenanceGranularityLevel	The granularity of the provenance tracking information: `coarse` (default) to store document-level provenance information only, `fine` to store document-level and property-level provenance information, or `off` to disable provenance tracking in future job runs.
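
The split-mastering note on `targetDatabase` can be checked mechanically. The sketch below reads the note as requiring a single database across the `sourceDatabase` and `targetDatabase` of every matching and merging step, and it falls back to the documented default `data-hub-FINAL` when a value is omitted. It is not a Data Hub utility, and the function names are hypothetical.

```python
DEFAULT_DATABASE = "data-hub-FINAL"  # documented default for matching and merging steps


def split_mastering_databases(steps):
    """Collect every source and target database used by MATCHING and MERGING steps."""
    databases = set()
    for step in steps.values():
        step_type = (step.get("stepDefinitionType") or "").upper()
        if step_type in ("MATCHING", "MERGING"):
            options = step.get("options", {})
            databases.add(options.get("sourceDatabase", DEFAULT_DATABASE))
            databases.add(options.get("targetDatabase", DEFAULT_DATABASE))
    return databases


def databases_are_consistent(steps):
    """True if the matching and merging steps all read from and write to one database."""
    return len(split_mastering_databases(steps)) <= 1


# Hypothetical steps: the merging step writes to a different database.
steps = {
    "3": {"stepDefinitionType": "MATCHING",
          "options": {"sourceDatabase": "data-hub-FINAL", "targetDatabase": "data-hub-FINAL"}},
    "4": {"stepDefinitionType": "MERGING",
          "options": {"sourceDatabase": "data-hub-FINAL", "targetDatabase": "data-hub-STAGING"}},
}

print(databases_are_consistent(steps))  # False
```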

Merging Options

   "mergeOptions" : {
    "matchOptions" : "mlw-match",
    "propertyDefs" : {
      "properties" : [
        {
          "name" : "ssn",
          "localname" : "IdentificationID",
          "namespace" : ""
        },
        {
          "name" : "shallow",
          "path" : "/es:envelope/es:headers/shallow"
        }
      ],
      "namespaces" : {
        "has" : "has",
        "m" : "http://marklogic.com/smart-mastering/merging",
        "es" : "http://marklogic.com/entity-services"
      }
    },
    "algorithms" : {
      "stdAlgorithm" : {
        "timestamp" : { "path" : "/es:envelope/es:headers/sm:sources/sm:source/sm:dateTime" },
        "namespaces" : {
          "sm" : "http://marklogic.com/smart-mastering",
          "es" : "http://marklogic.com/entity-services"
        }
      },
      "custom" : [
        {
          "name" : "customMerge",
          "function" : "doCustomMerge",
          "namespace" : "http://marklogic.com/smart-mastering/merging",
          "at" : "/custom-merge-xqy.xqy"
        }
      ],
      "collections" : {
        "onMerge" : {
          "function" : "collections",
          "namespace" : "test/merge-collection-algorithm",
          "at" : "/test/suites/customizing-collections/lib/merged-collections.xqy"
        },
        "onArchive" : {
          "remove" : { "collection" : ["Entity"] },
          "add" : { "collection" : ["custom-archived"] }
        },
        "onNoMatch" : {
          "function" : "noMatchCollections",
          "namespace" : "",
          "at" : "/test/suites/customizing-collections/lib/noMatchCollections.sjs"
        },
        "onNotification" : {
          "set" : { "collection" : ["notification"] }
        }
      }
    },
    "mergeStrategies" : [
      {
        "name" : "crm-source-weight",
        "algorithmRef" : "standard",
        "sourceWeights" : [
          {
            "source" : {
              "name" : "CRM",
              "weight" : "10"
            }
          }
        ]
      },
      {
        "name" : "length-weight",
        "algorithmRef" : "standard",
        "maxValues" : "1",
        "length" : { "weight" : "10" }
      }
    ],
    "merging" : [
      {
        "propertyName" : "ssn",
        "maxValues" : "1",
        "maxSources" : "1",
        "strategy" : "crm-source-weight"
      },
      {
        "propertyName" : "name",
        "maxValues" : "1",
        "doubleMetaphone" : {
          "distanceThreshold" : "50"
        },
        "synonymsSupport" : "true",
        "thesaurus" : "/mdm/config/thesauri/first-name-synonyms.xml",
        "length" : { "weight" : "8" }
      },
      {
        "propertyName" : "dob",
        "maxValues" : "1",
        "algorithmRef" : "standard",
        "sourceWeights" : {
          "source" : {
            "name" : "better-source",
            "weight" : "4"
          }
        }
      },
      {
        "default" : "true",
        "strategy" : "crm-source-weight"
      }
    ],
    "tripleMerge" : {
      "function" : "custom-trips",
      "namespace" : "http://marklogic.com/smart-mastering/merging",
      "at" : "/custom-triple-merge.xqy",
      "some-param" : 3
    }
  }


Field	Description
mergeOptions	The settings used to merge records that match. See Smart Mastering Core - Merging Options.
matchOptions	The name of a set of match options that were previously stored in the server. See Saving Options.
propertyDefs	Definitions of properties to merge.
propertyDefs » properties » name	The alias for this property definition.
propertyDefs » properties » localname	The name of the XML element or JSON property (record field) to merge.
propertyDefs » properties » namespace	(Optional) The namespace that encompasses the XML element or JSON property (record field) to merge.
propertyDefs » properties » path	The path to the headers or instance section of the records, where the properties to merge are defined. For the headers section, use `/es:envelope/es:headers` (XML) or `/envelope/headers` (JSON). For the instance section, use `/es:envelope/es:instance` (XML) or `/envelope/instance` (JSON). Note: Namespaces in the path must be defined in the propertyDefs/namespaces node.
propertyDefs » namespaces	Key-value pairs that assign aliases to namespaces. The key is the alias and the value is the full namespace.
algorithms	Definitions of algorithms that merge values.
algorithms » stdAlgorithm	The standard algorithm that implements the default merge behavior.
algorithms » stdAlgorithm » timestamp	The path to a timestamp field within the record. This field is used to determine which values to include in the merged property, based on their recency, up to the maximum number specified in the Max Values field in Merge Options (Standard) or in Merge Strategies. Namespaces used in the path must be defined within the record.
algorithms » stdAlgorithm » namespaces	(Optional) Key-value pairs that assign aliases to namespaces. The key is the alias and the value is the full namespace.
algorithms » custom	Definitions of custom algorithms that merge values.
algorithms » custom » name	The alias for this custom algorithm definition.
algorithms » custom » function	The custom merge function to run.
algorithms » custom » namespace	(Optional) The namespace of the module that contains the function.
algorithms » custom » at	The path to the module that contains the function.
algorithms » collections	Rules that specify how collection tags are managed when an event occurs.
algorithms » collections » onMerge	How collection tags are applied to the new record that is created when matching records are merged. The default set of collection tags consists of the union of the collection tags from the original records, plus `mdm-content` and `mdm-merged`.
algorithms » collections » onMerge » function	The function that manages collection tags when the event occurs.
algorithms » collections » onMerge » namespace	(Optional) The namespace of the module that contains the function.
algorithms » collections » onMerge » at	The path to the module that contains the function.
algorithms » collections » onArchive	How collection tags are applied to the original records after their content has been merged into a new record. The default set of collection tags consists of the collection tags from the original record, minus `mdm-content`, plus `mdm-archived`.
algorithms » collections » onArchive » remove	One or more collection tags to remove from the default union of tags.
algorithms » collections » onArchive » add	One or more collection tags to add to the default union of tags.
algorithms » collections » onNoMatch	How collection tags are applied to records that were not merged because no matches were found or because the total matching scores did not exceed the defined thresholds. The default set of collection tags consists of the collection tags from the original record, plus `mdm-content`.
algorithms » collections » onNoMatch » function	The function that manages collection tags when the event occurs.
algorithms » collections » onNoMatch » namespace	(Optional) The namespace of the module that contains the function.
algorithms » collections » onNoMatch » at	The path to the module that contains the function.
algorithms » collections » onNotification	How collection tags are applied to notification records. The default set of collection tags consists of `mdm-notification` only.
algorithms » collections » onNotification » set	One or more collection tags to replace the default union of tags.
mergeStrategies	Predefined configurations for merging.
mergeStrategies » name	The name for the strategy.
mergeStrategies » algorithmRef	The alias of an algorithm definition under the mergeOptions/algorithms node of this step.
mergeStrategies » sourceWeights	The list of data sources and the weights assigned to them. If the set of matching records comes from more sources than `maxSources`, the source weights are used to determine which records are included in the merge. Example: If `maxSources` is set to 1, only records from the highest weighted source are included in the merge.
mergeStrategies » sourceWeights » source » name	The name of the source, exactly as shown in the envelopes of the records under headers » sources.
mergeStrategies » sourceWeights » source » weight	The weight used to decide the priority of a source when merging.
mergeStrategies » maxValues	The maximum number of values to allow in the merged property. Default is 99.
mergeStrategies » length » weight	The weight assigned to the length of a string.
merging	Rules that specify how to merge records that match.
merging » propertyName	The alias of a property definition under the mergeOptions/propertyDefs node of this step.
merging » maxValues	The maximum number of values to allow in the merged property. Default is 99.
merging » maxSources	The maximum number of data sources from which to get values to merge. For example, to copy values from a single source, set `maxSources` to 1.
merging » strategy	The alias of a strategy definition under the mergeOptions/mergeStrategies node of this step.
merging » doubleMetaphone	If this setting is present, the Double Metaphone algorithm is used to determine the values to merge.
merging » doubleMetaphone » distanceThreshold	The threshold below which the phonetic difference (distance) between two strings is considered insignificant; i.e., the strings are similar to each other.
merging » synonymsSupport	If `true`, synonyms are included in the list of values to merge. Synonyms are determined using the specified thesaurus.
merging » thesaurus	The location of the thesaurus that is stored in a MarkLogic Server database and used to determine synonyms. See also: Managing Thesaurus Documents
merging » length » weight	The weight assigned to the length of a string.
merging » algorithmRef	The alias of an algorithm definition under the mergeOptions/algorithms node of this step.
merging » sourceWeights	The list of data sources and the weights assigned to them. If the set of matching records comes from more sources than `maxSources`, the source weights are used to determine which records are included in the merge. Example: If `maxSources` is set to 1, only records from the highest weighted source are included in the merge.
merging » sourceWeights » source » name	The name of the source, exactly as shown in the envelopes of the records under headers » sources.
merging » sourceWeights » source » weight	The weight used to decide the priority of a source when merging.
merging » default	If `true`, the specified strategy is the default. Important: If this setting is present, do not include a `propertyName` setting.
merging » strategy	The alias of a strategy definition under the mergeOptions/mergeStrategies node of this step.
tripleMerge	Definition of an algorithm that merges triples.
tripleMerge » function	The function that merges triples.
tripleMerge » namespace	The namespace of the module that contains the function.
tripleMerge » at	The path to the module that contains the function.
tripleMerge » some-param	Parameters, as key-value pairs, to pass to your triple merge function.
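
Several of the settings above are aliases that must resolve to a definition elsewhere in `mergeOptions`: `propertyName` to a property definition, `strategy` to a merge strategy, and `algorithmRef` to an algorithm. A minimal sketch in Python of checking those references before running a flow might look like the following. It is not part of Data Hub. The function name `find_dangling_refs` is hypothetical, the sample data is made up, and the sketch assumes that property definitions are listed under `propertyDefs.properties` and that `standard` is the alias of the standard algorithm, as the example above suggests.

```python
def find_dangling_refs(merge_options):
    """Return (rule index, field, alias) for each alias that resolves to nothing."""
    # Aliases declared under propertyDefs » properties » name.
    prop_defs = merge_options.get("propertyDefs", {}).get("properties", [])
    properties = {p["name"] for p in prop_defs}
    # Aliases declared under mergeStrategies » name.
    strategies = {s["name"] for s in merge_options.get("mergeStrategies", [])}
    # Assumption: "standard" names the standard algorithm; custom ones add their names.
    custom = merge_options.get("algorithms", {}).get("custom", [])
    algorithms = {"standard"} | {a["name"] for a in custom}

    known = {"propertyName": properties, "strategy": strategies, "algorithmRef": algorithms}
    problems = []
    for index, rule in enumerate(merge_options.get("merging", [])):
        for field, aliases in known.items():
            if field in rule and rule[field] not in aliases:
                problems.append((index, field, rule[field]))
    return problems


# Made-up sample: "dob" is referenced by a merging rule but never defined.
sample = {
    "propertyDefs": {"properties": [{"name": "ssn", "localname": "ssn"}]},
    "algorithms": {"custom": [{"name": "customMerge"}]},
    "mergeStrategies": [{"name": "crm-source-weight", "algorithmRef": "standard"}],
    "merging": [
        {"propertyName": "ssn", "strategy": "crm-source-weight"},
        {"propertyName": "dob", "algorithmRef": "standard"},
        {"default": "true", "strategy": "crm-source-weight"},
    ],
}

print(find_dangling_refs(sample))  # [(1, 'propertyName', 'dob')]
```

The default rule (the entry with `"default" : "true"`) has no `propertyName`, so only its `strategy` reference is checked.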

Mastering Step Settings

   "5" : {
    "name" : "MyMasteringStep",
    "description" : "This is my mastering step.",
    "stepDefinitionName" : "default-mastering",
    "stepDefinitionType" : "MASTERING",
    "batchSize" : 100,
    "threadCount" : 1,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "stepUpdate" : false,
      "acceptsBatch" : false,
      "sourceDatabase" : "data-hub-FINAL",
      "sourceCollection" : "MyMappingStep",
      "sourceQuery" : "cts.collectionQuery('my-custom-query')",
      "constrainSourceQueryToJob" : false,
      "targetEntity" : "MyEntity",
      "targetDatabase" : "data-hub-FINAL",
      "collections" : [ "default-mastering", "mastered" ],
      "additionalCollections" : [],
      "outputFormat" : "json",
      "provenanceGranularityLevel" : "fine",
      "matchOptions" : { ... },
      "mergeOptions" : { ... }
    }
  },


Field	Description
name	The name of the step instance.
description	A description of the step.
stepDefinitionName	The name of a step definition, which could be one of the default step definitions or a custom one. Custom step definitions can be created using QuickStart or using the Gradle task hubCreateStepDefinition. Tip: If you are customizing a default step type (ingestion, mapping, or mastering), leave the value as `default-ingestion`, `default-mapping`, or `default-mastering`.
stepDefinitionType	The type of the step definition: INGESTION, MAPPING, MATCHING, MERGING, MASTERING, or CUSTOM.
batchSize	The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
threadCount	The number of threads to use when running a flow. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
customHook	Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module	The path to your custom hook module.
customHook » parameters	Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user	The user account to use to run the module. Default is the user running the flow; e.g., `flow-operator`.
customHook » runBefore	For a pre-step hook, set to `true`. For a post-step hook, set to `false`.
options » stepUpdate	If `true`, custom modules can make changes directly to records in the database (inserting, deleting, or locking); otherwise, custom modules can make changes indirectly by passing content objects to Data Hub APIs. Direct changes to the database are rarely needed. For a combined mastering step, the default is `true`. For all other types of steps, the default is `false`.
options » acceptsBatch	If `true`, all the records in the batch are processed within a single step run; otherwise, the step is restarted and run for each record in the batch.
options » sourceDatabase	Choose the FINAL database where you stored mapped data. Default is `data-hub-FINAL`.
options » sourceCollection	The collection tag to use to search for the records to process in this step.
options » sourceQuery	The CTS query to use to select the source data to process. To filter by a collection tag, use `cts.collectionQuery('my-collection-name')`. Example: `"sourceQuery" : "cts.collectionQuery('default-ingestion')"` See CTS Query.
options » constrainSourceQueryToJob	If `true`, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if `sourceQuery` is `cts.collectionQuery('example')` and `constrainSourceQueryToJob` is `true`, the query searches for documents that are in the `example` collection and were created or modified in the current job. Default is `false`.
options » targetEntity	The entity to map against the source data.
options » targetDatabase	Choose the FINAL database where you want to store mastered data. Default is `data-hub-FINAL`. Note: For combined mastering (mastering step), the source database and the target database should be the same. If duplicates are found, the original records are archived and the merged version is added to the same database. If you want the target database to be different, you can create a custom step with a custom module to override the default behavior of the mastering step.
options » collections	The collection tags to assign to the resulting records.
options » additionalCollections	The collection tags to assign to the resulting records, in addition to the default collections.
options » outputFormat	The format of the processed record: Text, JSON, XML, or Binary.
options » provenanceGranularityLevel	The granularity of the provenance tracking information: `coarse` (default) to store document-level provenance information only, `fine` to store document-level and property-level provenance information, or `off` to disable provenance tracking in future job runs.
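
The fallback rule for `batchSize` and `threadCount` (a step value that is missing, `0`, or null defers to the flow settings) can be sketched in Python as follows. This is an illustration of the rule as described in the table, not Data Hub's actual implementation; the helper name `effective_setting` is hypothetical.

```python
def effective_setting(flow, step, key):
    """Resolve batchSize or threadCount for one step."""
    value = step.get(key)
    # Missing, null (None in Python), or 0 in the step settings
    # means "use the value from the flow settings".
    if value is None or value == 0:
        return flow.get(key)
    return value


flow = {"batchSize": 100, "threadCount": 4}
step = {"name": "MyMasteringStep", "batchSize": 0, "threadCount": 1}

print(effective_setting(flow, step, "batchSize"))    # 100: the step value is 0, so the flow value applies
print(effective_setting(flow, step, "threadCount"))  # 1: the step value overrides the flow value
```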

See Matching Options under the matching step section.

See Merging Options under the merging step section.

Custom Step Settings

   "9" : {
    "name" : "MyCustomOtherStep",
    "description" : "This is my custom-other step.",
    "stepDefinitionName" : "custom-step-def",
    "stepDefinitionType" : "CUSTOM",
    "batchSize" : 100,
    "threadCount" : 4,
    "customHook" : {
      "module" : "/custom-modules/your-step-type/your-hook-directory/your-hook-module-name.sjs",
      "parameters" : {},
      "user" : "flow-operator",
      "runBefore" : false
    },
    "options" : {
      "stepUpdate" : false,
      "acceptsBatch" : false,
      "sourceDatabase" : "data-hub-STAGING",
      "sourceCollection" : "my-collection-tag",
      "sourceQuery" : "cts.collectionQuery('my-custom-query')",
      "constrainSourceQueryToJob" : false,
      "targetEntity" : "MyEntity",
      "targetDatabase" : "data-hub-FINAL",
      "collections" : [ "my-collection-tag" ],
      "additionalCollections" : [],
      "outputFormat" : "json"
    }
  }


Field	Description
name	The name of the step instance.
description	A description of the step.
stepDefinitionName	The name of a step definition, which could be one of the default step definitions or a custom one. Custom step definitions can be created using QuickStart or using the Gradle task hubCreateStepDefinition. Tip: If you are customizing a default step type (ingestion, mapping, or mastering), leave the value as `default-ingestion`, `default-mapping`, or `default-mastering`.
stepDefinitionType	The type of the step definition: INGESTION, MAPPING, MATCHING, MERGING, MASTERING, or CUSTOM.
batchSize	The number of documents to process per batch. Each batch goes through all the steps in a flow before the next batch starts. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
threadCount	The number of threads to use when running a flow. If not defined or if set to `0` or null in the step settings, the value in the flow settings is used.
customHook	Definition of a hook that performs additional processes before or after the step. See Creating a Custom Hook Module and Adding a Custom Hook to a Step Manually.
customHook » module	The path to your custom hook module.
customHook » parameters	Parameters, as key-value pairs, to pass to your custom hook module.
customHook » user	The user account to use to run the module. Default is the user running the flow; e.g., `flow-operator`.
customHook » runBefore	For a pre-step hook, set to `true`. For a post-step hook, set to `false`.
options » stepUpdate	If `true`, custom modules can make changes directly to records in the database (inserting, deleting, or locking); otherwise, custom modules can make changes indirectly by passing content objects to Data Hub APIs. Direct changes to the database are rarely needed. For a combined mastering step, the default is `true`. For all other types of steps, the default is `false`.
options » acceptsBatch	If `true`, all the records in the batch are processed within a single step run; otherwise, the step is restarted and run for each record in the batch.
options » sourceDatabase	Choose the STAGING database where you stored ingested data. Default is `data-hub-STAGING`.
options » sourceCollection	The collection tag to use to search for the records to process in this step.
options » sourceQuery	The CTS query to use to select the source data to process. To filter by a collection tag, use `cts.collectionQuery('my-collection-name')`. Example: `"sourceQuery" : "cts.collectionQuery('default-ingestion')"` See CTS Query.
options » constrainSourceQueryToJob	If `true`, the query is applied to the documents that were created or modified in the same job that executes the step. Otherwise, the query disregards the job ID. For example, if `sourceQuery` is `cts.collectionQuery('example')` and `constrainSourceQueryToJob` is `true`, the query searches for documents that are in the `example` collection and were created or modified in the current job. Default is `false`.
options » targetEntity	The entity to map against the source data.
options » targetDatabase	Choose the FINAL database where you want to store the processed data. Default is `data-hub-FINAL`.
options » collections	The collection tags to assign to the resulting records.
options » additionalCollections	The collection tags to assign to the resulting records, in addition to the default collections.
options » outputFormat	The format of the processed record: Text, JSON, XML, or Binary.
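
To see which values apply when a custom step omits an option, the defaults stated in the table above can be overlaid on the step's own options. The following Python sketch only illustrates the documented defaults and is not how Data Hub resolves them internally. The names `CUSTOM_STEP_DEFAULTS` and `with_defaults` are hypothetical, as is the database name `my-other-db`.

```python
# Defaults stated in the custom step table above (for a step that is not a mastering step).
CUSTOM_STEP_DEFAULTS = {
    "stepUpdate": False,
    "sourceDatabase": "data-hub-STAGING",
    "constrainSourceQueryToJob": False,
    "targetDatabase": "data-hub-FINAL",
}


def with_defaults(step):
    """Return a copy of the step whose options include the documented defaults."""
    # Options set explicitly in the step win over the defaults.
    options = {**CUSTOM_STEP_DEFAULTS, **step.get("options", {})}
    return {**step, "options": options}


step = {
    "name": "MyCustomOtherStep",
    "stepDefinitionType": "CUSTOM",
    "options": {"sourceCollection": "my-collection-tag", "targetDatabase": "my-other-db"},
}

resolved = with_defaults(step)
print(resolved["options"]["sourceDatabase"])  # data-hub-STAGING (default applied)
print(resolved["options"]["targetDatabase"])  # my-other-db (explicit value kept)
```

The original `step` dictionary is left unchanged, so the sketch can be used to inspect a flow file without rewriting it.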