Steps

About Steps

A flow consists of one or more steps that process or enhance the data.

A step can be one of the following types:


Step	Input to step	What the step does	Result of step
Ingestion	Raw data from one source	Wraps each item in an envelope and stores the wrapped items as records in the STAGING database.	Ingested data in the STAGING database
Mapping	Ingested data in the STAGING database; entity model definition	Associates the fields in the entity model with the corresponding fields in your source data and then stores the mapped data in the FINAL database.	Mapped data in the FINAL database
Matching	Mapped data in the FINAL database	Checks for possible duplicates in your data.	Internal match summaries in the FINAL database. Each match summary contains a list of records to be merged.
Merging	Internal match summaries created by a matching step	Processes each list of merge candidates according to the specified criteria.	If two records definitely meet the specified criteria for duplicates, a new record based on the two duplicate records is created in the FINAL database, and the old records are tagged as archived but remain in the FINAL database. If two records meet the specified criteria for possible matches (not definite matches), a notification that contains information about the two records is created in the FINAL database. Otherwise, no changes are made.
Mastering	Mapped data in the FINAL database	Checks for possible duplicate documents in your data and handles them according to the specified criteria.
Custom - Ingestion, Custom - Mapping, Custom - Mastering, Custom - Other	Depends on custom code.	Runs the custom code specified in the step definition. The custom code can further process, enhance, or validate your data. A custom step can also replace the default processing included in MarkLogic Data Hub. For example, you can define a different way of ingesting your data by creating a custom ingestion step.	Depends on custom code.

Note: The STAGING database and the FINAL database are the default storage for ingested data and harmonized data, respectively; however, you can use any database.
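
As a rough illustration of what "wrapping an item in an envelope" means for the ingestion step, the following Python sketch builds an envelope around one source item. The function name wrap_in_envelope, the "source" header, and the headers/triples/instance layout are assumptions made for illustration only; the exact structure that Data Hub stores in the STAGING database may differ.

```python
import json


def wrap_in_envelope(item, source_name):
    """Wrap one raw source item in an illustrative envelope (assumed layout)."""
    return {
        "envelope": {
            "headers": {"source": source_name},  # metadata about the item (assumed field)
            "triples": [],                       # placeholder for semantic triples
            "instance": item,                    # the original item, unchanged
        }
    }


# Hypothetical HR record from a hypothetical source name.
record = wrap_in_envelope({"id": 17, "name": "Ada"}, "hr-new-york")
print(json.dumps(record, indent=2))
```

The point of the sketch is that the original item is preserved inside the wrapper, so later steps can read it without losing the source data.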

Choosing Steps for Your Flow

A flow can contain any combination of steps (ingestion, mapping, matching, merging, mastering, and custom). You can create as many flows as you need with various combinations of steps. For example, you can create one flow for ingestion only and another flow that contains both the mapping and mastering steps.

Each predefined type of step (ingestion, mapping, matching, merging, and mastering) has its own set of prerequisites, which is typically the output of another step. For example:

- Before you can configure and run a mapping step, you must have some enveloped data in a database (the result of an ingestion step).
- Before you run a mastering step, you must have some mapped data in a database (the result of a mapping step), and all data to be compared must be mapped to the same entity model.

Essentially, an ingestion step must be executed before a mapping step, which must be executed before a mastering step. If you use split-step mastering (a matching step and a separate merging step), the matching step must be executed before the merging step.

These steps can be in separate flows. However, even if they are in the same flow, you can still choose which steps are executed when running the flow.
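
The ordering rules described here can be sketched as a small check over a list of step types. This is a minimal sketch, not part of MarkLogic Data Hub: the names ORDER_RULES and order_violations and the lowercase step labels are hypothetical. Because prerequisite steps can live in a different flow, the check only compares steps that actually appear in the list.

```python
# Hypothetical helper; not part of MarkLogic Data Hub.
# Each pair (earlier, later) records that `earlier` must run before `later`.
ORDER_RULES = [
    ("ingestion", "mapping"),
    ("mapping", "mastering"),
    ("mapping", "matching"),
    ("matching", "merging"),
]


def order_violations(steps):
    """Return the (earlier, later) rules that a list of step types breaks.

    Only steps present in the list are checked, because a prerequisite
    may already have been satisfied by a different flow.
    """
    position = {}
    for index, step in enumerate(steps):
        position.setdefault(step, index)  # remember the first occurrence

    violations = []
    for earlier, later in ORDER_RULES:
        if earlier in position and later in position:
            if position[earlier] > position[later]:
                violations.append((earlier, later))
    return violations


print(order_violations(["ingestion", "mapping", "matching", "merging"]))
print(order_violations(["merging", "matching"]))
```

The first call prints an empty list because the steps are in a valid order; the second reports that matching must come before merging.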

Tip: Create a separate flow for ingesting each data source. For example:

- Flow A might handle ingesting HR data from a New York subsidiary and then mapping that ingested data to an entity model.
- Flow B might handle similar operations for HR data from a San Francisco subsidiary.

Ingestion
- If you need to load your data into the STAGING database, add an ingestion step.
- If your data is already wrapped in envelopes and stored in the STAGING database, skip.
Mapping
- If your source's fields do NOT correspond one-to-one with your entity model's properties, add a custom step with a link to a custom module that handles the mapping between your source and your entity model.
- If your source's fields need additional processing, such as calculations, add a custom step with a link to a custom module that performs the calculations.
- If you are mapping XML documents, add a custom step with a link to a custom module that handles XML documents.
- If your source requires more complex transformation than a simple typecast, add a custom step with a link to a custom module that performs the transformation.
- If your data has already been mapped against your entity model and stored in the FINAL database, skip.
- Otherwise, add a mapping step.
Mastering
- If you want to keep duplicates in your data, skip.
- If you would like to use MarkLogic's Smart Mastering technology to identify duplicate documents and merge the duplicates and ...
  - Your dataset could be matched and merged using a single thread at an acceptable performance level, add a mastering step.
  - Your dataset is extremely large and/or could have a large number of duplicates, thereby needing multiple threads for better performance, add a matching step followed by a merging step.
- Otherwise, add a custom step with a link to a custom module that identifies duplicate documents and handles them as you wish.
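
The decision list above can be read as a simple function from a few yes/no answers to a list of steps. The following Python sketch is illustrative only: the function choose_steps, its parameters, and the step labels are hypothetical, and it assumes that "already mapped" takes precedence over the custom-mapping conditions. The needs_custom_mapping flag stands for any of the four custom-mapping conditions listed under Mapping.

```python
def choose_steps(needs_loading, already_mapped, needs_custom_mapping,
                 keep_duplicates, use_smart_mastering, needs_multiple_threads):
    """Return an ordered list of step labels for a flow (illustrative only)."""
    steps = []

    # Ingestion: only needed when the data is not yet in the STAGING database.
    if needs_loading:
        steps.append("ingestion")

    # Mapping: skip if already mapped; otherwise use a custom or predefined step.
    if not already_mapped:
        steps.append("custom-mapping" if needs_custom_mapping else "mapping")

    # Mastering: skip to keep duplicates; otherwise pick a strategy.
    if not keep_duplicates:
        if use_smart_mastering and needs_multiple_threads:
            steps.extend(["matching", "merging"])  # split-step mastering
        elif use_smart_mastering:
            steps.append("mastering")
        else:
            steps.append("custom-mastering")
    return steps


# A very large dataset that still needs loading and a simple mapping.
print(choose_steps(
    needs_loading=True,
    already_mapped=False,
    needs_custom_mapping=False,
    keep_duplicates=False,
    use_smart_mastering=True,
    needs_multiple_threads=True,
))
# prints ['ingestion', 'mapping', 'matching', 'merging']
```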

Note: If a custom step is not intended to replace a predefined step, you can insert it anywhere in the flow. For example, if you want your custom module to further enhance your ingested data before mapping, you can insert the custom module between the ingestion step and the mapping step.

Customizing a Step

You can customize a step in multiple ways.

To replace a predefined process (ingestion, mapping, or mastering), create a custom step and choose the appropriate custom step type (Ingestion, Mapping, or Mastering).

To add processing that doesn't fall neatly into any of the predefined processes, create a custom step and choose the custom step type Other.

To add processing that must be performed immediately before or immediately after any step, create the appropriate step and add a custom hook.

To customize the mapping process, you can create custom mapping functions to use in your mapping definition, in addition to the predefined Data Hub mapping functions.