Steps

Overview of steps in Data Hub.

About Steps

A flow consists of one or more steps that process or enhance the data.

A step can be one of the following types:

Ingestion
  • Input to step: Raw data from one source
  • What the step does: Wraps each item in an envelope and stores the wrapped items as documents in the STAGING database.
  • Result of step: Ingested data in the STAGING database

Mapping
  • Input to step:
    • Ingested data in the STAGING database
    • Entity model definition
  • What the step does: Associates the fields in the entity model with the corresponding fields in your source data and then stores the mapped data in the FINAL database.
  • Result of step: Mapped data in the FINAL database

Mastering
  • Input to step: Mapped data in the FINAL database
  • What the step does: Checks for possible duplicate documents in your data and manages them accordingly based on specified criteria.
  • Result of step:
    • If the comparison of two documents definitely meets the specified criteria for duplicates, a new document based on the two duplicate documents is created in the FINAL database, and the old entries are tagged as archived but remain in the FINAL database.
    • If the comparison of two documents meets the specified criteria for possible matches (not definite matches), a notification document is created in the FINAL database. The notification contains information about the two documents.
    • Otherwise, no changes are made.

Custom
  • Input to step: Depends on custom code.
  • What the step does: Runs the custom code specified in the step definition. The custom code can further process, enhance, or validate your data. A custom step can also replace the default processing included in MarkLogic Data Hub. For example, if your data is too complex to be handled by the default mapping step, you can use a custom step to harmonize your data.
  • Result of step: Depends on custom code.

Note: The STAGING database and the FINAL database are the default storage for ingested data and harmonized data, respectively; however, you can use any database.
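
The three possible outcomes of a mastering comparison can be sketched in a few lines of Python. This is only an illustration of the decision logic described above, not Data Hub's implementation: the numeric score, the two thresholds, the FinalDatabase class, and the merge-by-dictionary-union strategy are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds; in Data Hub the criteria come from the step's configuration.
MERGE_THRESHOLD = 0.9    # at or above: definite duplicate
NOTIFY_THRESHOLD = 0.6   # at or above (but below merge): possible match


@dataclass
class FinalDatabase:
    """In-memory stand-in for the FINAL database (illustration only)."""
    documents: dict = field(default_factory=dict)      # uri -> document
    archived: set = field(default_factory=set)         # URIs tagged as archived
    notifications: list = field(default_factory=list)  # notification documents


def master_pair(db, uri_a, uri_b, score):
    """Apply one of the three mastering outcomes to a pair of documents."""
    if score >= MERGE_THRESHOLD:
        # Definite duplicate: create a new document from both originals.
        # The dict union is an assumed merge strategy for this sketch.
        merged_uri = f"/merged/{len(db.documents)}.json"
        db.documents[merged_uri] = {**db.documents[uri_a], **db.documents[uri_b]}
        # The originals stay in the database but are tagged as archived.
        db.archived.update({uri_a, uri_b})
        return "merged"
    if score >= NOTIFY_THRESHOLD:
        # Possible match: record a notification about the two documents.
        db.notifications.append({"uris": [uri_a, uri_b], "score": score})
        return "notified"
    # Not a match: no changes are made.
    return "unchanged"
```

Note that in the merge case the original documents are kept and only tagged, which mirrors the behavior described for the FINAL database.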

Choosing Steps for Your Flow

You can create as many flows as you need with various combinations of steps.

However, each predefined type of step (ingestion, mapping, and mastering) has its own prerequisites, which are typically the output of another step. For example:
  • Before you can configure and run a mapping step, you must have some enveloped data in a database (the result of an ingestion step).
  • Before running the mastering step, you must have some mapped data in a database (the result of a mapping step), and all data to be compared must be mapped to the same entity model.

Essentially, an ingestion step must be executed before a mapping step, which must be executed before a mastering step. However, these steps are not required to be in the same flow.

A flow can contain any combination of steps (ingestion, mapping, mastering, and custom). For example, you can create one flow for ingestion only and another flow that contains both the mapping and mastering steps.
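
The ordering rule can be checked mechanically. The following Python sketch models each flow as a name plus a list of step types and reports any predefined step whose prerequisite has not run yet. The data structure and the function name missing_prerequisites are hypothetical and are not part of Data Hub. The check also assumes that both databases start empty, so it does not account for data that was ingested or mapped earlier.

```python
# Each predefined step type and the step type whose output it needs.
PREREQUISITE = {"ingestion": None, "mapping": "ingestion", "mastering": "mapping"}


def missing_prerequisites(flows):
    """Return (flow, step, needed) tuples for steps whose prerequisite has not run.

    `flows` is a list of (flow_name, [step_type, ...]) pairs, listed in the
    order in which the flows are run.
    """
    completed = set()
    problems = []
    for flow_name, steps in flows:
        for step in steps:
            # Custom steps are not in the table, so they have no fixed prerequisite.
            needed = PREREQUISITE.get(step)
            if needed is not None and needed not in completed:
                problems.append((flow_name, step, needed))
            completed.add(step)
    return problems


# One flow for ingestion only, another for mapping and mastering.
flows = [
    ("ingest-only", ["ingestion"]),
    ("harmonize", ["mapping", "mastering"]),
]
print(missing_prerequisites(flows))  # prints []
```

Running the same two flows in the opposite order reports that the mapping step in the second flow lacks ingested data, even though the steps live in different flows.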

Tip: Create a separate flow for ingesting each data source. For example,
  • Flow A might handle ingesting HR data from a New York subsidiary and then mapping that ingested data to an entity model.
  • Flow B might handle similar operations for HR data from a San Francisco subsidiary.

To decide which steps to add to a flow, use the following guidelines for each step type:
  • Ingestion
    • If you need to load your data into the STAGING database, add an ingestion step.
    • If your data is already wrapped in envelopes and stored in the STAGING database, skip this step.
  • Mapping
    • If your source's fields do NOT correspond one-to-one with your entity model's properties, add a custom step with a link to a custom module that handles the mapping between your source and your entity model.
    • If your source's fields need additional processing, such as calculations, add a custom step with a link to a custom module that performs the calculations.
    • If you are mapping XML documents, add a custom step with a link to a custom module that handles XML documents.
    • If you are mapping non-flat JSON documents (i.e., some properties are nested), add a custom step with a link to a custom module that handles complex JSON documents.
    • If your source requires more complex transformation than a simple typecast, add a custom step with a link to a custom module that performs the transformation.
    • If your data has already been mapped against your entity model and stored in the FINAL database, skip this step.
    • Otherwise, add a mapping step.
  • Mastering
    • If you want to keep duplicates in your data, skip this step.
    • If you would like to use MarkLogic's Smart Mastering technology to identify duplicate documents and merge the duplicates, add a mastering step.
    • Otherwise, add a custom step with a link to a custom module that identifies duplicate documents and handles them as you wish.
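
The guidelines above amount to a small decision procedure, sketched here in Python. The SourceProfile fields and the choose_steps function are hypothetical names introduced for illustration. The sketch folds all of the custom-mapping conditions (fields that do not correspond one-to-one, extra calculations, XML, nested JSON, complex transformations) into a single flag.

```python
from dataclasses import dataclass


@dataclass
class SourceProfile:
    """Hypothetical answers to the questions in the guidelines."""
    already_in_staging: bool = False       # enveloped data is already in STAGING
    already_mapped_in_final: bool = False  # mapped data is already in FINAL
    needs_custom_mapping: bool = False     # any custom-mapping condition applies
    keep_duplicates: bool = False          # duplicates should be left alone
    use_smart_mastering: bool = True       # use the predefined mastering step


def choose_steps(profile):
    """Return the step types to add to a flow, in execution order."""
    steps = []
    if not profile.already_in_staging:
        steps.append("ingestion")
    # Data that is already mapped needs no mapping step of either kind.
    if not profile.already_mapped_in_final:
        steps.append("custom (mapping)" if profile.needs_custom_mapping else "mapping")
    if not profile.keep_duplicates:
        steps.append("mastering" if profile.use_smart_mastering else "custom (mastering)")
    return steps


print(choose_steps(SourceProfile()))
# prints ['ingestion', 'mapping', 'mastering']
```

For a source that is already in the STAGING database, contains nested JSON, and should keep its duplicates, the same function returns only a custom mapping step.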
Note: If a custom step is not intended to replace a predefined step, you can insert it anywhere in the flow. For example, if you want your custom module to further enhance your ingested data before mapping, you can insert the custom step between the ingestion step and the mapping step.
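
As a minimal sketch of that placement, the following Python snippet treats a flow as an ordered list of step names and adds a custom step just before the mapping step. The list representation and the insert_before helper are assumptions for illustration only and do not reflect how Data Hub stores flows.

```python
def insert_before(steps, new_step, before):
    """Return a copy of `steps` with `new_step` placed just before `before`."""
    position = steps.index(before)  # raises ValueError if `before` is not in the flow
    return steps[:position] + [new_step] + steps[position:]


flow = ["ingestion", "mapping", "mastering"]
print(insert_before(flow, "custom: enhance ingested data", "mapping"))
# prints ['ingestion', 'custom: enhance ingested data', 'mapping', 'mastering']
```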