Overview of steps in Data Hub.
A flow consists of one or more steps that process or enhance the data.
A step can be one of the following types:
|Step||Input to step||What the step does||Result of step|
|Ingestion||Raw data from one source||Wraps each item in an envelope and stores the wrapped items as records in the STAGING database.||Ingested data in the STAGING database|
|Mapping||Ingested data in the STAGING database||Associates the fields in the entity model with the corresponding fields in your source data and then stores the mapped data in the FINAL database.||Mapped data in the FINAL database|
|Matching||Mapped data in the FINAL database||Checks for possible duplicates in your data.||Internal match summaries in the FINAL database. Each match summary contains a list of records to be merged.|
|Merging||Internal match summaries created by a matching step||Handles the lists of candidates based on the specified criteria.||If the comparison of two records definitely meets the specified criteria for duplicates, a new record based on the two duplicate records is created in the FINAL database, and the old entries are tagged as archived but remain in the FINAL database. If the comparison of two records meets the specified criteria for possible matches (not definite matches), a notification is created in the FINAL database. The notification contains information about the two records. Otherwise, no changes are made.|
|Mastering||Mapped data in the FINAL database||Checks for possible duplicate documents in your data and handles them based on the specified criteria.|
|Custom||Depends on custom code.||Runs the custom code specified in the step definition. The custom code can further process, enhance, or validate your data. A custom step can also replace the default processing included in MarkLogic Data Hub. For example, you can define a different way of ingesting your data by creating a custom ingestion step.||Depends on custom code.|
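The three outcomes listed for the merging step can be pictured as a threshold decision. The following is a minimal sketch of that logic only, not Data Hub code: the numeric score, the two threshold parameters, and the names `MergeAction` and `decide_action` are assumptions made for illustration.

```python
from enum import Enum


class MergeAction(Enum):
    MERGE = "merge"    # create a merged record and archive the originals
    NOTIFY = "notify"  # create a notification about the two records
    NONE = "none"      # leave both records unchanged


def decide_action(score: float, merge_threshold: float,
                  notify_threshold: float) -> MergeAction:
    """Map a match score to one of the three outcomes.

    Hypothetical model: a higher score means a stronger match, and
    merge_threshold must not be lower than notify_threshold.
    """
    if merge_threshold < notify_threshold:
        raise ValueError("merge_threshold must be >= notify_threshold")
    if score >= merge_threshold:
        return MergeAction.MERGE      # definite duplicate
    if score >= notify_threshold:
        return MergeAction.NOTIFY     # possible match, not definite
    return MergeAction.NONE           # no changes are made
```

In this sketch a definite match takes priority over a possible match, which mirrors the order in which the outcomes are described in the table.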
Choosing Steps for Your Flow
A flow can contain any combination of steps (ingestion, mapping, matching, merging, mastering, and custom). You can create as many flows as you need with various combinations of steps. For example, you can create one flow for ingestion only and another flow that contains both the mapping and mastering steps.
- Before you can configure and run a mapping step, you must have some enveloped data in a database (the result of an ingestion step).
- Before running the mastering step, you must have some mapped data in a database (the result of a mapping step), and all data to be compared must be mapped to the same entity model.
Essentially, an ingestion step must be executed before a mapping step, which must be executed before a mastering step. If using split-step mastering, the matching step must be executed before the merging step.
These steps can be in separate flows. However, even if they are in the same flow, you can still choose which steps are executed when running the flow. For example:
- Flow A might handle ingesting HR data from a New York subsidiary and then mapping that ingested data to an entity model.
- Flow B might handle similar operations for HR data from a San Francisco subsidiary.
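The prerequisite rules described earlier in this section (ingestion before mapping, mapping before mastering or matching, matching before merging) can be checked mechanically before running a flow. The sketch below is one hypothetical way to do so. The `PREREQUISITES` table and the `check_step_order` function are illustrative names, not part of Data Hub, and custom steps are left unconstrained because their requirements depend on the custom code.

```python
# Hypothetical prerequisite table: each step type maps to the step type
# whose output it needs. Custom steps are not constrained in this sketch.
PREREQUISITES = {
    "mapping": "ingestion",
    "mastering": "mapping",
    "matching": "mapping",
    "merging": "matching",
}


def check_step_order(executed, planned):
    """Return a list of problems found in the planned run order.

    executed: step types whose results already exist in the databases,
              for example from an earlier flow run.
    planned:  step types to run now, in order.
    """
    done = set(executed)
    problems = []
    for step in planned:
        needed = PREREQUISITES.get(step)
        if needed is not None and needed not in done:
            problems.append(f"{step} requires a prior {needed} step")
        done.add(step)
    return problems
```

Because results from earlier runs are passed in as `executed`, the same check covers both cases mentioned above: steps that live in separate flows and steps selected from a single flow.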
- If you need to load your data into the STAGING database, add an ingestion step.
- If your data is already wrapped in envelopes and stored in the STAGING database, skip the ingestion step.
- If your source's fields do NOT correspond one-to-one with your entity model's properties, add a custom step with a link to a custom module that handles the mapping between your source and your entity model.
- If your source's fields need additional processing, such as calculations, add a custom step with a link to a custom module that performs the calculations.
- If you are mapping XML documents, add a custom step with a link to a custom module that handles XML documents.
- If your source requires more complex transformation than a simple typecast, add a custom step with a link to a custom module that performs the transformation.
- If your data has already been mapped against your entity model and stored in the FINAL database, skip the mapping step.
- Otherwise, add a mapping step.
- If you want to keep duplicates in your data, skip the mastering step.
- If you would like to use MarkLogic's Smart Mastering technology to identify duplicate documents and merge the duplicates and ...
  - Your dataset could be matched and merged using a single thread at an acceptable performance level, add a mastering step.
  - Your dataset is extremely large and/or could have a large number of duplicates, thereby needing multiple threads for better performance, add a matching step followed by a merging step.
- Otherwise, add a custom step with a link to a custom module that identifies duplicate documents and handles them as you wish.
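The questions in the list above amount to a small decision procedure. The following sketch walks through them in the same order. It is an illustration under stated assumptions, not a Data Hub API: the `Source` class, its boolean fields, and `choose_steps` are hypothetical names, and the single `needs_custom_mapping` flag stands in for all four custom-mapping cases (fields that do not correspond one-to-one, additional calculations, XML documents, complex transformations).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    # Hypothetical flags that summarize the questions in the list above.
    already_in_staging: bool = False
    needs_custom_mapping: bool = False
    already_mapped: bool = False
    keep_duplicates: bool = False
    use_smart_mastering: bool = True
    needs_multiple_threads: bool = False


def choose_steps(src: Source) -> List[str]:
    """Return the step types suggested for a flow, in execution order."""
    steps = []

    # Ingestion: only needed if the data is not yet enveloped in STAGING.
    if not src.already_in_staging:
        steps.append("ingestion")

    # Mapping: custom module, skip, or the default mapping step.
    if src.needs_custom_mapping:
        steps.append("custom")
    elif not src.already_mapped:
        steps.append("mapping")

    # Duplicate handling: skip, mastering, matching + merging, or custom.
    if not src.keep_duplicates:
        if not src.use_smart_mastering:
            steps.append("custom")
        elif src.needs_multiple_threads:
            steps.extend(["matching", "merging"])
        else:
            steps.append("mastering")

    return steps
```

For a source with all default flags, `choose_steps(Source())` suggests an ingestion step, a mapping step, and a mastering step.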
Customizing a Step
You can customize a step in multiple ways.