Steps

Overview of steps in Data Hub.

About Steps

A flow consists of one or more steps that process or enhance the data.

A step can be one of the following types:

Each step type is listed below with its name in Hub Central and in QuickStart, the input to the step, what the step does, and the result of the step.

Loading (Hub Central) / Ingestion (QuickStart)
  • Input to step: Raw data from one source
  • What the step does: Wraps each item in an envelope and stores the wrapped items as records in the STAGING database.
  • Result of step: Ingested/Loaded data in the STAGING database

Mapping (Hub Central) / Mapping (QuickStart)
  • Input to step:
    • Ingested data in the STAGING database
    • Entity model definition
  • What the step does: Associates the fields in the entity model with the corresponding fields in your source data and then stores the mapped data in the FINAL database.
  • Result of step: Mapped data in the FINAL database

Matching (Hub Central) / Matching (QuickStart)
  • Input to step: Mapped data in the FINAL database
  • What the step does: Checks for possible duplicates in your data.
  • Result of step: Internal match summaries in the FINAL database. Each match summary contains a list of records to be merged.

Merging (Hub Central) / Merging (QuickStart)
  • Input to step: Internal match summaries created by a matching step
  • What the step does: Handles the lists of candidates accordingly based on the specified criteria.
    • If the comparison of two records definitely meets the specified criteria for duplicates, a new record based on the two duplicate records is created in the FINAL database, and the old entries are tagged as archived but remain in the FINAL database.
    • If the comparison of two records meets the specified criteria for possible matches (not definite matches), a notification is created in the FINAL database. The notification contains information about the two records.
    • Otherwise, no changes are made.

Mastering
  • Input to step: Mapped data in the FINAL database
  • What the step does: Checks for possible duplicate documents in your data and handles them accordingly based on the specified criteria.

Custom steps
  • Step names:
    • Custom-Loading
    • Custom
    • Custom-Ingestion
    • Custom-Mapping
    • Custom-Mastering
    • Custom-Other
  • Input to step: Depends on custom code.
  • What the step does: Runs the custom code specified in the step definition. The custom code can further process, enhance, or validate your data. A custom step can also replace the default processing included in MarkLogic Data Hub. For example, you can define a different way of ingesting your data by creating a custom ingestion step.
  • Result of step: Depends on custom code.

Note: The STAGING database and the FINAL database are the default storage for ingested data and harmonized data, respectively; however, you can use any database.
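
The three outcomes described for a merging step (merge, notify, or no change) can be pictured with a small, self-contained sketch. This is not Data Hub code: the `FinalDatabase` class, the numeric score, the two thresholds, and the way the merged record is built are all assumptions made for illustration. In a real project the criteria come from the step configuration.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds standing in for "the specified criteria".
MERGE_THRESHOLD = 0.9
NOTIFY_THRESHOLD = 0.6


@dataclass
class FinalDatabase:
    """Toy in-memory stand-in for the FINAL database."""
    records: dict = field(default_factory=dict)        # id -> {"data": dict, "archived": bool}
    notifications: list = field(default_factory=list)  # (id_a, id_b) pairs


def handle_candidate(db, id_a, id_b, score):
    """Apply one of the three merging outcomes to a candidate pair."""
    if score >= MERGE_THRESHOLD:
        # Definite duplicate: create a new record and archive the old ones.
        # Combining the two dicts is an assumed, simplistic merge strategy.
        merged_data = {**db.records[id_a]["data"], **db.records[id_b]["data"]}
        db.records[f"merged:{id_a}+{id_b}"] = {"data": merged_data, "archived": False}
        db.records[id_a]["archived"] = True
        db.records[id_b]["archived"] = True
        return "merged"
    if score >= NOTIFY_THRESHOLD:
        # Possible match: record a notification about the two records.
        db.notifications.append((id_a, id_b))
        return "notified"
    # Otherwise, no changes are made.
    return "unchanged"
```

Note that the archived records stay in the toy database, mirroring the statement that the old entries remain in the FINAL database.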

Choosing Steps for Your Flow

A flow can contain any combination of steps (ingestion/loading, mapping, matching, merging, mastering, and custom). You can create as many flows as you need with various combinations of steps. For example, you can create one flow for ingestion/loading only and another flow that contains both the mapping and mastering steps.

Each predefined type of step (ingestion/loading, mapping, matching, merging, and mastering) has its own set of prerequisites, which is typically the output of another step. For example,
  • Before you can configure and run a mapping step, you must have some enveloped data in a database (the result of an ingestion/loading step).
  • Before running the mastering step, you must have some mapped data in a database (the result of a mapping step), and all data to be compared must be mapped to the same entity model.

Essentially, an ingestion/loading step must be executed before a mapping step, which must be executed before a mastering step. If using split-step mastering, the matching step must be executed before the merging step.
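
The ordering rules above can be checked mechanically. The following is a minimal sketch, not part of Data Hub: the step names and the `missing_prerequisites` helper are hypothetical, and it only looks at a single sequence of steps, so it does not know about data that an earlier flow may already have produced.

```python
# Each predefined step type mapped to the step types whose output it needs.
PREREQUISITES = {
    "ingestion": [],
    "mapping": ["ingestion"],
    "mastering": ["mapping"],
    "matching": ["mapping"],
    "merging": ["matching"],
}


def missing_prerequisites(steps):
    """Return (step, prerequisite) pairs where the prerequisite did not run earlier."""
    seen = set()
    problems = []
    for step in steps:
        # Custom steps have no fixed prerequisites, so unknown names get [].
        for prereq in PREREQUISITES.get(step, []):
            if prereq not in seen:
                problems.append((step, prereq))
        seen.add(step)
    return problems
```

For example, `missing_prerequisites(["ingestion", "mapping", "merging"])` reports that merging lacks a preceding matching step.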

These steps can be in separate flows. However, even if they are in the same flow, you can still choose which steps are executed when running the flow.

Tip: Create a separate flow for ingesting each data source. For example,
  • Flow A might handle ingesting HR data from a New York subsidiary and then mapping that ingested data to an entity model.
  • Flow B might handle similar operations for HR data from a San Francisco subsidiary.

Use the following guidelines to decide which steps to add to a flow:

  • Ingestion/Loading
    • If you need to load your data into the STAGING database, add an ingestion/loading step.
    • If your data is already wrapped in envelopes and stored in the STAGING database, skip.
  • Mapping
    • If your source's fields do NOT correspond one-to-one with your entity model's properties, add a custom step with a link to a custom module that handles the mapping between your source and your entity model.
    • If your source's fields need additional processing, such as calculations, add a custom step with a link to a custom module that performs the calculations.
    • If you are mapping XML documents, add a custom step with a link to a custom module that handles XML documents.
    • If your source requires more complex transformation than a simple typecast, add a custom step with a link to a custom module that performs the transformation.
    • If your data has already been mapped against your entity model and stored in the FINAL database, skip.
    • Otherwise, add a mapping step.
  • Mastering
    • If you want to keep duplicates in your data, skip.
    • If you would like to use MarkLogic's Smart Mastering technology to identify duplicate documents and merge the duplicates and ...
      • Your dataset could be matched and merged using a single thread at an acceptable performance level, add a mastering step.
      • Your dataset is extremely large and/or could have a large number of duplicates, thereby needing multiple threads for better performance, add a matching step followed by a merging step.
    • Otherwise, add a custom step with a link to a custom module that identifies duplicate documents and handles them as you wish.
Note: If a custom step is not intended to replace a predefined step, you can insert it anywhere in the flow. For example, if you want your custom module to further enhance your ingested data before mapping, you can insert the custom module between the ingestion/loading step and the mapping step.
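
The guidelines above amount to a short decision procedure. The sketch below encodes it as a plain function, purely as an illustration: the function name and its keyword arguments are hypothetical, and the single `needs_custom_mapping` flag stands for any of the cases that call for a custom mapping module (no one-to-one field correspondence, extra calculations, XML documents, or complex transformations).

```python
def choose_steps(*, already_in_staging, needs_custom_mapping, already_in_final,
                 keep_duplicates, use_smart_mastering, needs_multiple_threads):
    """Return the step types to add, following the guidelines in order."""
    steps = []

    # Ingestion/Loading: skip when enveloped data is already in STAGING.
    if not already_in_staging:
        steps.append("ingestion/loading")

    # Mapping: custom cases first, then the skip case, then the default.
    if needs_custom_mapping:
        steps.append("custom")
    elif not already_in_final:
        steps.append("mapping")

    # Mastering: skip, split steps, single mastering step, or custom.
    if keep_duplicates:
        pass
    elif use_smart_mastering and needs_multiple_threads:
        steps += ["matching", "merging"]
    elif use_smart_mastering:
        steps.append("mastering")
    else:
        steps.append("custom")
    return steps
```

Calling it for a large dataset with nothing loaded yet and Smart Mastering enabled yields an ingestion/loading step, a mapping step, and a matching step followed by a merging step.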

Testing Data Hub Applications

Data Hub supports the following unit testing capabilities. See this example project for an example of how to use these capabilities.

Data Hub supports using marklogic-unit-test.

Note: Consider running multiple Docker containers: one for running and manually testing your application, and one for running your automated tests. Doing so achieves a result similar to having mirrored test resources.

You can perform testing in Data Hub that includes but is not limited to the following:

  1. Testing an entire flow
  2. Testing a mapping step
  3. Testing a matching step
  4. Testing a merging step
  5. Testing a custom step
  6. Testing a custom mapping function

Data Hub supports:

  1. Writing tests within MarkLogic using SJS or XQY
  2. Writing JUnit 5 tests that exercise one or many MarkLogic endpoints and make use of Data Hub Java APIs
  3. Running either or both kinds of tests above

With this functionality, you can do the following:

  1. Add marklogic-unit-test as an mlBundle to your Data Hub project and use it in the same way as in an ml-gradle project.
  2. Run ml-gradle's "mlUnitTest" to run all of your marklogic-unit-test tests and get back a report of what succeeded and what failed.
  3. Import a Data Hub library with test helper functions that simplify writing tests that depend on a database being "clean" - i.e. only Data Hub/user artifacts, no job data from previous runs.
  4. Run a Gradle task - e.g. "hubGenerateUnitTest" - that scaffolds out a basic marklogic-unit-test test that imports the Data Hub test library.
  5. Import a Data Hub-specific JUnit 5 library that assists with writing Data Hub-specific JUnit 5 tests, including "cleaning" the database before a test method runs.
  6. Read docs on how to run all of your marklogic-unit-test tests when you run "gradle test" (you have to add a test class to your source tree for this to work).

Marklogic-unit-test testing support

The following options are supported:

  1. Adding marklogic-unit-test-modules as an mlBundle dependency to your project; those modules are then deployed via mlDeploy / hubDeploy.
  2. Adding test modules to e.g. src/test/resources/ml-modules and configuring mlModulePaths to include that path.
  3. Importing "/data-hub/marklogic-unit-test/hub-test-helper.xqy" to access Data Hub-specific helper functions for tests (this XQuery module is consistent with the marklogic-unit-test's "test-helper.xqy" module, and XQY is easier to reuse in SJS than vice versa). These include:
    • Preparing a database before a test runs - e.g. clearing everything except artifacts from staging and final, and removing jobs/prov data from jobs
    • Getting a document as a "record", with its permissions/collections/metadata included, for easy assertions
    • Assertion functions for a document's permissions and collections
  4. Reusing the ml-gradle "mlUnitTest" task to run all of your marklogic-unit-test tests.
  5. Running a new "hubGenerateUnitTest" Gradle task that creates a new marklogic-unit-test test suite which reuses hub-test-helper.xqy to prepare the database.
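
To make the idea of a "clean" database concrete, here is a toy model of the preparation described above. It is not the hub-test-helper.xqy implementation and does not call MarkLogic. The in-memory dictionaries, the collection names, and the choice to empty the jobs database entirely are assumptions made for illustration.

```python
# Hypothetical collection names marking Data Hub/user artifacts.
ARTIFACT_COLLECTIONS = {"example-flow-artifacts", "example-entity-artifacts"}


def prepare_databases(databases):
    """Return a 'clean' copy: only artifacts in staging/final, no job data."""
    clean = {}
    for name in ("staging", "final"):
        # Keep a document only if it belongs to at least one artifact collection.
        clean[name] = {
            uri: doc for uri, doc in databases[name].items()
            if ARTIFACT_COLLECTIONS & set(doc["collections"])
        }
    # Assumed simplification: drop all jobs/provenance data from earlier runs.
    clean["jobs"] = {}
    return clean
```

A test that starts from this state can assert on exactly the documents its own flow run produced, without interference from earlier runs.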

To support JUnit 5 testing, you have access to the "com.marklogic:marklogic-data-hub-junit5" module. You can add this to your test classpath and gain the following support:

  1. An abstract base class that your tests can extend which provides the following:
    • A HubConfigImpl object based on your gradle.properties / gradle-local.properties
    • The ability to construct a HubConfigImpl as a different user, again based on your Gradle property files, so you can run tests as different users with different roles.
    • Automatically preparing a database in the same manner as described above (e.g. reset staging/final, delete jobs/prov data).
    • Running a flow (a typical use case for a JUnit5 test)
    • A Logger object for logging convenience
  2. Constructing your own test base class that doesn't extend the Data Hub-provided one, but can easily reuse the above capabilities. This ensures that if you don’t like something that the Data Hub base class does, you can easily construct your own without losing anything.
  3. Creating a JUnit 5 class that can run all of your marklogic-unit-test modules as "parameterized" tests. To do so, you have to have a class in your test source tree that does this.
  4. Running ingestion steps that may have developer-specific absolute file paths in them.
  5. Support for parallel tests, given multiple host names, which speeds up tests, assuming that you have multiple clusters at your disposal for running tests.
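
The parallel-test idea in the last item can be illustrated with a generic host pool: each test borrows a host name, runs against it, and returns it, so no two tests share a host at the same time. This is a sketch of the general technique only, not how the marklogic-data-hub-junit5 library implements it. The `run_in_parallel` helper and the host names in the usage note are hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(tests, host_names):
    """Run each test callable against a free host; results keep input order."""
    hosts = queue.Queue()
    for name in host_names:
        hosts.put(name)

    def run_one(test):
        host = hosts.get()       # blocks until a host is free
        try:
            return test(host)
        finally:
            hosts.put(host)      # always hand the host back

    # One worker per host, so a worker never waits long for a free host.
    with ThreadPoolExecutor(max_workers=len(host_names)) as pool:
        return list(pool.map(run_one, tests))
```

With two host names such as "host-a" and "host-b", at most two tests run at once, each against its own host.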

Use cases that illustrate this functionality

  • Prepare databases before running marklogic-unit-test tests: covers adding the XQY library, importing it, and using it to prepare the database
  • Generate marklogic-unit-test test suite: Gradle task
  • Test flow by extending Data Hub base test class: JUnit 5 project that provides the features above for JUnit 5; supports an ingestion step with an absolute file path in it
  • Test flow by reusing Data Hub test support: how to reuse Data Hub test support without extending its base test class
  • Run marklogic-unit-test tests via JUnit 5

Data Hub has the following new APIs:

  1. A marklogic-data-hub-junit5 library that you can import into your Java test classpath
  2. A /data-hub/marklogic-unit-test/hub-test-helper.xqy library to help you with writing Data Hub-specific marklogic-unit-test tests.

The marklogic-data-hub-junit5 includes logging.

The Data Hub test library is included in the Data Hub install. You access the JUnit 5 library via normal Gradle/Maven dependency resolution.