Mastering: Matching and Merging

Overview of mastering (matching and merging) in Data Hub.

About Mastering in MarkLogic Data Hub

Smart Mastering is a MarkLogic technology that uses rules you define to find records in your data that refer to the same entity, and then merges those records according to thresholds you specify.

In MarkLogic Data Hub, the mastering step involves two processes with their associated sets of rules:

  • Matching determines if two records are candidates for merging, based on the degree of similarity between them and the weight of the comparison. Matching rules define the properties to compare, how to compare them, and what thresholds must be exceeded before taking action.
    • Match options define how the records are compared.
    • Match thresholds define the limits and the actions to take when those limits are exceeded.
  • Merging processes the candidates according to the thresholds they exceeded. Merging rules define how two or more records in the data are merged. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.
    • Merge options define how the properties of the candidate records are combined.
    • Merge strategies are sets of merge options that you can name and reuse.
    • Merge collections are sets of records that have the same collection tags.

Matching

You can create one or more rules to determine if two or more records match. Each rule compares the values of a single property in the candidate records. The comparison can be one of the following types:

  • Exact. Determines if the values of the specified property in two or more records are exactly the same.
  • Synonym. Determines if the values of the specified property in two or more records are synonyms, according to the specified thesaurus.
  • Double Metaphone. Determines if the values of the specified property in two or more records sound similar, based on the Double Metaphone algorithm. For example, "Smith" might sound like "Schmidt".
  • Zip. Determines if the zip/postal code in two or more records match.
  • Reduce. Reduces the significance of certain matches. For example, even if the addresses and last names of two records match, the similarity might not necessarily indicate that the two records refer to the same person, because they might be two members of the same family.
  • Custom. Runs a function in your custom module to compare the values of a specified property in two or more records.

Then, you can specify thresholds and what to do when a threshold is exceeded. For example, exceeding a threshold could:

  • Trigger an automatic merge.
  • Send a notification.
  • Run a custom module.
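
The interplay of rule weights and thresholds can be sketched as follows. This is a minimal illustration in Python, not Data Hub's implementation or configuration format: the rule names, weights, threshold values, and the score and action_for functions are all hypothetical, and only an exact comparison is modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    prop: str      # property compared by this rule
    weight: int    # hypothetical weight added to the score when the values match

@dataclass(frozen=True)
class Threshold:
    above: int     # score that must be exceeded
    action: str    # hypothetical action name, e.g. "merge" or "notify"

# Hypothetical rules and thresholds, for illustration only.
RULES = [Rule("lastName", 10), Rule("ssn", 20), Rule("postalCode", 5)]
THRESHOLDS = [Threshold(24, "merge"), Threshold(12, "notify")]

def score(a: dict, b: dict, rules=RULES) -> int:
    """Sum the weights of the rules whose property values match exactly."""
    return sum(r.weight for r in rules
               if a.get(r.prop) is not None and a.get(r.prop) == b.get(r.prop))

def action_for(total: int, thresholds=THRESHOLDS):
    """Return the action of the highest threshold that the score exceeds, or None."""
    for t in sorted(thresholds, key=lambda t: t.above, reverse=True):
        if total > t.above:
            return t.action
    return None

a = {"lastName": "Smith", "ssn": "123-45-6789", "postalCode": "94105"}
b = {"lastName": "Smith", "ssn": "123-45-6789", "postalCode": "10001"}
c = {"lastName": "Smith", "ssn": "999-99-9999", "postalCode": "94105"}

print(score(a, b), action_for(score(a, b)))  # 30 merge
print(score(a, c), action_for(score(a, c)))  # 15 notify
```

In this sketch, a matching last name and postal code alone are enough to trigger a notification but not an automatic merge, which mirrors the idea of reserving merges for the strongest matches.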

Merging

In a merge, a new record is created, and the values from the original records can be combined and copied to the new record according to the rules you specify. For example:

  • You can restrict the number of unique values copied to the new record.
  • You can restrict the number of data sources from which to copy values.
  • You can specify that only records from specific data sets are allowed to be merged. You can also assign a weight to each source to give priority to more reliable sources.
  • You can also assign a weight to the length of a string.

If you use certain combinations of merge settings, you can save them as a strategy and refer to them by the strategy name.

You can also do your own merge using a custom module.

Merging is non-destructive. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.
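
A non-destructive merge with source weights and a limit on the number of values can be sketched as follows. This is an illustration under stated assumptions, not Data Hub's merge logic: the record shape, the SOURCE_WEIGHTS table, the "archived" collection name, and the merge_records and archive functions are all hypothetical.

```python
from copy import deepcopy

# Hypothetical source weights: a higher number means a more trusted source.
SOURCE_WEIGHTS = {"crm": 10, "web_form": 2}

def merge_records(records, prop_limits, source_weights=SOURCE_WEIGHTS):
    """Build a new merged record; the original records are not modified.

    records: list of dicts with "uri", "source", and "properties" keys.
    prop_limits: maximum number of unique values to keep for a property.
    """
    # Take values from the most trusted sources first (the sort is stable).
    ranked = sorted(records,
                    key=lambda r: source_weights.get(r["source"], 0),
                    reverse=True)
    merged = {}
    for rec in ranked:
        for prop, value in rec["properties"].items():
            values = merged.setdefault(prop, [])
            limit = prop_limits.get(prop)
            if value not in values and (limit is None or len(values) < limit):
                values.append(value)
    return {"properties": merged, "merged_from": [r["uri"] for r in records]}

def archive(record):
    """Return a copy of an original record tagged as archived; nothing is deleted."""
    tagged = deepcopy(record)
    tagged["collections"] = sorted(set(tagged.get("collections", [])) | {"archived"})
    return tagged

originals = [
    {"uri": "/a.json", "source": "web_form",
     "properties": {"email": "j@old.example", "name": "J. Smith"}},
    {"uri": "/b.json", "source": "crm",
     "properties": {"email": "john@example.com", "name": "John Smith"}},
]
merged = merge_records(originals, {"email": 1})
archived = [archive(r) for r in originals]
print(merged["properties"]["email"])  # ['john@example.com']
```

Because the email property is limited to one value, only the value from the higher-weighted source survives, while both name values are kept. The originals remain available as archived copies.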

Combined-Step versus Split-Step Mastering

Data Hub provides the ability to perform matching and merging either within a single step (mastering step) or with two steps (matching step and merging step).

In combined-step mastering, the records in the specified collection or query are compared with each other. If matches are found, a new merged record is created and the original records are archived. Both matching and merging are done in the same mastering step. Combined-step mastering must be run with exactly one thread.

In split-step mastering:
  • The matching step compares the records tagged with the Source Collection. If matches are found, the step creates match summaries that list the URIs of matching records, based on the matching rules that you specify in the step.
  • The merging step reads the match summaries and merges the specified records, based on the merging rules that you specify in the step.

In most cases, split-step mastering is ideal, because it can use multiple threads and avoid locks more effectively.
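
The data flow between the two steps can be sketched as follows. This is a simplified illustration, not Data Hub's code: the matching_step and merging_step functions, the summary shape, and the sample records are hypothetical, and matching is reduced to exact equality on a few properties.

```python
from collections import defaultdict

def matching_step(records, key_props):
    """Group URIs whose key properties are equal; emit one summary per group."""
    groups = defaultdict(list)
    for uri, props in records.items():
        key = tuple(props.get(p) for p in key_props)
        groups[key].append(uri)
    # Only groups with at least two records are matches.
    return [{"uris": sorted(uris)} for uris in groups.values() if len(uris) > 1]

def merging_step(records, summaries):
    """Read match summaries and build one merged record per summary."""
    merged = []
    for summary in summaries:
        combined = {}
        for uri in summary["uris"]:
            for prop, value in records[uri].items():
                values = combined.setdefault(prop, [])
                if value not in values:
                    values.append(value)
        merged.append({"merged_from": summary["uris"], "properties": combined})
    return merged

records = {
    "/c1.json": {"lastName": "Smith", "zip": "94105", "phone": "555-0100"},
    "/c2.json": {"lastName": "Smith", "zip": "94105", "phone": "555-0199"},
    "/c3.json": {"lastName": "Jones", "zip": "10001", "phone": "555-0111"},
}
summaries = matching_step(records, ["lastName", "zip"])
merged = merging_step(records, summaries)
print(summaries)  # [{'uris': ['/c1.json', '/c2.json']}]
```

The merging step in this sketch works only from the summaries and does not re-evaluate the matching rules, so the two steps can be run and tuned independently.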

Guidelines for Split-Step Mastering

With split-step mastering, keep the following guidelines in mind:

  • When creating the matching step, specify a collection tag to add so you can undo the matching, if needed. Replace the collection tag before each run.
  • Use the same database for both the Source Database and the Target Database in both the matching step and the merging step.
  • Run the matching step only once for the same dataset. The merging step will process all existing match summaries, even if they were generated by different matching step runs on different snapshots of the data.
  • When running split-step mastering, allow the matching step to complete before starting the merging step.
  • Run the merging step as soon as possible after the matching step.
    • If a matching record changes between the steps, it is merged anyway, even if it no longer meets the matching criteria.
    • If new records are ingested and mapped between the matching step and the merging step, those records are not automatically added to the existing match summaries.

    If significant changes were made to the data (that is, modifications and newly mapped records) before the merging step is run, undo the previous matching, then run the matching step again.

  • Before running a matching step or a merging step, check the list of existing match summaries to make sure that your merging step doesn't process out-of-date matches.
  • If a record appears in multiple match summaries created in different matching step runs, the most recent match has precedence during merging.
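
The precedence rule in the last guideline can be illustrated with a short sketch. How Data Hub resolves this internally is not described here; the "created" field, the summary shape, and the latest_summary_per_uri function are hypothetical and only demonstrate the stated rule.

```python
def latest_summary_per_uri(summaries):
    """Map each URI to the most recently created summary that lists it.

    Assumes "created" holds ISO 8601 timestamps in one consistent format,
    so that plain string comparison orders them chronologically.
    """
    latest = {}
    for summary in summaries:
        for uri in summary["uris"]:
            current = latest.get(uri)
            if current is None or summary["created"] > current["created"]:
                latest[uri] = summary
    return latest

summaries = [
    {"created": "2024-01-01T00:00:00", "uris": ["/a.json", "/b.json"]},
    {"created": "2024-02-01T00:00:00", "uris": ["/b.json", "/c.json"]},
]
latest = latest_summary_per_uri(summaries)
print(latest["/b.json"]["uris"])  # ['/b.json', '/c.json']
```

Here /b.json appears in both runs, and the later summary wins. The older summary still applies to /a.json, which is one reason to review existing summaries before merging.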

Check for Previous Matching

To check whether a matching step has been run, search for records tagged with a collection whose name starts with datahubMasteringMatchSummary. The name might be followed by a hyphen and the name of your entity model.

  1. Go to Browse Data.
  2. Set the database to FINAL or the database you selected.
  3. In the facets on the left, select the datahubMasteringMatchSummary collection.

Undo Previous Matching

To undo a matching step run, delete the match summaries created by that run.

  1. Go to the Query Console.
  2. Search for all records with the datahubMasteringMatchSummary collection tag.
  3. Delete or archive the records found.