Mastering: Matching and Merging

Overview of mastering (matching and merging) in Data Hub.

About Mastering in MarkLogic Data Hub

Smart Mastering is a MarkLogic technology that searches your data for records that refer to the same entity, based on rules you define, and then merges those records according to thresholds you specify.

In MarkLogic Data Hub, the mastering step involves two sets of rules:

  • Matching. Matching rules define the properties to compare, how to compare them, and what thresholds must be exceeded before taking action.
  • Merging. Merging rules define how two or more matching records are merged. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.

Matching

You can create one or more rules to determine if two or more records match. Each rule compares the values of a single property in the candidate records. The comparison can be one of the following types:

  • Exact. Determines if the values of the specified property in two or more records are exactly the same.
  • Synonym. Determines if the values of the specified property in two or more records are synonyms, according to the specified thesaurus.
  • Double Metaphone. Determines if the values of the specified property in two or more records sound similar, based on the Double Metaphone algorithm. For example, "Smith" might sound like "Schmidt".
  • Zip. Determines if the zip/postal codes in two or more records match.
  • Reduce. Reduces the significance of certain matches. For example, even if the addresses and last names of two records match, that match does not necessarily indicate that the two records refer to the same person, because they might belong to two members of the same family.
  • Custom. Runs a function in your custom module to compare the values of a specified property in two or more records.
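
The comparison types above can be pictured as small functions that each take two property values and report whether they match. The following is a minimal sketch, not Data Hub code: the function names, the THESAURUS contents, and the choice to compare only the first five digits of a zip code are all assumptions made for illustration. Double Metaphone is omitted because it needs a phonetic-encoding library.

```python
from typing import Dict, Set

# Hypothetical thesaurus: each term maps to the terms treated as its synonyms.
THESAURUS: Dict[str, Set[str]] = {
    "robert": {"bob", "rob"},
    "bob": {"robert", "rob"},
}


def exact_match(a: str, b: str) -> bool:
    """Exact: the two values are identical."""
    return a == b


def synonym_match(a: str, b: str, thesaurus: Dict[str, Set[str]] = THESAURUS) -> bool:
    """Synonym: the values are equal or listed as synonyms of each other."""
    a, b = a.lower(), b.lower()
    return a == b or b in thesaurus.get(a, set()) or a in thesaurus.get(b, set())


def zip_match(a: str, b: str) -> bool:
    """Zip: compare the first five digits only (an assumption of this sketch)."""
    digits_a = "".join(ch for ch in a if ch.isdigit())[:5]
    digits_b = "".join(ch for ch in b if ch.isdigit())[:5]
    return len(digits_a) == 5 and digits_a == digits_b
```

A custom comparison would follow the same shape: any function that accepts two values and returns True or False.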

Then, you can specify thresholds and what to do when a threshold is exceeded. For example, exceeding a threshold could:

  • Trigger an automatic merge.
  • Send a notification.
  • Run a custom module.
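
One way to picture how rules and thresholds interact is a simple scoring model. The sketch below assumes that each matching property adds a weight to a score, that a reduce rule subtracts from it, and that the highest threshold reached decides the action. The weights, property names, threshold values, and the choice to treat a score equal to a threshold as reaching it are all hypothetical; only exact comparison is used to keep the example short.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule weights: points added when a property matches exactly.
MATCH_WEIGHTS: Dict[str, int] = {"ssn": 50, "lastName": 10, "address": 5}

# Hypothetical reduce rule: points subtracted when all listed properties match.
REDUCE_RULES: List[Tuple[Tuple[str, ...], int]] = [(("lastName", "address"), 5)]

# Hypothetical thresholds, as (minimum score, action) pairs.
THRESHOLDS: List[Tuple[int, str]] = [(60, "merge"), (10, "notify")]


def score(a: Dict[str, str], b: Dict[str, str]) -> int:
    """Add the weights of matching properties, then apply reduce rules."""
    matched = {p for p in MATCH_WEIGHTS if p in a and a.get(p) == b.get(p)}
    total = sum(MATCH_WEIGHTS[p] for p in matched)
    for props, penalty in REDUCE_RULES:
        if all(p in matched for p in props):
            total -= penalty
    return total


def action_for(total: int) -> Optional[str]:
    """Return the action of the highest threshold reached, or None."""
    for minimum, action in sorted(THRESHOLDS, reverse=True):
        if total >= minimum:
            return action
    return None
```

In this model, two records that share only a last name and an address score low enough to trigger a notification rather than an automatic merge, which is the situation the Reduce comparison is meant to handle.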

Merging

In a merge, a new record is created, and the values from the original records are combined and copied to the new record according to the rules you specify. For example:

  • You can restrict the number of unique values copied to the new record.
  • You can restrict the number of data sources from which to copy values.
  • You can specify that only records from specific data sets are allowed to be merged. You can also assign a weight to each source to give priority to more reliable sources.
  • You can also assign a weight to the length of a string.
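
The merge options above can be sketched as a ranking of candidate values for a single property. This is an illustrative model only: the SOURCE_WEIGHTS table, the merge_values function, and the way length_weight is multiplied by the string length are assumptions, not the Data Hub implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical source weights: a higher weight marks a more reliable source.
SOURCE_WEIGHTS: Dict[str, int] = {"crm": 10, "web-form": 2}


def merge_values(
    candidates: List[Tuple[str, str]],
    max_values: int = 1,
    length_weight: int = 0,
) -> List[str]:
    """Pick up to max_values unique values from (value, source) pairs.

    Values are ranked by source weight plus length_weight times the
    string length, highest first.
    """
    def rank(item: Tuple[str, str]) -> int:
        value, source = item
        return SOURCE_WEIGHTS.get(source, 0) + length_weight * len(value)

    merged: List[str] = []
    for value, _source in sorted(candidates, key=rank, reverse=True):
        if len(merged) >= max_values:
            break
        if value not in merged:
            merged.append(value)
    return merged
```

For example, merge_values([("Bob", "web-form"), ("Robert", "crm")]) keeps only the value from the higher-weighted source, while passing max_values=2 keeps both.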

If you reuse a particular combination of merge settings, you can save it as a strategy and refer to it by the strategy name.

You can also implement your own merge logic in a custom module.

Merging is non-destructive. The merged record is a new record; the original records stay in the database and are tagged as archived.
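
The non-destructive behavior can be illustrated with an in-memory dictionary standing in for the database. Everything in this sketch is hypothetical, including the merge_records function, the record layout, and the "merged" and "archived" tag names. It only shows the idea that originals are tagged rather than deleted.

```python
from typing import Dict, List


def merge_records(database: Dict[str, dict], uris: List[str], merged_uri: str) -> None:
    """Create a merged record and tag the originals as archived."""
    combined: Dict[str, list] = {}
    for uri in uris:
        for prop, value in database[uri]["content"].items():
            values = combined.setdefault(prop, [])
            if value not in values:
                values.append(value)  # keep each distinct value once

    # The merged record is new; it remembers which records it came from.
    database[merged_uri] = {
        "content": combined,
        "tags": ["merged"],
        "merged_from": list(uris),
    }

    # The originals are kept and tagged, never deleted.
    for uri in uris:
        database[uri]["tags"].append("archived")
```

Because the originals remain in place, a merge modeled this way can be reviewed or reversed later by consulting the records listed in merged_from.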