Matching

Overview of matching in Data Hub.

About Matching in MarkLogic Data Hub

Matching determines if two records are candidates for merging, based on the degree of similarity between them and the weight of the comparison. Matching rules define the properties to compare, how to compare them, and what thresholds must be exceeded before taking action.

  • Match options define how the records are compared.
  • Match thresholds define the limits and the actions to take when those limits are exceeded.

The matching configuration is stored as properties at the root of the match step document. See Matching Step Configuration Structure.

Rules

You can create one or more rules to determine if two or more records match. Each rule compares the values of a single property in the candidate records. The comparison can be one of the following types:

  • Exact. Determines if the values of the specified entity property in two or more records are exactly the same.
  • Synonym. Determines if the values of the specified entity property in two or more records are synonyms, according to the specified thesaurus.
  • Double Metaphone. Determines if the values of the specified entity property in two or more records sound similar, based on the Double Metaphone algorithm. For example, "Smith" might sound like "Schmidt".
  • Zip. Determines if the zip/postal code in two or more records match.
  • Reduce. Reduces the significance of certain matches. For example, even if the addresses and last names of two records match, the similarity might not necessarily indicate that the two records refer to the same person, because they might be two members of the same family.
  • Custom. Runs a function in your custom module to compare the values of a specified entity property in two or more records.

Matching rules are grouped into rulesets in the matchRulesets array. Each match ruleset has a name, a weight, and one or more match rules. The weight is added to a candidate’s score only if the query provided by every match rule in the ruleset is true for that candidate record.
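The all-or-nothing behavior of a ruleset can be sketched in a few lines of Python. This is an illustration only, not Data Hub's implementation: the real step evaluates queries inside MarkLogic, while here each rule is a hypothetical in-memory function, and the names exact, score, and the "rules" key are assumptions made for the example.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# Stand-in for a match rule: compares one property of two records.
Rule = Callable[[Record, Record], bool]


def exact(prop: str) -> Rule:
    """Build a hypothetical rule that is true when both records hold the same value."""
    return lambda a, b: prop in a and a.get(prop) == b.get(prop)


def score(rulesets: List[dict], a: Record, b: Record) -> int:
    """Add a ruleset's weight only when every rule in that ruleset is true."""
    total = 0
    for ruleset in rulesets:
        if all(rule(a, b) for rule in ruleset["rules"]):
            total += ruleset["weight"]
    return total


rulesets = [
    {"name": "lastName - Exact", "weight": 10, "rules": [exact("lastName")]},
    {"name": "firstName and zip", "weight": 5,
     "rules": [exact("firstName"), exact("zip")]},
]

a = {"firstName": "Ann", "lastName": "Smith", "zip": "94070"}
b = {"firstName": "Ann", "lastName": "Smith", "zip": "10001"}

# The second ruleset contributes nothing because its zip rule is false.
print(score(rulesets, a, b))  # 10
```

Grouping two rules into one ruleset therefore acts as a logical AND, whereas placing them in separate rulesets lets each contribute its weight independently.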

All of the information for the property and the match type is associated directly with the match rule. Each match rule defines the path to the property, the match type, and a set of options specific to the match type.
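As a rough picture of that layout, the following Python snippet builds a dictionary shaped like a match configuration and prints it as JSON. Only matchRulesets, name, weight, and entityPropertyPath come from the text of this page; the key names matchRules, matchType, and options, and all of the values, are assumptions chosen for illustration, so check them against Matching Step Configuration Structure before use.

```python
import json

# Hypothetical match configuration; key names other than matchRulesets,
# name, weight, and entityPropertyPath are assumed for this sketch.
match_config = {
    "matchRulesets": [
        {
            "name": "lastName - Exact",
            "weight": 10,
            "matchRules": [
                {
                    "entityPropertyPath": "lastName",  # path to the property
                    "matchType": "exact",              # how to compare it
                    "options": {},                     # type-specific options
                }
            ],
        }
    ]
}

print(json.dumps(match_config, indent=2))
```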

There are two ways to target values in a record for matching in a matching rule:

  • entityPropertyPath uses dot notation to indicate the location of a property in a record. The path is formed by chaining the property titles together with periods.
  • documentXPath is an XPath expression that targets the location of a node in the record.
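To make the dot notation concrete, here is a small Python sketch that follows such a path through a record held as nested dictionaries. It is not how Data Hub resolves paths internally (that is not described here); the resolve function, the sample record, and the property names are all hypothetical.

```python
from typing import Any, List


def resolve(record: dict, path: str) -> List[Any]:
    """Return every value reached by following a dot-separated property path."""
    current = [record]
    for name in path.split("."):
        found = []
        for node in current:
            # Step through arrays so a path can cross repeated properties.
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and name in item:
                    found.append(item[name])
        current = found
    # Flatten a trailing array so callers always receive leaf values.
    values: List[Any] = []
    for node in current:
        values.extend(node if isinstance(node, list) else [node])
    return values


customer = {
    "lastName": "Smith",
    "shipping": [{"zip": "94070"}, {"zip": "10001"}],
}

print(resolve(customer, "lastName"))      # ['Smith']
print(resolve(customer, "shipping.zip"))  # ['94070', '10001']
print(resolve(customer, "billing.zip"))   # []
```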

Thresholds

You can specify thresholds and what to do when a threshold is exceeded. For example, exceeding a threshold could:

  • Trigger an automatic merge.
  • Send a notification.
  • Run a custom module.

Thresholds are non-overlapping score limits, each associated with an action. The actions run when the Merging step runs. Two actions are provided out of the box: merge and notify. The notify action creates notification documents in the content database; each one records a match between documents so that it can be reviewed later. The merge action creates a new record from the group of records that meet the threshold score.
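The relationship between a score and non-overlapping thresholds can be sketched as follows. This is a simplified model, not the product's code: the key names thresholdName, action, and score, the sample values, and the choice of "greater than or equal to" as the comparison are assumptions made for the example.

```python
from typing import List, Optional

# Hypothetical thresholds; names and scores are illustrative only.
thresholds = [
    {"thresholdName": "Possible Match", "action": "notify", "score": 10},
    {"thresholdName": "Definite Match", "action": "merge", "score": 15},
]


def action_for(score: float, thresholds: List[dict]) -> Optional[str]:
    """Return the action of the highest threshold that the score reaches."""
    reached = [t for t in thresholds if score >= t["score"]]
    if not reached:
        return None  # below every threshold: no action is taken
    return max(reached, key=lambda t: t["score"])["action"]


print(action_for(9, thresholds))   # None
print(action_for(12, thresholds))  # notify
print(action_for(20, thresholds))  # merge
```

Because only the highest threshold reached is used, each score maps to at most one action, which is one way to read "do not overlap".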

Testing a Matching Step

When testing Matching steps, MarkLogic recommends that you first understand what a successful match looks like. This enables you to identify whether matches are made for the right reason.

Note: MarkLogic recommends testing with a small number of records (250 or fewer).

When you are done testing a Matching step, delete records tagged with the datahubMasteringMatchSummary collection.

Check if Matching Step Ran in Hub Central

To check if a Matching step ran successfully, search for records tagged with the datahubMasteringMatchSummary collection.

Tip: In Hub Central, the collection might be followed by a hyphen and the name of your entity model.

To search for records by collection, see Filter Data Using Hub Central.

Delete Records Based on Collection

Hub Central

To delete records tagged with a collection in Hub Central, clear a subset of user data based on the datahubMasteringMatchSummary collection.

  1. Click the info icon at the top.
  2. In the Clear User Data tile, select the Clear Subset of User Data option.
    1. In the Select a Database drop-down menu, choose the database from which to clear user data.
    2. In the Based on drop-down menu, choose Collection. Then search for the datahubMasteringMatchSummary collection to clear a subset of user data based on this collection.
  3. Click Clear.

Gradle

The following Gradle task deletes all user data in the specified collection.

On Linux or macOS:

./gradlew mlDeleteCollections -Pcollections=datahubMasteringMatchSummary -i

On Windows:

gradlew.bat mlDeleteCollections -Pcollections=datahubMasteringMatchSummary -i