Mastering: Matching and Merging

Overview of mastering (matching and merging) in Data Hub.

About Mastering in MarkLogic Data Hub

Smart Mastering is a MarkLogic technology that searches for records in your data that refer to the same entity based on rules you define and then merges them depending on thresholds you specify.

In MarkLogic Data Hub, Smart Mastering consists of two steps: Matching and Merging. These steps typically run after the Mapping step.

  • Matching determines if two records are candidates for merging, based on the degree of similarity between them and the weight of the comparison. Matching rules define the properties to compare, how to compare them, and what thresholds must be exceeded before taking action.
    • Match options define how the records are compared.
    • Match thresholds define the limits and the actions to take when the results exceed them.
  • Merging handles the candidates accordingly, based on thresholds. Merging rules define how two or more records in the data would be merged together. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.
    • Merge options define how the properties of the candidate records are combined.
    • Merge strategies are sets of merge options that you can name and reuse.
    • Merge collections are sets of records that have the same collection tags.

Matching

You can create one or more rules to determine if two or more records match. Each rule compares the values of a single property in the candidate records. The comparison can be one of the following types:

  • Exact. Determines if the values of the specified entity property in two or more records are exactly the same.
  • Synonym. Determines if the values of the specified entity property in two or more records are synonyms, according to the specified thesaurus.
  • Double Metaphone. Determines if the values of the specified entity property in two or more records sound similar, based on the Double Metaphone algorithm. For example, "Smith" might sound like "Schmidt".
  • Zip. Determines if the zip/postal code in two or more records match.
  • Reduce. Reduces the significance of certain matches. For example, even if the addresses and last names of two records match, the similarity might not necessarily indicate that the two records refer to the same person, because they might be two members of the same family.
  • Custom. Runs a function in your custom module to compare the values of a specified entity property in two or more records.

Then, you can specify thresholds and what to do when a threshold is exceeded. For example, exceeding a threshold could:

  • Trigger an automatic merge.
  • Send a notification.
  • Run a custom module.

Merging

In a merge, a new record is created, and the values from the original records are combined and copied to the new record according to the rules you specify. For example:

  • You can restrict the number of unique values copied to the new record.
  • You can restrict the number of data sources from which to copy values.
  • You can specify that only records from specific datasets are allowed to be merged. And you can assign a weight to each source, so you can give priority to more reliable sources.
  • You can also assign a weight to the length of a string.

If you use certain combinations of merge settings, you can save them as a strategy and refer to them by the strategy name.

You can also do your own merge using a custom module.

Merging is non-destructive. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.

Check for Previous Matching

To check whether a matching step has been run, search for records in collections whose names start with datahubMasteringMatchSummary. The collection name might be followed by a hyphen and the name of your entity model.

  1. Go to Browse Data.
  2. Set the database to FINAL or the database you selected.
  3. In the facets on the left, select the datahubMasteringMatchSummary collection.

Undo Previous Matching

To undo a matching step run, delete the match summaries that the run created.

  1. Go to the Query Console.
  2. Search for all records with the datahubMasteringMatchSummary collection tag.
  3. Delete or archive the records found.

Examples of Matching and Merging Components

Matching leverages MarkLogic search capabilities to find candidate documents based on search criteria or match rules that the user defines. The weight given to each matching ruleset determines the overall score of the candidate document. Thresholds define actions that are taken given a candidate document’s weight. The actions assigned to a threshold are "merge", "notify", or "custom".

Merging acts on the output of the matching step. Merging rules can be defined to determine the survivorship of Entity properties, giving control of what is carried forward into a merged document. Merging configuration can also adjust collections applied to the different categorizations of documents that are output by the merging step. These categorizations include archived documents, merged documents, and notification documents indicating near matches.

Matching: Custom Scoring

The matching configuration is stored as properties at the root of the match step document.

Note: Developers can reference custom code in the match step to change how documents are scored.

"scoreDocumentInterceptors": [
  {"path": "/custom-modules/matching/interceptors/scoreMatch.sjs", "function": "scoreA"},
  {"path": "/custom-modules/matching/interceptors/scoreMatch.sjs", "function": "scoreB"}
]

The signature for the custom code:

scoreDocumentInterceptor(defaultScore, contentObjectA, contentObjectB, matchingRulesets)

Returns the score for a pair of documents as a double.

  @param {double} defaultScore - the default score, which determines the threshold that the document match lands in
  @param {ContentObject} contentObjectA - one of the two documents being scored
  @param {ContentObject} contentObjectB - the other document being scored
  @param {MatchRulesetDefinition[]} matchingRulesets
  @return double

MatchRulesetDefinition
  - name() Name for the Match Ruleset
  - weight() Score for the Match Ruleset
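
As an illustration of the signature above, here is a minimal sketch of an interceptor body. The policy it applies (halving the score when only one ruleset is passed in) is invented for this example, and it assumes that matchingRulesets contains the rulesets that matched for the pair. The mock objects exist only so that the sketch runs outside MarkLogic.

```javascript
// Hypothetical interceptor that follows the documented signature.
// Assumption: matchingRulesets holds the rulesets that matched for this pair.
function scoreA(defaultScore, contentObjectA, contentObjectB, matchingRulesets) {
  const names = [];
  for (const ruleset of matchingRulesets) {
    names.push(ruleset.name());
  }
  // Invented policy: a match that rests on a single ruleset counts for half.
  return names.length <= 1 ? defaultScore / 2 : defaultScore;
}

// Mock MatchRulesetDefinition objects exposing name() and weight().
const mockRuleset = (name, weight) => ({ name: () => name, weight: () => weight });

const singleRulesetScore = scoreA(3.5, {}, {}, [mockRuleset("name - Exact", 3.5)]);
const twoRulesetScore = scoreA(5, {}, {}, [
  mockRuleset("name - Exact", 3.5),
  mockRuleset("name - Zip", 1.5)
]);
console.log(singleRulesetScore, twoRulesetScore);
```

In a real module, the function would also need to be exported so that the "function" entry in scoreDocumentInterceptors can find it.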
	  
	  
Note: Developers can add match step configurations that restrict which documents can match with other documents by altering the cts query used to match a set of documents.

"filterQueryInterceptors": [
  { "path": "/non-existing/matchableInterceptors.sjs", "function": "myNonExistentFunction" }
]

Example filter query interceptor code:

function filterQueryInterceptor(filterQuery, docNode) {
  // Combine the default filter query with a collection query taken from the document headers.
  return cts.andQuery([
    filterQuery,
    cts.collectionQuery(fn.string(docNode.xpath("envelope/headers/filterCollection")))
  ]);
}
	  

Rules

Matching rules are contained under the “matchRulesets” array with each match ruleset having a name, weight, and match rules. Every query provided by each match rule must be true against a candidate document for the weight to be added to the candidate’s score.

All of the information for the property and the match type is associated directly with the match rule. Each match rule defines the entity property path to the property being matched, the match type, and a set of options that are specific to the given match type.

 
"matchRulesets": [
  {
    "name": "name - Exact",
    "weight": 3.5, 
    "matchRules": [
      {
        "entityPropertyPath": "name.lastName",
        "matchType": "exact",
        "options": {}
      } 
    ]
  }
],
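
The scoring behavior described above can be sketched in plain JavaScript. This is an illustration only, not Data Hub's implementation: it models only the "exact" match type, compares two in-memory objects instead of running queries, and the helper names (resolvePath, ruleMatches, scoreCandidate) and the second ruleset are hypothetical.

```javascript
// Resolve a dot-notation entityPropertyPath against a plain object.
function resolvePath(instance, path) {
  return path
    .split(".")
    .reduce((node, key) => (node == null ? undefined : node[key]), instance);
}

// Only the "exact" match type is modelled in this sketch.
function ruleMatches(rule, docA, docB) {
  const a = resolvePath(docA, rule.entityPropertyPath);
  const b = resolvePath(docB, rule.entityPropertyPath);
  return a !== undefined && a === b;
}

// Add a ruleset's weight only when every rule in the ruleset matches.
function scoreCandidate(matchRulesets, docA, docB) {
  let score = 0;
  for (const ruleset of matchRulesets) {
    if (ruleset.matchRules.every((rule) => ruleMatches(rule, docA, docB))) {
      score += ruleset.weight;
    }
  }
  return score;
}

const matchRulesets = [
  {
    name: "name - Exact",
    weight: 3.5,
    matchRules: [{ entityPropertyPath: "name.lastName", matchType: "exact", options: {} }]
  },
  {
    name: "city - Exact", // hypothetical second ruleset
    weight: 2,
    matchRules: [{ entityPropertyPath: "address.city", matchType: "exact", options: {} }]
  }
];

const docA = { name: { lastName: "Smith" }, address: { city: "Austin" } };
const docB = { name: { lastName: "Smith" }, address: { city: "Dallas" } };
const exampleScore = scoreCandidate(matchRulesets, docA, docB);
console.log(exampleScore);
```

Only the first ruleset contributes here, because the city values differ and a ruleset adds its weight only when all of its rules hold.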

There are 2 ways to target values in a document for matching in a matching rule.

entityPropertyPath is a dot notation that indicates the location of a property in an entity instance by chaining together the property titles with a period.

documentXPath is an XPath expression for the location of a node in the document holding the entity instance. It can have a sibling namespaces property, which is an object in the form of { "prefix": "namespaceURI" }.

Matching on Exact Values

The most basic match rule takes the values from the input document for a given property and generates an element or JSON property value query with the property values as input.

The match rule for exact has the “matchType” value of “exact”. The exact match type does not currently take any options.

 
"matchRulesets": [
  {
    "name": "name - Exact",
    "weight": 3.5, 
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "exact",
        "options": {}
      } 
    ]
  }
],

Matching on Double Metaphone

The double metaphone match is driven by a dictionary document and leverages the spell.suggest function. This expands the values passed into the property query. See below to learn how to create and load a dictionary.

The match rule for double metaphone has the “matchType” value of “doubleMetaphone”. The options contain “dictionaryURI”, the URI of the dictionary document, and “distanceThreshold”, the threshold for the double metaphone variance.

 
"matchRulesets": [
  {
    "name": "name - Double Metaphone",
    "weight": 3.5, 
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "doubleMetaphone",
        "options": {
          "dictionaryURI": "/nameDictionary.json",
          "distanceThreshold": 100
        }
      } 
    ]
  }
],

How to Create and Load a Dictionary

You can load two types of dictionary documents into MarkLogic: XML and JSON.

To create a dictionary document, use the spell:make-dictionary function:


spell:make-dictionary(
  words as xs:string*,
  [output-kind as xs:string]
) as item()
	  

To load a dictionary document into the database, use the spell:load function:


spell:load(
  path as xs:string,
  uri as xs:string
) as empty-sequence()
	  
For example, you can utilize these functions by:
  1. Creating an index for the field you intend to match.
  2. Creating a custom step that builds and updates the index.
  3. Running the custom step that updates the index before running the matching step.
Note: You may use the following sample XML and JSON dictionary documents if you do not intend to create your own. They are available at this GitHub repository: https://github.com/marklogic/dictionaries

Matching on Synonym

The synonym match is driven by a thesaurus document and leverages the thsr.expand function. This will expand the values passed into the property query. A URI to a thesaurus is required and optional XML as text can be used to filter results. (It is recommended to have separate thesaurus documents for filtering where possible.)

The match rule for synonym has the “matchType” value of “synonym”. The options contain “thesaurusURI”, the URI of the thesaurus document, and “filter”, an optional field of quoted XML that can filter thesaurus entries.

 
"matchRulesets": [
  {
    "name": "name - Synonym",
    "weight": 3.5, 
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "synonym",
        "options": {
          "thesaurusURI": "/thesauri/name-synonyms.xml",
          "filter": "<qualifier>english</qualifier>"
        }
      } 
    ]
  }
],
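
To show the idea of expanding a value before comparing it, here is a minimal sketch that uses a small in-memory thesaurus. It does not call thsr.expand and does not read a thesaurus document. The thesaurus contents and the helper names (expandWithSynonyms, synonymMatch) are assumptions made for the illustration.

```javascript
// Hypothetical in-memory thesaurus; a real match step reads it from thesaurusURI.
const thesaurus = {
  bob: ["robert", "rob"],
  liz: ["elizabeth", "beth"]
};

// Expand a value into the set of terms that the thesaurus treats as equivalent.
function expandWithSynonyms(value, entries) {
  const term = value.toLowerCase();
  const expanded = new Set([term]);
  for (const [head, synonyms] of Object.entries(entries)) {
    if (head === term || synonyms.includes(term)) {
      expanded.add(head);
      synonyms.forEach((synonym) => expanded.add(synonym));
    }
  }
  return expanded;
}

// Two values match when the second appears in the expansion of the first.
function synonymMatch(valueA, valueB, entries) {
  return expandWithSynonyms(valueA, entries).has(valueB.toLowerCase());
}

console.log(synonymMatch("Bob", "Robert", thesaurus));
console.log(synonymMatch("Bob", "Elizabeth", thesaurus));
```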

Matching on Zip

Matching on zip will construct property queries such that 5-digit zip codes can match with 9-digit zip codes and vice versa.

Note: To match 5-digit zip codes against 9-digit zip codes, you must enable the three character searches and trailing wildcard searches indexes. For more information, see Understanding and Using Wildcard Searches in the Search Developer's Guide.

The match rule for zip has the “matchType” value of “zip”.

Note: Multiple weights are not necessary for this match query; the weight is assigned to the entire ruleset.
 
"matchRulesets": [
  {
    "name": "name - Zip",
    "weight": 1.5, 
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "zip",
        "options": {}
      } 
    ]
  }
],
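
The 5-digit versus 9-digit comparison can be sketched as a plain string comparison. This is an illustration only: the actual match type builds property queries, and the rule used here for two 9-digit codes (they must agree on all nine digits) is an assumption. The helper names normalizeZip and zipMatch are hypothetical.

```javascript
// Strip separators and accept only 5-digit or 9-digit codes.
function normalizeZip(zip) {
  const digits = String(zip).replace(/\D/g, "");
  return digits.length === 5 || digits.length === 9 ? digits : null;
}

// Codes of equal length must be identical; otherwise compare the first five digits.
// Assumption: two 9-digit codes with different +4 suffixes do not match.
function zipMatch(zipA, zipB) {
  const a = normalizeZip(zipA);
  const b = normalizeZip(zipB);
  if (a === null || b === null) {
    return false;
  }
  if (a.length === b.length) {
    return a === b;
  }
  return a.slice(0, 5) === b.slice(0, 5);
}

console.log(zipMatch("78701", "78701-1234"));
console.log(zipMatch("78701-1234", "78701-9999"));
```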

Custom Matching

Custom matching allows a developer to provide their own function for constructing and returning cts queries. The module library can exist at any location in the modules database, but following the convention ‘/custom-modules/matching/<matchTypeName>.sjs’ is recommended.

The function signature of the custom match function must accept 3 arguments. The first argument is the property node or sequence of property nodes from the entity instance. The second argument is the matching rule from the match configuration. The third argument is the entire match configuration.

The match rule for custom has the “matchType” value of “custom”. “algorithmModulePath” defines the location of the module library. “algorithmModuleNamespace” defines the namespace of the module library, if using XQuery. “algorithmFunction” is the name of the function exported by the module library.

 
"matchRulesets": [
  {
    "name": "name - Custom",
    "weight": 1.5, 
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "custom",
        "algorithmModuleNamespace": "",
        "algorithmModulePath": "/custom-modules/matching/nameMatch.sjs",
        "algorithmFunction": "nameMatch",
        "options": {}
      } 
    ]
  }
],
Note: To add custom functions, store your custom modules in the subdirectory ./root/custom-modules.

function customMatchString(nodeValues, matchRule, matchStep) {
  return fn.string(fn.head(nodeValues));
}

function customMatchSequence(nodeValues, matchRule, matchStep) {
  return nodeValues;
}

function customMatchArray(nodeValues, matchRule, matchStep) {
  return nodeValues.toArray();
}
      
Important: Custom functions can also return a function that takes a candidate document and returns a boolean.

function customFunctionMatchRule(values) {
  return (documentB) => {
    return cts.contains(documentB, values);
  };
}
	  
Tip: To export these functions for a matching step configuration, use exports.functionName = functionName.

Reduce Matching Score

There may be a situation where a match should give a negative weight to reduce the likelihood of a match.

The reduction of the match score is now associated with an entire match ruleset. Setting the property “reduce” to the Boolean true will cause the weight to be negative.

 
"matchRulesets": [
  {
    "name": "name - Reduce",
    "weight": 1.5, 
    "reduce": true,  
    "matchRules": [
      {
        "entityPropertyPath": "address",
        "matchType": "exact",
        "options": {}
      },
      {
        "entityPropertyPath": "lastName",
        "matchType": "exact",
        "options": {}
      } 
    ]
  }
],
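
The effect of "reduce" on the total can be sketched as follows. The sketch assumes that the score is the sum of the signed weights of the rulesets that matched. The helper names and the first two rulesets are hypothetical.

```javascript
// A ruleset with reduce set to true contributes its weight as a negative number.
function signedWeight(ruleset) {
  return ruleset.reduce === true ? -ruleset.weight : ruleset.weight;
}

// Assumption: the candidate's score is the sum of the signed weights of matched rulesets.
function totalScore(matchedRulesets) {
  return matchedRulesets.reduce((sum, ruleset) => sum + signedWeight(ruleset), 0);
}

const matchedRulesets = [
  { name: "name - Exact", weight: 3.5 },          // hypothetical
  { name: "address - Exact", weight: 5 },         // hypothetical
  { name: "name - Reduce", weight: 1.5, reduce: true }
];

const reducedScore = totalScore(matchedRulesets);
console.log(reducedScore);
```

In this example the two positive rulesets add 8.5 and the reduce ruleset subtracts 1.5, which could drop the candidate below a merge threshold and into a notify threshold.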

Thresholds and Actions

Thresholds are score limits that are associated with actions, and they do not overlap. The actions occur when the merge step runs. The built-in actions are “notify” and “merge”. “notify” creates notification documents in the content database that record a match between documents, which can be reviewed later. “merge” composes a new entity instance based on the group of entity instances that meet the threshold score.

Creating a Custom Action

The function signature of the custom action function must accept 3 arguments. The first argument is the URI string of the entity instance that triggered the match query. The second argument is the set of candidate entity instances that met the threshold score. The third argument is the entire merge configuration.

Example JavaScript signature:

householdAction(uri, matches, mergeConfiguration)

Example XQuery signature:

ns:household-action($uri as xs:string, $matches as node()*, $merge-options as node())

All action information is condensed into the thresholds. “thresholdName” defines the name for the threshold. “action” can have the values “notify”, “merge”, or “custom”. “score” is the bottom score required for a candidate entity instance to be placed into the threshold.

When using the “custom” action, “actionModulePath” defines the location of the module library. “actionModuleNamespace” defines the namespace of the module library, if using XQuery. “actionModuleFunction” is the name of the function exported by the module library.

 
"thresholds": [
  {
    "thresholdName": "similarThreshold",
    "action": "notify",
    "score": 6.5
  },
  {
    "thresholdName": "household",
    "action": "custom",
    "score": 8.5,
    "actionModulePath": "/custom-modules/matching/householdAction.xqy",
    "actionModuleNamespace": "http://marklogic.com/smart-mastering/action",
    "actionModuleFunction": "household-action"
  }
],
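
Because “score” is the bottom score for a threshold and thresholds do not overlap, choosing a threshold for a candidate can be sketched as picking the highest threshold whose score the candidate reaches. The function name selectThreshold is hypothetical, and treating a score equal to the threshold as a hit is an assumption.

```javascript
// Return the highest threshold whose bottom score the candidate reaches, or null.
// Assumption: a candidate score equal to the threshold score counts as reaching it.
function selectThreshold(score, thresholds) {
  let selected = null;
  for (const threshold of thresholds) {
    const reached = score >= threshold.score;
    if (reached && (selected === null || threshold.score > selected.score)) {
      selected = threshold;
    }
  }
  return selected;
}

const thresholds = [
  { thresholdName: "similarThreshold", action: "notify", score: 6.5 },
  { thresholdName: "household", action: "custom", score: 8.5 }
];

console.log(selectThreshold(7, thresholds));
console.log(selectThreshold(9, thresholds));
console.log(selectThreshold(3, thresholds));
```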

Merging

The merging configuration is stored as properties at the root of the merging step document.

Merge Rules

Merge rules define how entity instance properties or document nodes are selected for the composite entity instance in the new merged document. There are two merge types: standard and custom.

The default sorting order for properties is the last updated DateTime in descending order.

There are 2 ways to target values in a document for merging via a merge rule.

entityPropertyPath is a dot notation that indicates the location of a property in an entity instance by chaining together the property titles with a period.

documentXPath is an XPath expression for the location of a node in the document holding the entity instance. It can have a sibling namespaces property, which is an object in the form of { "prefix": "namespaceURI" }.

Defining the Last Updated Datetime Location

The default sort order for the properties output by the standard merge is the last updated dateTime, if available in the merged documents. The location of a dateTime value in the document can be defined by providing XPath via configuration.

 
{
  "mergeStrategies": [],
  "mergeRules": [],
  "lastUpdatedLocation": {
    "namespaces": {
      "es": "http://marklogic.com/entity-services",
      "sm": "http://marklogic.com/smart-mastering"
    },
    "documentXPath": "/es:envelope/es:headers/sm:sources/sm:source/sm:dateTime"
  }
}
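
The default ordering can be sketched as a sort on the last updated value, newest first. The object shape used here follows the property detail description in the Custom Merging section (name, sources, values). Placing entries without a dateTime last is an assumption, and the function names are hypothetical.

```javascript
// Parse the last updated value of a property detail; null when missing or invalid.
function lastUpdatedMillis(detail) {
  const parsed = Date.parse(detail.sources && detail.sources.dateTime);
  return Number.isNaN(parsed) ? null : parsed;
}

// Sort newest first. Assumption: entries without a dateTime go last.
function sortByLastUpdated(details) {
  return [...details].sort((a, b) => {
    const timeA = lastUpdatedMillis(a);
    const timeB = lastUpdatedMillis(b);
    if (timeA === null && timeB === null) return 0;
    if (timeA === null) return 1;
    if (timeB === null) return -1;
    return timeB - timeA;
  });
}

const propertyDetails = [
  { name: "name", values: "Bob",
    sources: { name: "A", dateTime: "2020-01-01T00:00:00Z", documentUri: "/a.json" } },
  { name: "name", values: "Robert",
    sources: { name: "B", dateTime: "2021-06-15T12:00:00Z", documentUri: "/b.json" } },
  { name: "name", values: "Rob",
    sources: { name: "C", documentUri: "/c.json" } }
];

const sortedDetails = sortByLastUpdated(propertyDetails);
console.log(sortedDetails.map((detail) => detail.values));
```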

Standard Merge

The standard merge provides a way to sort the properties from the candidate entity instances by source or length and to select a maximum number of values or a maximum number of sources. Values and sources differ in the case of arrays. Setting a max source of 1 on an array property brings the entire array over from one entity instance. Setting a max value of 1 on an array property brings over a single item from the array or arrays in the set of candidate entity instances.

 
{
  "mergeRules": [
    {
      "entityPropertyPath": "name",
      "maxSources": 1,
      "priorityOrder": {
        "lengthWeight": 2,
        "sources": [
          {
            "sourceName": "favoriteSource",
            "weight": 12
          },
          {
            "sourceName": "lessFavoriteSource",
            "weight": 10
          }
        ]
      }
    }
  ]
}
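
The source-priority part of this configuration can be sketched as ranking candidates by source weight and keeping the top maxSources entries. This is a simplified illustration: it ignores lengthWeight, gives unlisted sources a weight of 0 (an assumption), and the function name selectBySourceWeight and the candidate data are hypothetical.

```javascript
// Rank property details by the weight of their source and keep the top maxSources.
// Simplification: lengthWeight is not modelled; unlisted sources are assumed to weigh 0.
function selectBySourceWeight(propertyDetails, mergeRule) {
  const weights = new Map(
    (mergeRule.priorityOrder?.sources ?? []).map((s) => [s.sourceName, s.weight])
  );
  const weightOf = (detail) => weights.get(detail.sources.name) ?? 0;
  const ranked = [...propertyDetails].sort((a, b) => weightOf(b) - weightOf(a));
  const limit = mergeRule.maxSources ?? ranked.length;
  return ranked.slice(0, limit);
}

const mergeRule = {
  entityPropertyPath: "name",
  maxSources: 1,
  priorityOrder: {
    sources: [
      { sourceName: "favoriteSource", weight: 12 },
      { sourceName: "lessFavoriteSource", weight: 10 }
    ]
  }
};

const candidates = [
  { name: "name", values: "R. Smith",
    sources: { name: "lessFavoriteSource", documentUri: "/b.json" } },
  { name: "name", values: "Robert Smith",
    sources: { name: "favoriteSource", documentUri: "/a.json" } },
  { name: "name", values: "Bob",
    sources: { name: "unknownSource", documentUri: "/c.json" } }
];

const survivors = selectBySourceWeight(candidates, mergeRule);
console.log(survivors.map((detail) => detail.values));
```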

Custom Merging

Custom merging allows a developer to provide their own function for selecting the properties from the candidate entity instances. The module library can exist at any location in the modules database, but following the convention ‘/custom-modules/merging/<mergeTypeName>.sjs’ is recommended.

The function signature of the custom merge function must accept 3 arguments. The first argument is the property name. The second argument is an array (or sequence in the case of XQuery) of property details. The third argument is the merge rule.

A property detail object has three properties: name, sources, and values. The name key is the name of the property. The value for sources is itself an object with three keys: name (an identifier extracted from the source), dateTime (an xs:dateTime extracted from the source, if available), and documentUri (the URI that identifies this particular source document). The values key holds the property value or values from this particular source document.

 
{
  "mergeRules": [
    {
      "entityPropertyPath": "addressLocalName",
      "mergeModulePath": "/custom/merge/strategy.sjs",
      "mergeModuleFunction": "mergeAddress",
      "options": {}
    }
  ]
}
      

Example XQuery custom merge function:

 
declare function algorithm:custom-merge-limit(
  $property-name as xs:QName,
  $properties as map:map*,
  $merge-rule as object-node()
) as map:map*
{
  let $default-limit := if ($merge-rule/entityPropertyPath = "Phone") then 5 else 10
  (: Keep the first N property details, starting at position 1. :)
  return fn:subsequence(
        $properties,
        1,
        fn:head(($merge-rule/maxValues, $default-limit))
  )
};

Example JavaScript custom merge function:

 
function customMergeLimit(propertyName, properties, mergeRule)
{
  const defaultLimit = mergeRule.entityPropertyPath === "Phone" ? 5 : 10;
  // Keep the first N property details, starting at position 1.
  return fn.subsequence(
        properties,
        1,
        mergeRule.maxValues || defaultLimit
  );
}

Custom Merging: Additional Functionality

The applyDocumentContextInterceptors functionality applies custom logic and permissions to merged documents. See the example below.

Example configuration for merge step:

"applyDocumentContextInterceptors": [ {"path":"/full/module/path.sjs", "function": "customApplyDocumentContextInterceptor"}]

Example custom apply document context interceptor code:


function customApplyDocumentContextInterceptor(contentObject, actionDetails, targetEntity) {
  switch (actionDetails.action) {
    case "merge":
      contentObject.context.collections.push(`sm-${targetEntity}-merged-intercepted`);
      break;
    case "notify":
      contentObject.context.collections.push(`sm-${targetEntity}-notification-intercepted`);
      break;
    case "no-action":
      contentObject.context.collections.push(`sm-${targetEntity}-mastered-intercepted`);
      break;
    default:
  }
  return contentObject;
}