Mastering: Matching and Merging
Overview of mastering (matching and merging) in Data Hub.
About Mastering in MarkLogic Data Hub
Smart Mastering is a MarkLogic technology that searches for records in your data that refer to the same entity based on rules you define and then merges them depending on thresholds you specify.
In MarkLogic Data Hub, the mastering step involves two processes with their associated sets of rules:
- Matching determines if two records are candidates for merging, based on the degree of similarity between them and the weight of the comparison. Matching rules define the properties to compare, how to compare them, and what thresholds must be exceeded before taking action.
  - Match options define how the records are compared.
  - Match thresholds define the limits and the actions to take when those limits are exceeded.
- Merging handles the candidates accordingly, based on thresholds. Merging rules define how two or more records in the data would be merged together. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.
  - Merge options define how the properties of the candidate records are combined.
  - Merge strategies are sets of merge options that you can name and reuse.
  - Merge collections are sets of records that have the same collection tags.
Smart Mastering Hardening
Data Hub provides smart mastering hardening. This section provides guidance on three test scenarios:
- Test a list of URIs against each other only.
- Test a list of URIs against each other and against the database.
- Test all data (everything in the database against each other), with a limit of 100 displayed matches.
Before testing a configuration, decide what a successful match means for your data, so that you can tell whether matches are being made for the right reasons. The results table and the comparison table help you determine this.
Test with a small sample of documents (that is, fewer than 250).
Matching
You can create one or more rules to determine if two or more records match. Each rule compares the values of a single property in the candidate records. The comparison can be one of the following types:
- Exact. Determines if the values of the specified entity property in two or more records are exactly the same.
- Synonym. Determines if the values of the specified entity property in two or more records are synonyms, according to the specified thesaurus.
- Double Metaphone. Determines if the values of the specified entity property in two or more records sound similar, based on the Double Metaphone algorithm. For example, "Smith" might sound like "Schmidt".
- Zip. Determines if the zip/postal code in two or more records match.
- Reduce. Reduces the significance of certain matches. For example, even if the addresses and last names of two records match, the similarity might not necessarily indicate that the two records refer to the same person, because they might be two members of the same family.
- Custom. Runs a function in your custom module to compare the values of a specified entity property in two or more records.
Then, you can specify thresholds and what to do when a threshold is exceeded. For example, exceeding a threshold could:
- Trigger an automatic merge.
- Send a notification.
- Run a custom module.
Merging
In a merge, a new record is created, and the values from the original records can be combined and copied to the new record, according to the rules you specify. For example:
- You can restrict the number of unique values copied to the new record.
- You can restrict the number of data sources from which to copy values.
- You can specify that only records from specific datasets are allowed to be merged. And you can assign a weight to each source, so you can give priority to more reliable sources.
- You can also assign a weight to the length of a string.
If you use certain combinations of merge settings, you can save them as a strategy and refer to them by the strategy name.
You can also do your own merge using a custom module.
Merging is non-destructive. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.
Combined-Step versus Split-Step Mastering
Data Hub provides the ability to perform matching and merging either within a single step (mastering step) or with two steps (matching step and merging step).
In combined-step mastering, the records in the specified collection or query are compared with each other. If matches are found, a new merged record is created and the original records are archived. Both matching and merging are done in the same mastering step. Combined-step mastering must be run with exactly one thread.
In split-step mastering, matching and merging are done in two separate steps:
- The matching step compares the records tagged with the specified source collection. If matches are found, the step creates match summaries that list the URIs of matching records, based on the matching rules that you specify in the step.
- The merging step reads the match summaries and merges the specified records, based on the merging rules that you specify in the step.
In most cases, split-step mastering is ideal, because it can use multiple threads and avoid locks more effectively.
Guidelines for Split-Step Mastering
With split-step mastering, keep the following guidelines in mind:
- When creating the matching step, specify a collection tag to add so you can undo the matching, if needed. Replace the collection tag before each run.
- Use the same database for both the Source Database and the Target Database in both the matching step and the merging step.
- Run the matching step only once for the same dataset. The merging step will process all existing match summaries, even if they were generated by different matching step runs on different snapshots of the data.
- When running split-step mastering, allow the matching step to complete before starting the merging step.
- Run the merging step as soon as possible after the matching step.
  - If a matching record changes between the steps, it is merged anyway, even if it no longer meets the matching criteria.
  - If new records are ingested and mapped between the matching step and the merging step, they are not automatically added to the existing match summaries.
  If significant changes were made to the data (that is, modifications and newly mapped records) before the merging step is run, undo the previous matching, then run the matching step again.
- Before running a matching step or a merging step, check the list of existing match summaries to make sure that your merging step doesn't process out-of-date matches.
- If a record appears in multiple match summaries created in different matching step runs, the most recent match has precedence during merging.
Check for Previous Matching
To check if a matching step has been run, search for records tagged with a collection name that starts with datahubMasteringMatchSummary. The collection name might be followed by a hyphen and the name of your entity model.
- Go to Browse Data.
- Set the database to FINAL or the database you selected.
- In the facets on the left, select the datahubMasteringMatchSummary collection.
Undo Previous Matching
To undo the matching step run, delete the match summaries created by that matching step run.
- Go to the Query Console.
- Search for all records with the datahubMasteringMatchSummary collection tag.
- Delete or archive the records found.
Examples of Matching and Merging Components
Matching leverages MarkLogic search capabilities to find candidate documents based on search criteria or match rules that the user defines. The weight given to each matching ruleset determines the overall score of the candidate document. Thresholds define actions that are taken given a candidate document’s weight. The actions assigned to a threshold are "merge", "notify", or "custom".
Merging acts on the output of the matching step. Merging rules can be defined to determine the survivorship of Entity properties, giving control of what is carried forward into a merged document. Merging configuration can also adjust collections applied to the different categorizations of documents that are output by the merging step. These categorizations include archived documents, merged documents, and notification documents indicating near matches.
Different Formats for Configuration
There are two formats for configuring matching and two for configuring merging.
The first set of formats aligns with the match/merge JSON options defined in the now deprecated Smart Mastering Core library, because the library’s functionality was incorporated into Data Hub as of version 5.0. This format will continue to be supported throughout the Data Hub 5.x release cycle.
Matching
The matching configuration is stored as properties at the root of the match step document.
Rules
Matching rules are contained under the “matchRulesets” array with each match ruleset having a name, weight, and match rules. Every query provided by each match rule must be true against a candidate document for the weight to be added to the candidate’s score.
All of the information for the property and the match type is associated directly with the match rule. Each match rule defines the entity property path to the property being compared, the match type, and a set of options that are specific to the given match type.
"matchRulesets": [
  {
    "name": "name - Exact",
    "weight": 3.5,
    "matchRules": [
      {
        "entityPropertyPath": "name.lastName",
        "matchType": "exact",
        "options": {}
      }
    ]
  }
],
There are two ways to target values in a document for matching in a matching rule:
entityPropertyPath is a dot notation that indicates the location of a property in an entity instance by chaining together the property titles with a period.
documentXPath is an XPath to the location of a node in the document holding the entity instance. It can have a sibling namespaces property, which is an object in the form of { "prefix": "namespaceURI" }.
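To make the dot notation concrete, the following sketch resolves an entityPropertyPath against a plain JavaScript object. The helper name resolveEntityPropertyPath and the sample instance are hypothetical and are not part of Data Hub, which resolves paths internally against the entity model. Treat this only as an illustration of how the path reads.

```javascript
// Hypothetical helper (not a Data Hub API): walks an object along a
// dot-separated path such as "name.lastName".
function resolveEntityPropertyPath(instance, path) {
  return path.split(".").reduce(
    (node, key) => (node !== null && node !== undefined ? node[key] : undefined),
    instance
  );
}

// Sample entity instance, invented for illustration.
const customer = { name: { firstName: "Jane", lastName: "Smith" } };

console.log(resolveEntityPropertyPath(customer, "name.lastName")); // prints "Smith"
```

A path whose first segment is missing from the instance resolves to undefined rather than throwing.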
Matching on Exact Values
The most basic match rule takes the values from the input document for a given property and generates an element or JSON property value query with the property values as input.
The match rule for exact has the “matchType” value of “exact”. The exact match type does not currently take any options.
"matchRulesets": [
  {
    "name": "name - Exact",
    "weight": 3.5,
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "exact",
        "options": {}
      }
    ]
  }
],
Matching on Double Metaphone
The double metaphone match is driven by a dictionary document and leverages the spell.suggest function. This will expand the values passed into the property query.
The match rule for double metaphone has the “matchType” value of “doubleMetaphone”. The options contain “dictionaryURI”, the URI of the dictionary document, and “distanceThreshold”, the threshold for the double metaphone variance.
"matchRulesets": [
  {
    "name": "name - Double Metaphone",
    "weight": 3.5,
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "doubleMetaphone",
        "options": {
          "dictionaryURI": "/nameDictionary.json",
          "distanceThreshold": 100
        }
      }
    ]
  }
],
Matching on Synonym
The synonym match is driven by a thesaurus document and leverages the thsr.expand function. This will expand the values passed into the property query. A URI to a thesaurus is required and optional XML as text can be used to filter results. (It is recommended to have separate thesaurus documents for filtering where possible.)
The match rule for synonym has the “matchType” value of “synonym”. The options contain “thesaurusURI”, the URI of the thesaurus document, and “filter”, an optional field of quoted XML that can filter thesaurus entries.
"matchRulesets": [
  {
    "name": "name - Synonym",
    "weight": 3.5,
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "synonym",
        "options": {
          "thesaurusURI": "/thesauri/name-synonyms.xml",
          "filter": "<qualifier>english</qualifier>"
        }
      }
    ]
  }
],
Matching on Zip
Matching on zip will construct property queries such that 5-digit zip codes can match with 9-digit zip codes and vice versa.
The match rule for zip has the “matchType” value of “zip”.
"matchRulesets": [
  {
    "name": "name - Zip",
    "weight": 1.5,
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "zip",
        "options": {}
      }
    ]
  }
],
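The intent of the zip match type can be sketched as a plain comparison. The actual match type constructs property queries, so the function below is not how Data Hub implements it. The name zipCodesMatch is hypothetical, and the rules that two codes of equal length must agree fully and that only 5-digit and 9-digit values are valid are assumptions made for this example.

```javascript
// Hypothetical helper (not a Data Hub API): treats a 5-digit ZIP and a
// 9-digit ZIP+4 as a match when their first five digits agree.
// Assumption: two codes of the same length must be identical.
function zipCodesMatch(a, b) {
  // Strip hyphens, spaces, and any other non-digit characters.
  const digitsA = String(a).replace(/\D/g, "");
  const digitsB = String(b).replace(/\D/g, "");
  const valid = (digits) => digits.length === 5 || digits.length === 9;
  if (!valid(digitsA) || !valid(digitsB)) {
    return false;
  }
  if (digitsA.length === digitsB.length) {
    return digitsA === digitsB;
  }
  // One code has 5 digits and the other has 9: compare the 5-digit prefix.
  return digitsA.slice(0, 5) === digitsB.slice(0, 5);
}

console.log(zipCodesMatch("94070", "94070-1234")); // true
console.log(zipCodesMatch("94070-1234", "94070")); // true
console.log(zipCodesMatch("94070", "94071-1234")); // false
```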
Custom Matching
Custom matching allows a developer to provide their own function for constructing and returning cts queries. The module library can exist at any location in the modules database, but following the convention ‘/custom-modules/matching/<matchTypeName>.sjs’ is recommended.
The function signature of the custom match function must accept 3 arguments. The first argument is the property node or sequence of property nodes from the entity instance. The second argument is the matching rule from the match configuration. The third argument is the entire match configuration.
The match rule for custom matching has the “matchType” value of “custom”. “algorithmModulePath” defines the location of the module library. “algorithmModuleNamespace” defines the namespace of the module library, if using XQuery. “algorithmFunction” is the name of the function exported by the module library.
"matchRulesets": [
  {
    "name": "name - Custom",
    "weight": 1.5,
    "matchRules": [
      {
        "entityPropertyPath": "name",
        "matchType": "custom",
        "algorithmModuleNamespace": "",
        "algorithmModulePath": "/custom-modules/matching/nameMatch.sjs",
        "algorithmFunction": "nameMatch",
        "options": {}
      }
    ]
  }
],
Reduce Matching Score
There may be a situation where a match should give a negative weight to reduce the likelihood of a match.
The reduction of the match score is now associated with an entire match ruleset. Setting the property “reduce” to the Boolean true will cause the weight to be negative.
"matchRulesets": [
  {
    "name": "name - Reduce",
    "weight": 1.5,
    "reduce": true,
    "matchRules": [
      {
        "entityPropertyPath": "address",
        "matchType": "exact",
        "options": {}
      },
      {
        "entityPropertyPath": "lastName",
        "matchType": "exact",
        "options": {}
      }
    ]
  }
],
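The way ruleset weights combine into a candidate's score can be sketched in plain JavaScript. This is a simplified model based on the descriptions above: a ruleset contributes its weight only when every one of its rules matches, and a ruleset with “reduce” set to true contributes a negative weight. The function scoreCandidate and the ruleMatches callback are hypothetical; in Data Hub the scoring is driven by search queries rather than by a loop like this.

```javascript
// Simplified scoring model (not the Data Hub implementation).
// ruleMatches is a hypothetical callback that reports whether a single
// match rule holds for the candidate being scored.
function scoreCandidate(matchRulesets, ruleMatches) {
  let score = 0;
  for (const ruleset of matchRulesets) {
    // A ruleset contributes only when every one of its rules matches.
    if (ruleset.matchRules.every((rule) => ruleMatches(rule))) {
      score += ruleset.reduce ? -ruleset.weight : ruleset.weight;
    }
  }
  return score;
}

const matchRulesets = [
  {
    name: "name - Exact",
    weight: 3.5,
    matchRules: [{ entityPropertyPath: "name", matchType: "exact" }]
  },
  {
    name: "name - Reduce",
    weight: 1.5,
    reduce: true,
    matchRules: [
      { entityPropertyPath: "address", matchType: "exact" },
      { entityPropertyPath: "lastName", matchType: "exact" }
    ]
  }
];

// Pretend the candidate agrees with the input record on all three properties.
const matchedPaths = new Set(["name", "address", "lastName"]);
const score = scoreCandidate(matchRulesets, (rule) => matchedPaths.has(rule.entityPropertyPath));
console.log(score); // 2 (3.5 for the name match, minus 1.5 for the reduce ruleset)
```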
Thresholds and Actions
Thresholds are score limits that are associated with actions and they do not overlap. The actions will actually occur at the time the merge step is run. The actions provided out of the box are “notify” and “merge”. “notify” will create notification documents in the content database that define a match between documents which can be later reviewed. “merge” will compose a new entity instance based off of the group of entity instances that meet the threshold score.
Creating a Custom Action
The function signature of the custom action function must accept 3 arguments. The first argument is the URI string of the entity instance that triggered the match query. The second argument is the set of candidate entity instances that met the threshold score. The third argument is the entire merge configuration.
Example JavaScript signature:
householdAction(uri, matches, mergeConfiguration)
Example XQuery signature:
ns:household-action($uri as xs:string, $matches as node()*, $merge-options as node())
All action information is condensed into the thresholds. “thresholdName” defines the name for the threshold. “action” can have the values “notify”, “merge”, or “custom”. “score” is the minimum score required for a candidate entity instance to be placed into the threshold.
When using the “custom” action, “actionModulePath” defines the location of the module library. “actionModuleNamespace” defines the namespace of the module library, if using XQuery. “actionModuleFunction” is the name of the function exported by the module library.
"thresholds": [
  {
    "thresholdName": "similarThreshold",
    "action": "notify",
    "score": 6.5
  },
  {
    "thresholdName": "household",
    "action": "custom",
    "score": 8.5,
    "actionModulePath": "/custom-modules/matching/householdAction.xqy",
    "actionModuleNamespace": "http://marklogic.com/smart-mastering/action",
    "actionModuleFunction": "household-action"
  }
],
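Because thresholds do not overlap, a candidate's score falls into at most one of them. The sketch below picks the highest threshold whose score the candidate reaches. The function name selectThreshold is hypothetical, and treating a score exactly equal to the threshold as reaching it is an assumption of this example.

```javascript
// Hypothetical helper (not a Data Hub API): returns the highest threshold
// whose score the candidate reaches, or null if it reaches none.
// Assumption: a candidate score equal to the threshold score qualifies.
function selectThreshold(thresholds, candidateScore) {
  const reached = thresholds
    .filter((threshold) => candidateScore >= threshold.score)
    .sort((a, b) => b.score - a.score);
  return reached.length > 0 ? reached[0] : null;
}

const thresholds = [
  { thresholdName: "similarThreshold", action: "notify", score: 6.5 },
  { thresholdName: "household", action: "custom", score: 8.5 }
];

console.log(selectThreshold(thresholds, 7).action); // "notify"
console.log(selectThreshold(thresholds, 9).action); // "custom"
console.log(selectThreshold(thresholds, 5));        // null
```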
Merging
The merging configuration is stored as properties at the root of the merging step document.
Merge Rules
Merge rules define how entity instance properties or document nodes are selected for the composite entity instance in the new merged document. There is the standard merge type and the custom merge type.
The default sorting order for properties is the last updated DateTime in descending order.
There are two ways to target values in a document for merging via a merge rule:
entityPropertyPath is a dot notation that indicates the location of a property in an entity instance by chaining together the property titles with a period.
documentXPath is an XPath to the location of a node in the document holding the entity instance. It can have a sibling namespaces property, which is an object in the form of { "prefix": "namespaceURI" }.
Defining the Last Updated Datetime Location
The default sort order for the properties output by the standard merge is the last updated dateTime, if available in the merged documents. The location of a dateTime value in the document can be defined by providing XPath via configuration.
{
  "mergeStrategies": [],
  "mergeRules": [],
  "lastUpdatedLocation": {
    "namespaces": {
      "es": "http://marklogic.com/entity-services",
      "sm": "http://marklogic.com/smart-mastering"
    },
    "documentXPath": "/es:envelope/es:headers/sm:sources/sm:source/sm:dateTime"
  }
}
Standard Merge
The standard merge provides a way to sort the properties from the candidate entity instances by source or length and to select a maximum number of values or a maximum number of sources. Values and sources differ in the case of arrays. Setting a max source of 1 on an array property brings the entire array over from a single entity instance. Setting a max value of 1 on an array property brings over a single item from the array or arrays in the set of candidate entity instances.
{
  "mergeRules": [
    {
      "entityPropertyPath": "name",
      "maxSources": 1,
      "priorityOrder": {
        "lengthWeight": 2,
        "sources": [
          {
            "sourceName": "favoriteSource",
            "weight": 12
          },
          {
            "sourceName": "lessFavoriteSource",
            "weight": 10
          }
        ]
      }
    }
  ]
}
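The difference between limiting sources and limiting values on an array property can be sketched as follows. This is a simplified model, not the Data Hub implementation: the helper names applyMaxSources and applyMaxValues, the shape of the sample data, and the removal of duplicate values are assumptions. The input is assumed to be already sorted by priority, best source first.

```javascript
// Simplified model (not a Data Hub API). propertyDetails is assumed to be
// sorted by priority already, with the preferred source first.

// Keep every value from the first maxSources sources.
function applyMaxSources(propertyDetails, maxSources) {
  return propertyDetails.slice(0, maxSources).flatMap((detail) => detail.values);
}

// Pool the values from all sources, drop duplicates, and keep the first maxValues.
function applyMaxValues(propertyDetails, maxValues) {
  const unique = [...new Set(propertyDetails.flatMap((detail) => detail.values))];
  return unique.slice(0, maxValues);
}

// Sample array property gathered from two sources, invented for illustration.
const phoneDetails = [
  { source: "favoriteSource", values: ["555-0100", "555-0101"] },
  { source: "lessFavoriteSource", values: ["555-0199"] }
];

console.log(applyMaxSources(phoneDetails, 1)); // the whole array from favoriteSource
console.log(applyMaxValues(phoneDetails, 1));  // a single item: "555-0100"
```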
Custom Merging
Custom merging allows a developer to provide their own function for selecting the properties from the candidate entity instances. The module library can exist at any location in the modules database, but following the convention ‘/custom-modules/merging/<mergeTypeName>.sjs’ is recommended.
The function signature of the custom merge function must accept 3 arguments. The first argument is the property name. The second argument is an array (or sequence in the case of XQuery) of property details. The third argument is the merge rule.
A property detail object has three properties: name, sources, and values. The value for sources is itself an object, with keys name (an identifier) extracted from the source; dateTime (an xs:dateTime) extracted from the source, if available; and documentUri (identifying this particular source document). The values key connects to the property value or values from this particular source document. The name key is the name of the property.
{
  "mergeRules": [
    {
      "entityPropertyPath": "addressLocalName",
      "mergeModulePath": "/custom/merge/strategy.sjs",
      "mergeModuleFunction": "mergeAddress",
      "options": {}
    }
  ]
}
Example XQuery custom merge function:
declare function algorithm:custom-merge-limit(
  $property-name as xs:QName,
  $properties as map:map*,
  $merge-rule as object-node()
) as map:map*
{
  (: Default limit used when the merge rule does not set maxValues. :)
  let $default-limit := if ($merge-rule/entityPropertyPath = "Phone") then 5 else 10
  (: Keep the first N properties: start at position 1, length N. :)
  return fn:subsequence(
    $properties,
    1,
    fn:head(($merge-rule/maxValues, $default-limit))
  )
};
Example JavaScript custom merge function:
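function customMergeLimit(propertyName, properties, mergeRule)
{
  // Default limit used when the merge rule does not set maxValues.
  const defaultLimit = mergeRule.entityPropertyPath === "Phone" ? 5 : 10;
  // Keep the first N properties: start at position 1, length N.
  return fn.subsequence(
    properties,
    1,
    mergeRule.maxValues || defaultLimit
  );
}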
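A custom merge function could also choose values by recency. The sketch below uses no MarkLogic built-ins, so it can be run outside the server. It follows the property detail shape described above (name, sources, values) and keeps the most recently dated entries. The function name mergeMostRecent, the sample data, and the default limit of 1 are assumptions made for this example.

```javascript
// Plain-JavaScript sketch of a custom merge function (hypothetical, not
// shipped with Data Hub). It sorts property details by sources.dateTime,
// newest first, and keeps up to mergeRule.maxValues of them (default 1).
function mergeMostRecent(propertyName, properties, mergeRule) {
  const limit = mergeRule.maxValues || 1;
  // Entries without a parseable dateTime sort last.
  const time = (detail) => Date.parse(detail.sources.dateTime) || 0;
  return [...properties]
    .sort((a, b) => time(b) - time(a))
    .slice(0, limit);
}

// Sample property details, invented for illustration.
const details = [
  {
    name: "address",
    sources: { name: "CRM", dateTime: "2019-03-01T00:00:00Z", documentUri: "/crm/1.json" },
    values: "1 Old Road"
  },
  {
    name: "address",
    sources: { name: "ERP", dateTime: "2021-06-15T00:00:00Z", documentUri: "/erp/7.json" },
    values: "9 New Street"
  }
];

const merged = mergeMostRecent("address", details, { entityPropertyPath: "address", options: {} });
console.log(merged[0].values); // "9 New Street"
```

The input array is copied before sorting, so the caller's property details keep their original order.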