Mastering: Matching and Merging

Overview of mastering (matching and merging) in Data Hub.

About Mastering in MarkLogic Data Hub

Smart Mastering is a MarkLogic technology that searches for records in your data that refer to the same entity based on rules you define and then merges them depending on thresholds you specify.

In MarkLogic Data Hub, the mastering step involves two processes with their associated sets of rules:

  • Matching determines if two records are candidates for merging, based on the degree of similarity between them and the weight of the comparison. Matching rules define the properties to compare, how to compare them, and what thresholds must be exceeded before taking action.
    • Match options define how the records are compared.
    • Match thresholds define the limits and the actions to take when those limits are exceeded.
  • Merging handles the candidates accordingly, based on thresholds. Merging rules define how two or more records in the data would be merged together. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.
    • Merge options define how the properties of the candidate records are combined.
    • Merge strategies are sets of merge options that you can name and reuse.
    • Merge collections are sets of records that have the same collection tags.

Smart Mastering Hardening

Data Hub provides smart mastering hardening. This section gives guidance on three test scenarios:

  • Test a list of URIs against each other only.
  • Test a list of URIs against each other and against the database.
  • Test all data (every record in the database against every other record), with a display limit of 100 matches.

Before testing a configuration, decide what a successful match means for your data, so that you can tell whether matches are being made for the right reasons. The results table and the comparison table help you determine this.

Test with a small sample of documents (fewer than 250).

Matching

You can create one or more rules to determine if two or more records match. Each rule compares the values of a single property in the candidate records. The comparison can be one of the following types:

  • Exact. Determines if the values of the specified entity property in two or more records are exactly the same.
  • Synonym. Determines if the values of the specified entity property in two or more records are synonyms, according to the specified thesaurus.
  • Double Metaphone. Determines if the values of the specified entity property in two or more records sound similar, based on the Double Metaphone algorithm. For example, "Smith" might sound like "Schmidt".
  • Zip. Determines if the zip/postal code in two or more records match.
  • Reduce. Reduces the significance of certain matches. For example, even if the addresses and last names of two records match, the similarity might not necessarily indicate that the two records refer to the same person, because they might be two members of the same family.
  • Custom. Runs a function in your custom module to compare the values of a specified entity property in two or more records.

Then, you can specify thresholds and what to do when a threshold is exceeded. For example, exceeding a threshold could:

  • Trigger an automatic merge.
  • Send a notification.
  • Run a custom module.

Merging

In a merge, a new record is created, and the values from the original records can be combined and copied to the new record according to the rules you specify. For example:

  • You can restrict the number of unique values copied to the new record.
  • You can restrict the number of data sources from which to copy values.
  • You can specify that only records from specific datasets are allowed to be merged. And you can assign a weight to each source, so you can give priority to more reliable sources.
  • You can also assign a weight to the length of a string.

If you use certain combinations of merge settings, you can save them as a strategy and refer to them by the strategy name.

You can also do your own merge using a custom module.

Merging is non-destructive. A new record is created with the combined contents of the original records, according to the merging rules you create. The original records stay in the database and are tagged as archived.

Combined-Step versus Split-Step Mastering

Data Hub provides the ability to perform matching and merging either within a single step (mastering step) or with two steps (matching step and merging step).

In combined-step mastering, the records in the specified collection or query are compared with each other. If matches are found, a new merged record is created and the original records are archived. Both matching and merging are done in the same mastering step. Combined-step mastering must be run with exactly one thread.

In split-step mastering,
  • The matching step compares the records tagged with the specified source collection. If matches are found, the step creates match summaries that list the URIs of matching records, based on the matching rules that you specify in the step.
  • The merging step reads the match summaries and merges the specified records, based on the merging rules that you specify in the step.

In most cases, split-step mastering is ideal, because it can use multiple threads and avoid locks more effectively.

Guidelines for Split-Step Mastering

With split-step mastering, keep the following guidelines in mind:

  • When creating the matching step, specify a collection tag to add so you can undo the matching, if needed. Replace the collection tag before each run.
  • Use the same database for both the Source Database and the Target Database in both the matching step and the merging step.
  • Run the matching step only once for the same dataset. The merging step will process all existing match summaries, even if they were generated by different matching step runs on different snapshots of the data.
  • When running split-step mastering, allow the matching step to complete before starting the merging step.
  • Run the merging step as soon as possible after the matching step.
    • If a matching record changes between the steps, it will be merged anyway even if it no longer meets the matching criteria.
    • If new records are ingested and mapped between the matching step and the merging step, the newly ingested and mapped records are not automatically added to the existing match summaries.

    If significant changes were made to the data (i.e., modifications and newly mapped records) before the merging step is run, undo the previous matching then run the matching step again.

  • Before running a matching step or a merging step, check the list of existing match summaries to make sure that your merging step doesn't process out-of-date matches.
  • If a record appears in multiple match summaries created in different matching step runs, the most recent match has precedence during merging.

Check for Previous Matching

To check if a matching step has been run, search for records tagged with a collection whose name starts with datahubMasteringMatchSummary. The collection name might be followed by a hyphen and the name of your entity model.

  1. Go to Browse Data.
  2. Set the database to FINAL or the database you selected.
  3. In the facets on the left, select the datahubMasteringMatchSummary collection.

Undo Previous Matching

To undo the matching step run, delete the match summaries created by that matching step run.

  1. Go to the Query Console.
  2. Search for all records with the datahubMasteringMatchSummary collection tag.
  3. Delete or archive the records found.

Examples of Matching and Merging Components

Matching leverages MarkLogic search capabilities to find candidate documents based on search criteria or match rules that the user defines. The weight given to each matching ruleset determines the overall score of the candidate document. Thresholds define actions that are taken given a candidate document’s weight. The actions assigned to a threshold are "merge", "notify", or "custom".

Merging acts on the output of the matching step. Merging rules can be defined to determine the survivorship of Entity properties, giving control of what is carried forward into a merged document. Merging configuration can also adjust collections applied to the different categorizations of documents that are output by the merging step. These categorizations include archived documents, merged documents, and notification documents indicating near matches.

Different Formats for Configuration

There are two formats for configuring matching and two for configuring merging.

The first set of formats aligns with the match/merge JSON options defined in the now deprecated Smart Mastering Core library, whose functionality was incorporated into Data Hub as of version 5.0. This format will continue to be supported throughout the Data Hub 5.x release cycle.

As of Data Hub 5.4, a new set of match/merge formats has been created that opens up new capabilities and eliminates redundant configuration already completed by modeling with Entity Services. These new formats must be associated with an entity type and are required for compatibility with the Hub Central user interface, but they are not compatible with Data Hub QuickStart.

See the documentation on Hub Central Migration for more details on migrating matching and merging steps to Hub Central.

Matching

The match step generated in QuickStart stores the matching configuration in the “matchOptions” property of the flow step options.

In Hub Central, the matching configuration is stored as properties at the root of the match step document.

Rules

Using QuickStart, matching rules are stored in the “scoring” property, with the child properties “add”, “expand”, and “reduce” separating the different types of matching rules. Information about the properties that are matched on is stored in the “propertyDefs” property.

Using Hub Central, matching rules are contained under the “matchRulesets” array with each match ruleset having a name, weight, and match rules. Every query provided by each match rule must be true against a candidate document for the weight to be added to the candidate’s score.

Unlike the QuickStart format, all of the information for the property and the match type is associated directly with the match rule. Each match rule defines the entity property path to the property being matched, the match type, and a set of options that are specific to the given match type.

{
  "name": "last name - Exact",
  "weight": 3.5,
  "matchRules": [
    {
      "entityPropertyPath": "name.lastName",
      "matchType": "exact",
      "options": {}
    }
  ]
}
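
The ruleset semantics described above can be pictured with a small plain-JavaScript model. This is an illustration only, not Data Hub's implementation: in Data Hub each rule produces a query that runs in the database, whereas here each rule is reduced to a boolean. The names scoreCandidate and ruleMatches, and the second ruleset, are hypothetical.

```javascript
// Illustrative model only: Data Hub evaluates match rules as database
// queries. Here each rule is reduced to a hypothetical boolean predicate.
function scoreCandidate(matchRulesets, ruleMatches) {
  let score = 0;
  for (const ruleset of matchRulesets) {
    // A ruleset contributes its weight only if every one of its rules matches.
    if (ruleset.matchRules.every((rule) => ruleMatches(rule))) {
      score += ruleset.weight;
    }
  }
  return score;
}

const matchRulesets = [
  {
    name: "last name - Exact",
    weight: 3.5,
    matchRules: [{ entityPropertyPath: "name.lastName", matchType: "exact", options: {} }]
  },
  {
    // Hypothetical two-rule ruleset, added to show the "all rules" requirement.
    name: "first name and zip - Exact",
    weight: 2,
    matchRules: [
      { entityPropertyPath: "name.firstName", matchType: "exact", options: {} },
      { entityPropertyPath: "zip", matchType: "exact", options: {} }
    ]
  }
];

// Assume the candidate agrees on last name and zip, but not on first name.
const matchedPaths = new Set(["name.lastName", "zip"]);
const score = scoreCandidate(matchRulesets, (rule) => matchedPaths.has(rule.entityPropertyPath));
console.log(score); // 3.5: the second ruleset is only partially satisfied
```

Because the second ruleset requires both of its rules to match, a candidate that agrees only on zip gains nothing from it.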

There are 2 ways to target values in a document for matching in a matching rule.

entityPropertyPath is a dot notation that indicates the location of a property in an entity instance by chaining together the property titles with a period.

documentXPath is an XPath expression for the location of a node in the document holding the entity instance. It can have a sibling namespaces property, which is an object in the form of { "prefix": "namespaceURI" }.
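
To make the dot notation concrete, the following sketch walks an entityPropertyPath through a plain JavaScript object. It only illustrates the notation: the helper name resolveEntityPropertyPath and the sample customer object are hypothetical, and Data Hub resolves paths against the entity model and the stored document rather than an in-memory object.

```javascript
// Hypothetical helper: walks a dot-notation path through a plain object.
// Arrays are flattened so that one path can yield several values.
function resolveEntityPropertyPath(instance, path) {
  let current = [instance];
  for (const title of path.split(".")) {
    current = current
      .flatMap((node) => (Array.isArray(node) ? node : [node]))
      .filter((node) => node !== null && typeof node === "object" && title in node)
      .map((node) => node[title]);
  }
  return current.flatMap((value) => (Array.isArray(value) ? value : [value]));
}

// Made-up entity instance for illustration.
const customer = {
  name: { firstName: "Ada", lastName: "Smith" },
  addresses: [{ zip: "12345" }, { zip: "12345-6789" }]
};

console.log(resolveEntityPropertyPath(customer, "name.lastName")); // [ 'Smith' ]
console.log(resolveEntityPropertyPath(customer, "addresses.zip")); // [ '12345', '12345-6789' ]
```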

Matching on Exact Values

The most basic match rule takes the values from the input document for a given property and generates an element or JSON property value query with the property values as input.

Using QuickStart, exact matching rules are stored under the “add” property of “scoring”.

{
  "propertyDefs": {
    "property": [
      { "namespace": "", "localname": "lastName", "name": "lastName" }
    ]
  },
  "scoring": {
    "add": [
      { "propertyName": "lastName", "weight": "3.5" }
    ]
  }
  ...
}

Using Hub Central, the match rule for exact has the “matchType” value of “exact”. The exact match type does not currently take any options.

{
  ...
  "matchRulesets": [
    {
      "name": "name - Exact",
      "weight": 3.5,
      "matchRules": [
        {
          "entityPropertyPath": "name",
          "matchType": "exact",
          "options": {}
        }
      ]
    }
  ]
}

Matching on Double Metaphone

The double metaphone match is driven by a dictionary document and leverages the spell.suggest function. This will expand the values passed into the property query.

Using QuickStart, the match rule is stored in the “expand” property of “scoring”. The “algorithmRef” is set to “double-metaphone”, “dictionary” holds the URI of the dictionary used to suggest values for the property query, and “distanceThreshold” is the threshold for the double metaphone variance.

{
    "propertyDefs": {
      "property": [
        { "namespace": "", "localname": "name", "name": "name" }
      ]
    },
    "scoring": {
      "expand": [
        {
          "propertyName": "name",
          "algorithmRef": "double-metaphone",
          "weight": "2.5",
          "dictionary": "/nameDictionary.json",
          "distanceThreshold": "100"
        }
      ]
   }
   ...
}

Using Hub Central, the match rule for double metaphone has the “matchType” value of “doubleMetaphone”. The options contain “dictionaryURI”, which holds the URI of the dictionary document, and “distanceThreshold”, which is the threshold for the double metaphone variance.

{
  ...
  "matchRulesets": [
    {
      "name": "name - Double Metaphone",
      "weight": 2.5,
      "matchRules": [
        {
          "entityPropertyPath": "name",
          "matchType": "doubleMetaphone",
          "options": {
            "dictionaryURI": "/nameDictionary.json",
            "distanceThreshold": 100
          }
        }
      ]
    }
  ]
}

Matching on Synonym

The synonym match is driven by a thesaurus document and leverages the thsr.expand function. This will expand the values passed into the property query. A URI to a thesaurus is required and optional XML as text can be used to filter results. (It is recommended to have separate thesaurus documents for filtering where possible.)

Using QuickStart, the match rule is stored in the “expand” property of “scoring”. The “algorithmRef” is set as “thesaurus”, “thesaurus” has the URI for the thesaurus used to expand values for the property query, and “filter” is optional quoted XML that can filter thesaurus entries.

{
    "propertyDefs": {
      "property": [
        { "namespace": "", "localname": "name", "name": "name" }
      ]
    },
    "scoring": {
      "expand": [
        {
          "propertyName": "name",
          "algorithmRef": "thesaurus",
          "weight": "2.5",
          "thesaurus": "/thesauri/name-synonyms.xml", 
          "filter": "<qualifier>english</qualifier>"
        }
      ]
   }
   ...
}

Using Hub Central, the match rule for synonym has the “matchType” value of “synonym”. The options contain “thesaurusURI”, which holds the URI of the thesaurus document, and “filter”, an optional field of quoted XML that can filter thesaurus entries.

{
 
  ...
  "matchRulesets": [
    {
      "name": "name - Synonym",
      "weight": 2.5,   
      "matchRules": [
        {
          "entityPropertyPath": "name",
          "matchType": "synonym",
          "options": {
            "thesaurusURI": "/thesauri/name-synonyms.xml",
            "filter": "<qualifier>english</qualifier>"
          }
        }
      ]
    }
  ]
}
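
The idea of expanding a value before it is passed into the property query can be sketched with an in-memory stand-in for the thesaurus. This is only a conceptual model: Data Hub expands values through the thesaurus document named in the configuration, and the Map, the sample entries, and the function name expandWithSynonyms below are all hypothetical.

```javascript
// Hypothetical in-memory thesaurus: term -> list of synonyms.
const thesaurus = new Map([
  ["robert", ["bob", "rob", "bobby"]],
  ["william", ["bill", "will"]]
]);

// Returns the original value plus any synonyms, lower-cased and de-duplicated.
// A match query would then look for any of the returned values.
function expandWithSynonyms(value, thesaurusEntries) {
  const term = value.toLowerCase();
  const synonyms = thesaurusEntries.get(term) || [];
  return [...new Set([term, ...synonyms])];
}

console.log(expandWithSynonyms("Robert", thesaurus)); // [ 'robert', 'bob', 'rob', 'bobby' ]
console.log(expandWithSynonyms("Alice", thesaurus));  // [ 'alice' ]
```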
 

Matching on Zip

Matching on zip will construct property queries such that 5-digit zip codes can match with 9-digit zip codes and vice versa.

Note: To match 5-digit zip codes against 9-digit zip codes, you need to enable the three character searches and trailing wildcard searches indexes. See "Understanding and Using Wildcard Searches" in the Search Developer's Guide for more information.
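
The intended 5-digit versus 9-digit behavior can be pictured with a small in-memory comparison. This is a sketch under stated assumptions, not how Data Hub does it: Data Hub constructs property queries (which is why the wildcard indexes are needed), while the hypothetical functions normalizeZip and zipsMatch below simply compare strings and assume US-style codes.

```javascript
// Strips non-digits and accepts only 5-digit or 9-digit codes.
function normalizeZip(zip) {
  const digits = String(zip).replace(/\D/g, "");
  return digits.length === 5 || digits.length === 9 ? digits : null;
}

// Two codes of equal length must be identical; a 5-digit code matches a
// 9-digit code when it equals the 9-digit code's first five digits.
function zipsMatch(a, b) {
  const left = normalizeZip(a);
  const right = normalizeZip(b);
  if (left === null || right === null) return false;
  if (left.length === right.length) return left === right;
  return left.slice(0, 5) === right.slice(0, 5);
}

console.log(zipsMatch("12345", "12345-6789"));      // true
console.log(zipsMatch("12345-6789", "12345"));      // true
console.log(zipsMatch("12345-6789", "12345-0000")); // false
```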

Using QuickStart, the match rule is stored in the “expand” property of “scoring”. The “algorithmRef” is set to “zip-match”, and “zip” is an array of objects, each with an “origin” property of “5” or “9” and a “weight” property. (It is now recommended that the weights be the same.)

{
  "propertyDefs": {
    "property": [
      { "namespace": "", "localname": "LocationPostalCode", "name": "zip" }
    ]
  },
  "scoring": {
    "expand": [
      {
        "propertyName": "zip",
        "algorithmRef": "zip-match",
        "zip": [
          { "origin": "5", "weight": "1.5" },
          { "origin": "9", "weight": "1" }
        ]
      }
    ]
  }
}

Using Hub Central, the match rule for zip has the “matchType” value of “zip”.

Note: It has been determined that multiple weights aren't necessary for this match query and the weight is assigned to the entire ruleset.

{
  ...
  "matchRulesets": [
    {
      "name": "zip - Zip",
      "weight": 1.5,
      "matchRules": [
        {
          "entityPropertyPath": "zip",
          "matchType": "zip",
          "options": {}
        }
      ]
    }
  ]
}

Custom Matching

Custom matching allows a developer to provide their own function for constructing and returning cts queries. The module library can exist at any location in the modules database, but following the convention ‘/custom-modules/matching/<matchTypeName>.sjs’ is recommended.

The function signature of the custom match function must accept 3 arguments. The first argument is the property node or sequence of property nodes from the entity instance. The second argument is the matching rule from the match configuration. The third argument is the entire match configuration. The structure of the second and third arguments will look different depending on whether the QuickStart or Hub Central format is used.

Note: The legacy QuickStart format is converted to an XML equivalent when the custom function is defined in XQuery. This behavior is maintained for backward compatibility. Once converted to the Hub Central format, the rule and options are no longer converted to XML before being passed to the XQuery function. Each is passed as an object-node() in the same structure stored in the project and database. JavaScript functions continue to receive the rule and options as JSON objects, but in the Hub Central format.

function nameMatch(propertyNodes, matchRule, matchConfiguration)

Using QuickStart, custom functions are defined in the “algorithms” section of the match options and linked to a match rule with “algorithmRef” property.

{
    "propertyDefs": {
      "property": [
        { "namespace": "", "localname": "name", "name": "name" }
      ]
    },   
    "algorithms": {
      "algorithm": [
        {
          "name": "custom-name-match",
          "function": "nameMatch",
          "at": "/custom-modules/matching/nameMatch.sjs",
          "namespace": ""
        }
      ]
    },
    "scoring": {
       "expand": [
        {
          "propertyName": "name",
          "algorithmRef": "custom-name-match",
          "weight": "2.5"
        }
      ]
   }
 
   ...
}

Using Hub Central, the custom match rule has the “matchType” value of “custom”. “algorithmModulePath” defines the location of the module library. “algorithmModuleNamespace” defines the namespace of the module library, if using XQuery. “algorithmFunction” is the name of the function exported by the module library.

{
  ...
  "matchRulesets": [
    {
      "name": "name - Custom",
      "weight": 2.5,
      "matchRules": [
        {
          "entityPropertyPath": "name",
          "matchType": "custom",
          "algorithmModuleNamespace": "",
          "algorithmModulePath": "/custom-modules/matching/nameMatch.sjs",
          "algorithmFunction": "nameMatch",
          "options": {}
        }
      ]
    }
  ]
}

Conversion to Hub Central format

When converting from the QuickStart format, custom functions may need to change how they access values from the match rule or the match configuration. In the case of XQuery, a strongly typed function signature also needs to be updated.

Example XQuery custom match before conversion:

QuickStart Example:

declare function algorithm:match-via-tde-row(
  $values as item()*,
  $expand-rule as element(match:expand),
  $match-configuration as element(match:options)
) as cts:query*
{
  let $property-name := $expand-rule/@property-name
  let $entity-type := $match-configuration/match:target-entity
  let $property-column := sem:iri("http://marklogic.com/column/" || fn:replace($entity-type, "^.*/([^/]+/[^/]+)$", "$1") || "/" || $property-name)
  return
    cts:triple-query((), $property-column, $values)
};

Example XQuery custom match after conversion:

Hub Central Example:

declare function algorithm:match-via-tde-row(
  $values as item()*,
  $match-rule as object-node(),
  $match-step as object-node()
) as cts:query*
{
  let $property-name := $match-rule/entityPropertyPath
  let $entity-type := $match-step/targetEntityType
  let $property-column := sem:iri("http://marklogic.com/column/" || fn:replace($entity-type, "^.*/([^/]+/[^/]+)$", "$1") || "/" || $property-name)
  return
    cts:triple-query((), $property-column, $values)
};
 

Example JavaScript custom match before conversion:

QuickStart Example:

function matchViaTdeRow(values, expandRule, matchConfiguration)
{
  let propertyName = expandRule.propertyName;
  let entityType = matchConfiguration.targetEntity;
  let propertyColumn = sem.iri("http://marklogic.com/column/" + fn.replace(entityType, "^.*/([^/]+/[^/]+)$", "$1") + "/" + propertyName)
  return cts.tripleQuery(null, propertyColumn, values);
};

Example JavaScript custom match after conversion:

Hub Central Example:

function matchViaTdeRow(values, matchRule, matchStep)
{
  // In the Hub Central format, the property is identified by entityPropertyPath
  // and the entity type by targetEntityType on the step.
  let propertyName = matchRule.entityPropertyPath;
  let entityType = matchStep.targetEntityType;
  let propertyColumn = sem.iri("http://marklogic.com/column/" + fn.replace(entityType, "^.*/([^/]+/[^/]+)$", "$1") + "/" + propertyName);
  return cts.tripleQuery(null, propertyColumn, values);
}
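
While a project is being migrated, a custom function might receive either format. One possible approach, offered as a sketch and not as a documented Data Hub pattern, is to read the needed fields defensively. The helper name readMatchContext and the entity type IRI are hypothetical; the property names are the ones used in the configuration and code examples in this section.

```javascript
// Hypothetical helper: reads the fields a custom match function needs from
// either a QuickStart-format or a Hub Central-format rule and configuration.
function readMatchContext(matchRule, matchConfiguration) {
  return {
    // Hub Central rules use entityPropertyPath; QuickStart rules use propertyName.
    propertyName: matchRule.entityPropertyPath || matchRule.propertyName,
    // Hub Central steps use targetEntityType; QuickStart options use targetEntity.
    entityType: matchConfiguration.targetEntityType || matchConfiguration.targetEntity
  };
}

// Made-up entity type IRI, used only for this example.
const entityIri = "http://example.org/Customer-0.0.1/Customer";

const fromQuickStart = readMatchContext(
  { propertyName: "name", algorithmRef: "custom-name-match", weight: "2.5" },
  { targetEntity: entityIri }
);
const fromHubCentral = readMatchContext(
  { entityPropertyPath: "name", matchType: "custom", options: {} },
  { targetEntityType: entityIri }
);

console.log(fromQuickStart.propertyName === fromHubCentral.propertyName); // true
console.log(fromQuickStart.entityType === fromHubCentral.entityType);     // true
```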

Reduce Matching Score

There may be a situation where a match should give a negative weight to reduce the likelihood of a match. The method of doing this between the QuickStart format and the Hub Central format differs due to the introduction of rulesets.

Using QuickStart, the reducing match rule is stored under “reduce” in the “scoring” property. The “algorithmRef” is set to “standard-reduction”. “allMatch” is a JSON object whose child “property” is an array of strings listing the entity instance properties that must all match exactly for the score to be reduced.

{
  "propertyDefs": {
    "property": [
      { "namespace": "", "localname": "lastName", "name": "lastName" },
      { "namespace": "", "localname": "address", "name": "address" }
    ]
  },
  "scoring": {
    "reduce": [
      {
        "allMatch": {
          "property": [ "address", "lastName" ]
        },
        "algorithmRef": "standard-reduction",
        "weight": "5"
      }
    ]
  }
  ...
}

Using Hub Central, the reduction of the match score is now associated with an entire match ruleset. Setting the property “reduce” to the Boolean true will cause the weight to be negative.

{
  ...
  "matchRulesets": [
    {
      "name": "address,lastName - Reduce",
      "weight": 5,
      "reduce": true,
      "matchRules": [
        {
          "entityPropertyPath": "address",
          "matchType": "exact",
          "options": {}
        },
        {
          "entityPropertyPath": "lastName",
          "matchType": "exact",
          "options": {}
        }
      ]
    }
  ]
}
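
The effect of “reduce” on a candidate's score can be modeled in a few lines. This is an illustrative calculation only: the function name rulesetContribution and the first ruleset with its weight of 10 are hypothetical, and the actual scoring happens inside Data Hub.

```javascript
// Illustrative model: a ruleset with "reduce": true subtracts its weight
// from the score instead of adding it.
function rulesetContribution(ruleset, allRulesMatch) {
  if (!allRulesMatch) return 0;
  return ruleset.reduce === true ? -ruleset.weight : ruleset.weight;
}

const lastNameExact = { name: "lastName - Exact", weight: 10 }; // hypothetical ruleset
const sameHousehold = { name: "address,lastName - Reduce", weight: 5, reduce: true };

// Both rulesets are satisfied: 10 is added and 5 is subtracted.
const score = rulesetContribution(lastNameExact, true) + rulesetContribution(sameHousehold, true);
console.log(score); // 5
```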

Thresholds and Actions

Thresholds are score limits that are associated with actions, and they do not overlap. The actions occur when the merge step is run. The actions provided out of the box are “notify” and “merge”. “notify” creates notification documents in the content database that record a match between documents, so the match can be reviewed later. “merge” composes a new entity instance from the group of entity instances that meet the threshold score.

Creating a Custom Action

The function signature of the custom action function must accept 3 arguments. The first argument is the URI string of the entity instance that triggered the match query. The second argument is the set of candidate entity instances that met the threshold score. The third argument is the entire merge configuration. The structure of the third argument will look different depending on whether the QuickStart or Hub Central format is used.

Note: The legacy QuickStart format is converted to an XML equivalent when the custom function is defined in XQuery. This behavior is maintained for backward compatibility. Once converted to the Hub Central format, the options are no longer converted to XML before being passed to the XQuery function. They are passed as an object-node() in the same structure stored in the project and database. JavaScript functions continue to receive the options as a JSON object, but in the Hub Central format.

Example JavaScript signature:

householdAction(uri, matches, mergeConfiguration)

Example XQuery signature:

ns:household-action($uri as xs:string, $matches as node()*, $merge-options as node())

Using QuickStart, custom actions are defined under the “actions” section of the match options and the threshold references the defined action by name in the “action” property.

{
  ...
  "actions": {
    "action": [
      {
        "name": "household-action",
        "function": "household-action",
        "namespace": "http://marklogic.com/smart-mastering/action",
        "at": "/custom-modules/matching/custom-action.xqy"
      }
    ]
  },
  "thresholds": {
    "threshold": [
      {
        "above": "6.5",
        "label": "similarThreshold",
        "action": "notify"
      },
      {
        "above": "8.5",
        "label": "household",
        "action": "household-action"
      },
      {
        "above": "12",
        "label": "sameThreshold",
        "action": "merge"
      }
    ]
  }
}

Using Hub Central, all action information is condensed into the thresholds. “thresholdName” defines the name for the threshold. “action” can have the values “notify”, “merge”, or “custom”. “score” is the bottom score required for a candidate entity instance to be placed into the threshold.

When using the “custom” action, “actionModulePath” defines the location of the module library. “actionModuleNamespace” defines the namespace of the module library, if using XQuery. “actionModuleFunction” is the name of the function exported by the module library.

{
    "thresholds": [
        {
            "thresholdName": "similarThreshold",
            "action": "notify",
            "score": 6.5
        },
        {
            "thresholdName": "household",
            "action": "custom",
            "score": 8.5,
            "actionModulePath": "/custom-modules/matching/householdAction.xqy",
            "actionModuleNamespace": "http://marklogic.com/smart-mastering/action",
            "actionModuleFunction": "household-action"
        },
        {
            "thresholdName": "sameThreshold",
            "action": "merge",
            "score": 12
        }
    ]
}
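
How a score maps to a threshold can be sketched as follows. This is a model under stated assumptions: the function name selectThreshold is hypothetical, and treating the boundary as inclusive (a score equal to the threshold's “score” qualifies) is an assumption that this page does not confirm.

```javascript
const thresholds = [
  { thresholdName: "similarThreshold", action: "notify", score: 6.5 },
  { thresholdName: "household", action: "custom", score: 8.5 },
  { thresholdName: "sameThreshold", action: "merge", score: 12 }
];

// Picks the highest threshold whose score the candidate reaches.
// Assumption: the boundary is inclusive (candidateScore >= threshold.score).
function selectThreshold(allThresholds, candidateScore) {
  const reached = allThresholds
    .filter((t) => candidateScore >= t.score)
    .sort((a, b) => b.score - a.score);
  return reached.length > 0 ? reached[0] : null;
}

console.log(selectThreshold(thresholds, 9).thresholdName); // household
console.log(selectThreshold(thresholds, 13).action);       // merge
console.log(selectThreshold(thresholds, 3));               // null
```

Because each candidate lands in exactly one threshold, only one action applies to it, which is consistent with thresholds not overlapping.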

Merging

The merging step generated in QuickStart stores the merging configuration in the “mergeOptions” property of the flow step options.

In Hub Central, the merging configuration is stored as properties at the root of the merging step document.

Merge Rules

Merge rules define how an entity instance's properties or document nodes are selected for the composite entity instance in the new merged document. There are two merge types: standard and custom.

The default sorting order for properties is the last updated DateTime in descending order.

Using QuickStart, the merging property references a properties definition that can define a path or element.

Using Hub Central, there are 2 ways to target values in a document for merging via a merge rule.

entityPropertyPath is a dot notation that indicates the location of a property in an entity instance by chaining together the property titles with a period.

documentXPath is an XPath expression for the location of a node in the document holding the entity instance. It can have a sibling namespaces property, which is an object in the form of { "prefix": "namespaceURI" }.

Defining the Last Updated Datetime Location

The default sort order for the properties output by the standard merge is the last updated dateTime, if available in the merged documents. The location of a dateTime value in the document can be defined by providing XPath via configuration.

QuickStart Example:

{
 
  "algorithms": {
    "stdAlgorithm": {
      "namespaces": {
        "sm": "http://marklogic.com/smart-mastering",
        "es": "http://marklogic.com/entity-services"
      },
      "timestamp": {
        "path": "/es:envelope/es:headers/sm:sources/sm:source/sm:dateTime"
      }
    }
  }
}

Hub Central Example:

{
  "mergeStrategies": [],
  "mergeRules": [],
  "lastUpdatedLocation": {
    "namespaces": {
      "es": "http://marklogic.com/entity-services",
      "sm": "http://marklogic.com/smart-mastering"
    },
    "documentXPath": "/es:envelope/es:headers/sm:sources/sm:source/sm:dateTime"
  }
}

Standard Merge

The standard merge provides a way to sort the properties from the candidate entity instances by source or length and to select a maximum number of values or a maximum number of sources. Values and sources differ in the case of arrays. Setting a max source of 1 on an array property brings the entire array over from one entity instance. Setting a max value of 1 on an array property brings over a single item from the array or arrays in the set of candidate entity instances.

QuickStart Example:

{
  "merging": [
    {
      "propertyName": "name",
      "algorithmRef": "standard",
      "length": {
        "weight": "2"
      },
      "name": "myFavoriteSource",
      "maxSources": 1,
      "sourceWeights": [
        {
          "source": {
            "name": "favoriteSource",
            "weight": "12"
          }
        },
        {
          "source": {
            "name": "lessFavoriteSource",
            "weight": "10"
          }
        }
      ]
    }
  ],
  "propertyDefs": {
    "properties": [
      {
        "localname": "name",
        "name": "name"
      }
    ]
  }
}
 

Hub Central Example:

{
  "mergeRules": [
    {
      "entityPropertyPath": "name",
      "maxSources": 1,
      "priorityOrder": {
        "lengthWeight": 2,
        "sources": [
          {
            "sourceName": "favoriteSource",
            "weight": 12
          },
          {
            "sourceName": "lessFavoriteSource",
            "weight": 10
          }
        ]
      }
    }
  ]
}
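
Source prioritization in the Hub Central rule above can be pictured with a deliberately simplified model. It is not the standard merge algorithm: length weighting and last-updated ordering are left out, and the function name pickBySourcePriority and the shape of the candidates array are hypothetical.

```javascript
// Simplified model: order candidate values by the weight of their source
// (highest first), then keep values from at most maxSources distinct sources.
// Length weighting and last-updated ordering are deliberately omitted.
function pickBySourcePriority(candidates, mergeRule) {
  const weights = new Map(
    mergeRule.priorityOrder.sources.map((s) => [s.sourceName, s.weight])
  );
  const ranked = [...candidates].sort(
    (a, b) => (weights.get(b.source) || 0) - (weights.get(a.source) || 0)
  );
  const limit = mergeRule.maxSources === undefined ? Infinity : mergeRule.maxSources;
  const allowedSources = [...new Set(ranked.map((c) => c.source))].slice(0, limit);
  return ranked.filter((c) => allowedSources.includes(c.source)).map((c) => c.value);
}

const mergeRule = {
  entityPropertyPath: "name",
  maxSources: 1,
  priorityOrder: {
    lengthWeight: 2,
    sources: [
      { sourceName: "favoriteSource", weight: 12 },
      { sourceName: "lessFavoriteSource", weight: 10 }
    ]
  }
};

// Made-up candidate values, one per source.
const candidates = [
  { source: "lessFavoriteSource", value: "Bob Smith" },
  { source: "favoriteSource", value: "Robert Smith" }
];

console.log(pickBySourcePriority(candidates, mergeRule)); // [ 'Robert Smith' ]
```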
 

Custom Merging

Custom merging allows a developer to provide their own function for selecting the properties from the candidate entity instances. The module library can exist at any location in the modules database, but following the convention ‘/custom-modules/merging/<mergeTypeName>.sjs’ is recommended.

The function signature of the custom merge function must accept 3 arguments. The first argument is the property name. The second argument is an array (or sequence in the case of XQuery) of property details. The third argument is the merge rule. The structure of the third argument will look different depending on whether the QuickStart or Hub Central format is used.

Note: The legacy QuickStart format is converted to an XML equivalent when the custom function is defined in XQuery. This behavior is maintained for backward compatibility. Once converted to the Hub Central format, the rule is no longer converted to XML before being passed to the XQuery function. The rule is passed as an object-node() in the same structure stored in the project and database. JavaScript functions continue to receive the rule as a JSON object, but in the Hub Central format.

function nameMerge(propertyName, propertyDetails, mergeRule)

A property detail object has three properties: name, sources, and values. The value for sources is itself an object, with keys name (an identifier) extracted from the source; dateTime (an xs:dateTime) extracted from the source, if available; and documentUri (identifying this particular source document). The values key connects to the property value or values from this particular source document. The name key is the name of the property.
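
As an illustration of the property detail shape described above, the following sketch builds two sample details and applies a hypothetical custom merge that keeps only the most recently updated one. The data values, the function name mergeMostRecent, and the use of ISO strings in place of xs:dateTime values are assumptions made so that the example runs outside MarkLogic.

```javascript
// Shape taken from the description above; all values are made up.
const propertyDetails = [
  {
    name: "address",
    sources: { name: "crm", dateTime: "2021-03-01T10:00:00Z", documentUri: "/customers/1.json" },
    values: "1 Main St"
  },
  {
    name: "address",
    sources: { name: "billing", dateTime: "2022-06-15T08:30:00Z", documentUri: "/customers/2.json" },
    values: "22 Oak Ave"
  }
];

// Hypothetical custom merge: keep only the most recently updated detail.
// Details without a dateTime are given timestamp 0, which sorts them after
// any detail dated later than 1970.
function mergeMostRecent(propertyName, details, mergeRule) {
  const time = (d) => (d.sources && d.sources.dateTime ? Date.parse(d.sources.dateTime) : 0);
  const sorted = [...details].sort((a, b) => time(b) - time(a));
  return sorted.slice(0, 1);
}

const merged = mergeMostRecent("address", propertyDetails, {});
console.log(merged[0].values);              // 22 Oak Ave
console.log(merged[0].sources.documentUri); // /customers/2.json
```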

QuickStart Example:

{
  "propertyDefs": {
    "properties": [
      {
        "localname": "addressLocalName",
        "name": "addressName"
      }
    ]
  },
  "algorithms": {
    "custom": [
      {
        "name": "addressAlgorithm",
        "function": "mergeAddress",
        "at": "/custom/merge/strategy.sjs"
      }
    ]
  },
  "merging": [
    {
      "propertyName": "addressName",
      "algorithmRef": "addressAlgorithm"
    }
  ]
}

Hub Central Example:

{
  "mergeRules": [
    {
      "entityPropertyPath": "addressLocalName",
      "mergeModulePath": "/custom/merge/strategy.sjs",
      "mergeModuleFunction": "mergeAddress",
      "options": {}
    }
  ]
}

Conversion to Hub Central format

When converting from the QuickStart format, custom functions may need to change how they access values from the merge rule and in the case of XQuery a strongly typed function signature will need to be updated.

Example XQuery custom merge before conversion:

QuickStart Example:

declare function algorithm:custom-merge-limit(
  $property-name as xs:QName,
  $properties as map:map*,
  $merge-rule as element(merging:merge)
) as map:map*
{
  let $default-limit := if ($merge-rule/@property-name = "Phone") then 5 else 10
  (: Keep at most the configured number of values, starting at the first. :)
  return fn:subsequence(
        $properties,
        1,
        fn:head(($merge-rule/@max-values, $default-limit))
  )
};

Example XQuery custom merge after conversion:

Hub Central Example:

declare function algorithm:custom-merge-limit(
  $property-name as xs:QName,
  $properties as map:map*,
  $merge-rule as object-node()
) as map:map*
{
  let $default-limit := if ($merge-rule/entityPropertyPath = "Phone") then 5 else 10
  (: Keep at most the configured number of values, starting at the first. :)
  return fn:subsequence(
        $properties,
        1,
        fn:head(($merge-rule/maxValues, $default-limit))
  )
};

Example JavaScript custom merge before conversion:

QuickStart Example:

function customMergeLimit(propertyName, properties, mergeRule)
{
  let defaultLimit = mergeRule.propertyName === "Phone" ? 5 : 10;
  // Keep at most the configured number of values, starting at the first.
  return fn.subsequence(
        properties,
        1,
        mergeRule.maxValues || defaultLimit
  );
}

Example JavaScript custom merge after conversion:

Hub Central Example:

function customMergeLimit(propertyName, properties, mergeRule)
{
  let defaultLimit = mergeRule.entityPropertyPath === "Phone" ? 5 : 10;
  // Keep at most the configured number of values, starting at the first.
  return fn.subsequence(
        properties,
        1,
        mergeRule.maxValues || defaultLimit
  );
}