QuickStart Tutorial for Data Hub 5.x

Overview

In this tutorial, you will integrate data from two data sources with different field names. To do so, you will:
  • define an entity model to standardize the data structure, and
  • assemble multiple flows with various steps to do the following:
    • ingest the data from the two different source datasets (ingestion step),
    • map the source data fields with the properties of the entity model (mapping step), and
    • identify and merge duplicate records (mastering step).
This tutorial uses the sample data files from the Sunrise Insurance example.
  • The Advantage dataset consists of JSON files.
  • The Bedrock dataset consists of a single CSV file.
You will populate the new database with the two sets of customer data.

Most of the customer data fields are essentially the same; however, the field names differ. For example, Advantage has FirstName, whereas Bedrock has first_name. In addition, Advantage has a Phone field, but Bedrock has no equivalent. You will define an entity model to serve as the standard and then map the source fields to the entity properties.

Finally, the combined dataset might contain duplicate records. Therefore, you will set up a mastering step that defines the criteria for what is considered a duplicate and the actions to take when the resulting match score meets or exceeds a threshold.

Before you begin

You need:
  • a running MarkLogic Server instance that this version of Data Hub supports, and
  • a Java runtime, which is required to run the QuickStart .war file.

About this task

Where available, the following steps link to detailed instructions in the documentation; however, use the values specified in each step below for the settings.

Procedure

You must first create a project before you can create any entities or flows. Creating a project also installs the Data Hub application.
  1. Download files.
    1. Create a directory to serve as your project's root.
    2. Download the latest Data Hub QuickStart .war file to your project's root.
    3. Download and unzip the data files for the Sunrise Insurance example.
    4. Copy the datasets directory from the expanded insurance example into your project's root directory.

    Example: If your project's root directory is Q:\data-hub, the folder structure will look as follows:

       Q:\data-hub
      │   marklogic-datahub-5.0.4.war
      └───datasets
          ├───advantage
          │       ...
          └───bedrock
                  ...
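
    With the files in place, you can start QuickStart by running the .war from the project root (java -jar marklogic-datahub-5.0.4.war) and then opening the QuickStart UI in a browser, by default at http://localhost:8080.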
    
  2. Create a project.
An entity is the standardized set of properties of your data. An entity definition is required for the mapping step.
  1. In QuickStart, create an entity.
    1. Set Title to Customer.
    2. Add properties to the entity.
      Name       Type           Cardinality
      id         string         1..1
      firstname  string         1..1
      lastname   string         1..1
      postal     string         1..1
      phone      string         1..1
      email      string         1..1
      pin        int (integer)  1..1
      updated    dateTime       1..1
    3. Click Save.
    4. If prompted to update the index, click No.
      At this point, the database does not include any significant data; therefore, reindexing is not needed yet.

    Customer entity
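
    Under the hood, QuickStart saves the model as an Entity Services descriptor in your project directory. A minimal sketch of what the saved Customer descriptor might look like (the version and baseUri values here are illustrative):

       {
        "info": {
          "title": "Customer",
          "version": "0.0.1",
          "baseUri": "http://example.org/"
        },
        "definitions": {
          "Customer": {
            "required": [],
            "properties": {
              "id":        { "datatype": "string" },
              "firstname": { "datatype": "string" },
              "lastname":  { "datatype": "string" },
              "postal":    { "datatype": "string" },
              "phone":     { "datatype": "string" },
              "email":     { "datatype": "string" },
              "pin":       { "datatype": "int" },
              "updated":   { "datatype": "dateTime" }
            }
          }
        }
      }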

In the next section, you will create and run one flow for each source dataset. Each flow will contain an ingestion step and a mapping step.
  1. Ingest and map the Advantage data.
    1. Create a flow.
      • Set Flow Name to AdvantageFlow.
      • Leave other settings as is.
    2. Create an ingestion step.
      • Set Name to AdvantageIngest.
      • Leave other settings as is.
    3. Configure the ingestion step.
      Set the fields as follows:
      Field                   Value
      Source Directory Path   datasets/advantage
      Source Format           JSON
      Target Format           JSON
      Target Permissions      rest-reader,read,rest-writer,update
      Target URI Replacement  .*datasets,''
    4. Run the flow to ingest the data.

      The STAGING database will contain 100 JSON documents in the AdvantageIngest collection.


      100 JSON documents in the AdvantageIngest collection
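
      To double-check the count outside QuickStart, you can run a one-line query in Query Console against the STAGING database (optional; a sketch assuming you have Query Console access):

         // Run against the STAGING database in Query Console:
         // counts the documents created by the AdvantageIngest step.
         cts.estimate(cts.collectionQuery("AdvantageIngest")); // expect 100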

    5. Create a mapping step.
      Set the fields as follows:
      Field              Value
      Name               AdvantageMap
      Source Type        Collection
      Source Collection  AdvantageIngest
      Target Entity      Customer
    6. Map the source fields to the entity properties.
      Source Field  Entity Property
      CustomerID    id
      FirstName     firstname
      LastName      lastname
      Postal        postal
      Phone         phone
      Email         email
      PIN           pin
      Updated       updated

      Mapping the Advantage dataset fields against the entity properties.
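
      QuickStart also saves the mapping as a JSON artifact in your project. Its core is a properties object that records which source field each entity property is sourcedFrom; a trimmed sketch (surrounding metadata omitted, and the exact layout can vary across Data Hub 5.x releases):

         "properties": {
           "id":        { "sourcedFrom": "CustomerID" },
           "firstname": { "sourcedFrom": "FirstName" },
           "lastname":  { "sourcedFrom": "LastName" },
           "postal":    { "sourcedFrom": "Postal" },
           "phone":     { "sourcedFrom": "Phone" },
           "email":     { "sourcedFrom": "Email" },
           "pin":       { "sourcedFrom": "PIN" },
           "updated":   { "sourcedFrom": "Updated" }
         }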

    7. Run the flow, but select only the mapping step.
      Because the flow includes both the ingestion step and the mapping step, running the complete flow will ingest the same documents again. If you have not made changes to any of the documents, then you can safely rerun the ingestion. The second run will simply overwrite the documents ingested in the first run.

      The FINAL database will contain 100 mapped documents in the AdvantageMap collection. Each document or record will have an instance of the Customer entity.


      100 JSON documents in the AdvantageMap collection

      In the resulting record:
      • $.envelope.instance contains the mapped information, using the Customer entity's properties.
      • $.envelope.attachments.envelope.instance contains the original ingested data, using the source field names.
      The contents of an example mapped record:
         {
          "envelope": {
            "headers": {...},
            "triples": [],
            "instance": {
              "Customer": {
                "id": "0881e206-a488-4f6d-a09b-ff18ed412998",
                "firstname": "Oneal",
                "lastname": "Banks",
                "postal": "43130-7986",
                "phone": "(913) 462-2899",
                "email": "onealbanks@comvex.com",
                "pin": 6880,
                "updated": "2018-02-10T19:54:20"
              },
              "info": {...}
            },
            "attachments": {
              "envelope": {
                "headers": {...},
                "triples": [],
                "instance": {
                  "ObjectID": {
                    "$oid": "5cd0da4d857d8461bec88893"
                  },
                  "CustomerID": "0881e206-a488-4f6d-a09b-ff18ed412998",
                  "FirstName": "Oneal",
                  "LastName": "Banks",
                  "Email": "onealbanks@comvex.com",
                  "Postal": "43130-7986",
                  "Phone": "(913) 462-2899",
                  "PIN": 6880,
                  "Updated": "2018-02-10T19:54:20"
                },
                "attachments": null
              }
            }
          }
        }
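
      To pull up a harmonized record yourself, you can run a short query in Query Console against the FINAL database (optional; fetches an arbitrary document from the collection):

         // Run against the FINAL database in Query Console:
         // fetch one mapped document and return its Customer instance.
         const doc = fn.head(cts.search(cts.collectionQuery("AdvantageMap"))).toObject();
         doc.envelope.instance.Customer;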
      
  2. Ingest and map the Bedrock data.
    1. Create a flow.
      • Set Flow Name to BedrockFlow.
      • Leave other settings as is.
    2. Create an ingestion step.
      • Set Name to BedrockIngest.
      • Leave other settings as is.
    3. Configure the ingestion step.
      Set the fields as follows:
      Field                   Value
      Source Directory Path   datasets/bedrock
      Source Format           Delimited Text
      Field Separator         ,
      Target Format           JSON
      Target Permissions      rest-reader,read,rest-writer,update
      Target URI Replacement  .*datasets,''
    4. Run the flow to ingest the data.

      The STAGING database will contain 100 JSON documents in the BedrockIngest collection.


      100 JSON documents in the BedrockIngest collection
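
      Unlike the Advantage files, which are already JSON, each row of the Bedrock CSV becomes its own JSON document whose keys are the column headers. A sketch of one ingested Bedrock document (field names match the mapping table below; the values are illustrative, and CSV values typically arrive as strings):

         {
           "envelope": {
             "headers": {...},
             "triples": [],
             "instance": {
               "id": "c2288315-bb99-400d-afdb-554ef40117d3",
               "first_name": "Jane",
               "last_name": "Doe",
               "zip": "12345-6789",
               "email": "jane.doe@example.com",
               "pin": "4321",
               "last_updated": "2018-06-01T12:00:00"
             },
             "attachments": null
           }
         }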

    5. Create a mapping step.
      Set the fields as follows:
      Field              Value
      Name               BedrockMap
      Source Type        Collection
      Source Collection  BedrockIngest
      Target Entity      Customer
    6. Map the source fields to the entity properties.
      Source Field    Entity Property
      id              id
      first_name      firstname
      last_name       lastname
      zip             postal
      (leave blank)   phone
      email           email
      pin             pin
      last_updated    updated
      Because Bedrock has no equivalent of the Phone field, leave the source field for phone blank.
    7. Run the flow, but select only the mapping step.

      The FINAL database will contain 100 mapped documents in the BedrockMap collection. Each document or record will have an instance of the Customer entity.


      100 JSON documents in the BedrockMap collection

In the following section, you will remove possible duplicates in the ingested documents. To do so, you need to create a mastering step. You can add it to an existing flow, but in this exercise, you'll create a new flow with only one step. See About Mastering.
  1. Find and merge duplicate records.
    1. Create a flow.
      • Set Flow Name to CustomerMastering.
      • Leave other settings as is.
    2. Create a mastering step.
      Set the fields as follows:
      Field          Value
      Name           CustomerMaster
      Source Type    Query
      Source Query   cts.collectionQuery(["AdvantageMap", "BedrockMap"])
      Target Entity  Customer
      Note: A query is needed so that documents from both datasets can be mastered together.
    3. Configure the mastering step.

      The mastering step consists of two sets of options:

      • The match options specify how to compare two records. The weights assigned to the matched properties indicate the degree of similarity between the records.
      • The merge options specify how to combine two or more records when the sum of the weights meets or exceeds a threshold.
      • Add the following match options:
        Match Type  Property to Match  Weight
        Exact       firstname          5
        Exact       lastname           10
        Exact       email              20
      • Add the following match threshold:
        Field             Value
        Name              Definite
        Weight Threshold  25
        Action            Merge
        The weights are used to produce a score that is compared with the match thresholds. In this example, if the comparison score meets or exceeds 25 (Weight Threshold), the matching records will be merged (Action). Based on the weights assigned to the match options, this threshold can be met or exceeded if:
        • the email matches and the firstname matches (weights: 20 + 5 = 25), or
        • the email matches and the lastname matches (weights: 20 + 10 = 30).

      Match options

      • Add the following merge options:
        Merge Type  Property  Max Values  Length Weight
        Standard    postal    1           10
        Standard    id        1
        Standard    updated   1
        These merge options ensure that a merged record will have only one value for each of the specified properties. The length weight for the postal property causes the longer 9-digit postal codes to be preferred over the 5-digit ones.
      • Set Timestamp to /envelope/instance/Customer/updated.
        This sorts the matching records in reverse chronological order (newest first) based on the value of the property found at the given timestamp path.
      • Add the following merge collection:
        Event Collections to Set
        onMerge customer-merge

      Merge options

    4. Run the flow to merge matching records.
      Two pairs of customer entities will be merged, and the merged records will be placed in the customer-merge collection.
      Mastering results
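
      To list the merged documents yourself, you can run a short query in Query Console against the FINAL database (optional):

         // Run against the FINAL database in Query Console:
         // URIs of the merged records created by the mastering step.
         cts.uris(null, null, cts.collectionQuery("customer-merge"));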

      Each merged record has a $.envelope.headers.merges section containing the URIs of the original matching records.

      Example merges section of a merged record
         "merges": [
          {
            "document-uri": "/advantage/cust14.json",
            "last-merge": "2019-09-10T15:42:03.2017696-07:00"
          },
          {
            "document-uri": "/bedrock/c2288315-bb99-400d-afdb-554ef40117d3.json",
            "last-merge": "2019-09-10T15:42:03.2017696-07:00"
          }
        ],