Deploy to Data Hub Service

You can deploy your Data Hub project in the cloud instead of setting up your own environment. The Data Hub Service (DHS) is a cloud-based solution that provides a preconfigured MarkLogic cluster in which you can run flows and from which you can serve harmonized data.

Use MarkLogic Data Hub to develop and test your project locally (your development environment), and then deploy it to a DHS cluster (your production environment).

Tip: You can have multiple services that use the same Data Hub project files. For example, you can set up a DHS project as a testing environment and another as your production environment.

In a DHS environment, the databases, app servers, and security roles are automatically set up. Admins can create user accounts.

To learn more about Data Hub Services (DHS), see Data Hub Services and the DHS documentation.

The following configurations might be different between Data Hub projects and DHS projects:

  • Roles — The DHS roles are automatically created as part of provisioning your DHS environment.

    Data Hub          DHS
                      data-hub-admin
                      endpointDeveloper
                      endpointUser
    flow-developer    flowDeveloper
    flow-operator     flowOperator
  • Database names — If database names are customized in the Data Hub environment, they might be different.
  • Gradle settings — The gradle.properties file contains some DHS-only settings, including mlIsHostLoadBalancer and mlIsProvisionedEnvironment, which are set to true to enable Data Hub to work correctly in DHS.
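For example, the DHS-only entries in gradle.properties look like the following:

```
# Both must be true for Data Hub to work correctly in DHS.
mlIsHostLoadBalancer=true
mlIsProvisionedEnvironment=true
```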

The following configurations are the same between Data Hub projects and DHS projects:

  • Ports and load balancers for app servers

    App server    Port    DHS load balancer
    staging       8010    curation
    final         8011    operations
    jobs          8013    analytics

    Important: Use port 8004 to deploy the Data Hub Framework core only. To deploy custom plugins (REST extensions, search options, etc.) against the FINAL database, use port 8011.
If your endpoints are private, you need a bastion host inside a virtual private cloud (VPC) that can access the MarkLogic VPC. The bastion host securely relays:
  • the requests from the outside world to MarkLogic
  • the results from MarkLogic to the requester

If your endpoints are publicly available, you can use any machine that is set up as a peer of the MarkLogic VPC.
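The relay that a bastion host performs can be sketched as an SSH local port forward. In this sketch the bastion address, SSH user, and endpoint host are hypothetical placeholders, not values from your DHS environment:

```shell
#!/bin/sh
# Hypothetical values -- substitute your own bastion host and private DHS endpoint.
BASTION="ec2-user@bastion.example.com"
CURATION_ENDPOINT="internal-mlaas-xxx-xxx-xxx.us-west-2.elb.amazonaws.com"

# Forward local port 8010 through the bastion to the private curation endpoint,
# so requests from this machine are relayed into the MarkLogic VPC.
TUNNEL_CMD="ssh -N -L 8010:${CURATION_ENDPOINT}:8010 ${BASTION}"
echo "$TUNNEL_CMD"
```

While such a tunnel is open, requests to localhost:8010 reach the private curation endpoint, and the responses are relayed back to the requester.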

Important: The Data Hub QuickStart tool cannot be used in DHS.

Before you begin

  • A Data Hub project that has been set up and tested locally
  • A provisioned MarkLogic Data Hub Service environment
    Important: You must contact Support to upgrade your DHS environment to use Data Hub v5.0.
  • For private endpoints only: A bastion host inside a virtual private cloud (VPC)
  • Information from your DHS administrator:
      • Your DHS host name (typically, the curation endpoint)
      • REST curation endpoint URL (including port number) for testing
      • The username and password of the user account associated with each of the following roles. (See Creating a User.)
        • endpointDeveloper
        • endpointUser
        • flowDeveloper
        • flowOperator

Procedure

  1. Copy your entire Data Hub project directory to the machine from which you will access the endpoints, and perform the following steps on that machine.
    Important: If your endpoints are private, this machine must be a bastion host.
  2. Open a command-line window, and navigate to your Data Hub project root directory.
  3. At your project root, create a new gradle-DHS.properties file.
    Note: If you use a different name for the properties file:
    • The filename must be in the format gradle-{env}.properties, where {env} is any string you want to represent an environment. For example, you can store the settings for your development environment in gradle-dev.properties.
    • Remember to update the value of the -PenvironmentName parameter to {env} in the Gradle commands in the following steps.
    1. Copy the following code to the new file.
        mlDHFVersion=YOUR_DHF_VERSION
        mlHost=YOUR_DHS_HOSTNAME
      
        mlIsHostLoadBalancer=true
      
        mlUsername=YOUR_FLOW_OPERATOR_USER
        mlPassword=YOUR_FLOW_OPERATOR_PASSWORD
        mlManageUsername=YOUR_FLOW_DEVELOPER_USER
        mlManagePassword=YOUR_FLOW_DEVELOPER_PASSWORD
      
        mlStagingAppserverName=data-hub-STAGING
        mlStagingPort=8010
        mlStagingDbName=data-hub-STAGING
        mlStagingForestsPerHost=1
      
        mlFinalAppserverName=data-hub-FINAL
        mlFinalPort=8011
        mlFinalDbName=data-hub-FINAL
        mlFinalForestsPerHost=1
      
        mlJobAppserverName=data-hub-JOBS
        mlJobPort=8013
        mlJobDbName=data-hub-JOBS
        mlJobForestsPerHost=1
      
        mlModulesDbName=data-hub-MODULES
        mlStagingTriggersDbName=data-hub-staging-TRIGGERS
        mlStagingSchemasDbName=data-hub-staging-SCHEMAS
      
        mlFinalTriggersDbName=data-hub-final-TRIGGERS
        mlFinalSchemasDbName=data-hub-final-SCHEMAS
      
        mlModulePermissions=flowDeveloper,read,flowDeveloper,execute,flowDeveloper,insert,flowOperator,read,flowOperator,execute,flowOperator,insert
      
        mlIsProvisionedEnvironment=true
      
    2. Replace the values.
      • mlDHFVersion: The DHF version to use in your production environment.
      • mlHost: The name of your DHS host.
        Tip: The host name is the domain name of the DHS final endpoint (remove http:// and the : and port number from the endpoint URL).
      • mlUsername and mlPassword: The username and password of the user account assigned to the flowOperator role.
        Note: This can also be a user account assigned to the flowDeveloper role if additional permissions are required.
      • mlManageUsername and mlManagePassword: The username and password of the user account assigned to the flowDeveloper role.
      • ml*DbName: The names of the DHS databases, if customized.
      • ml*AppserverName: The names of the DHS app servers, if customized.
      • ml*Port: The ports that your DHS project is configured with, if not the defaults.
  4. Install the Data Hub core modules.
    Linux/Mac: ./gradlew hubInstallModules -PenvironmentName=DHS
    Windows: gradlew.bat hubInstallModules -PenvironmentName=DHS
  5. Install the plugins for your project.
    Linux/Mac: ./gradlew mlLoadModules -PenvironmentName=DHS
    Windows: gradlew.bat mlLoadModules -PenvironmentName=DHS
  6. If you are using Data Hub 4.0.2 or later, load the indexes in the DHS databases.
    Linux/Mac: ./gradlew mlUpdateIndexes -PenvironmentName=DHS
    Windows: gradlew.bat mlUpdateIndexes -PenvironmentName=DHS
  7. Run a flow with an ingestion step.

    You can use any of the following:

  8. Run a flow with a mapping step and/or a mastering step.
    Linux/Mac: ./gradlew hubRunFlow -PflowName=your-flow-name -PentityName=your-entity-name -PenvironmentName=DHS
    Windows: gradlew.bat hubRunFlow -PflowName=your-flow-name -PentityName=your-entity-name -PenvironmentName=DHS
    Important: If the value of a Gradle parameter contains a blank space, you must enclose the value in double quotation marks. If the value does not contain a blank space, you must not enclose the value in quotation marks.
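The reason for the quoting rule is shell word-splitting. A quick sketch (pure shell, with a hypothetical flow name and no actual Gradle invocation) shows how an unquoted space splits one parameter value into two arguments:

```shell
#!/bin/sh
# count prints how many arguments it receives -- a stand-in for Gradle's argv.
count() { echo $#; }

# Quoted: the space stays inside a single argument value.
N_QUOTED=$(count hubRunFlow -PflowName="My Flow" -PenvironmentName=DHS)
# Unquoted: the shell splits the value at the space into two arguments.
N_UNQUOTED=$(count hubRunFlow -PflowName=My Flow -PenvironmentName=DHS)
echo "quoted: $N_QUOTED args, unquoted: $N_UNQUOTED args"
```

The quoted form passes three arguments; the unquoted form passes four, which Gradle would misparse.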
  9. Verify that your documents are in the databases.
    1. In the following URLs, replace OPERATIONS-REST-ENDPOINT-URL and CURATION-REST-ENDPOINT-URL with the appropriate endpoint URLs from your DHS administrator.
      Final database: http://OPERATIONS-REST-ENDPOINT-URL:8011/v1/search
      Staging database: http://CURATION-REST-ENDPOINT-URL:8010/v1/search

      Example: http://internal-mlaas-xxx-xxx-xxx.us-west-2.elb.amazonaws.com:8011/v1/search

      Tip: Narrow the search to return fewer items. See MarkLogic REST API Search.
    2. In a web browser, navigate to one of the URLs.
    The result is an XML list of all your documents in the database. Each item in the list includes the document's URI, path, and other metadata, as well as a preview of the content.
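You can also query the search endpoint from the command line. In this sketch the endpoint host is a hypothetical placeholder, and ML_USER/ML_PASS stand for the credentials supplied by your DHS administrator:

```shell
#!/bin/sh
# Hypothetical operations endpoint host -- replace with your own.
OPERATIONS_HOST="internal-mlaas-xxx-xxx-xxx.us-west-2.elb.amazonaws.com"
# pageLength narrows the search so fewer items are returned.
SEARCH_URL="http://${OPERATIONS_HOST}:8011/v1/search?pageLength=5"
echo "$SEARCH_URL"
# Uncomment to query a live endpoint (--anyauth negotiates the auth method):
# curl --anyauth -u "$ML_USER:$ML_PASS" "$SEARCH_URL"
```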

What to do next

If you update your flows after the initial project upload, you can redeploy them by running ./gradlew mlLoadModules -PenvironmentName=DHS again and then running the flows.