Deploy to Data Hub Service

Data Hub Service

You can deploy your Data Hub project in the cloud instead of setting up your own MarkLogic cluster. The Data Hub Service (DHS) is a cloud-based solution that provides a preconfigured MarkLogic cluster in which you can run flows and from which you can serve harmonized data.

Use MarkLogic Data Hub to develop and test your project locally (your development environment), then deploy it to a DHS cluster (your production environment).

Tip: You can have multiple services that use the same Data Hub project files. For example, you can set up a DHS project as a testing environment and another as your production environment.

In a DHS environment, the databases, app servers, and security roles are automatically set up. Admins can create user accounts.

To learn more about Data Hub Service (DHS), see Data Hub Service and the DHS documentation.

The following configurations might be different between Data Hub projects and DHS projects:

  • Roles — The DHS roles are automatically created as part of provisioning your DHS environment.
    Data Hub          DHS
    data-hub-admin    endpointDeveloper
                      endpointUser
    flow-developer    flowDeveloper
    flow-operator     flowOperator

    See also: Data Hub Service Instance Security Roles

  • Database names — If database names are customized in the Data Hub environment, they might be different.
  • Gradle settings — The gradle.properties file contains some DHS-only settings, including mlIsHostLoadBalancer and mlIsProvisionedEnvironment, which are set to true to enable Data Hub to work correctly in DHS.
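
As a quick sanity check of the Gradle settings described above, the following sketch reports whether a properties file sets both DHS-only flags to true. The function name check_dhs_settings is hypothetical and not part of Data Hub. The sketch assumes that each property appears on its own line in key=value form.

```shell
#!/bin/sh
# Hypothetical helper (not part of Data Hub): checks that the two DHS-only
# settings named above are set to true in a Gradle properties file.
# Assumes one key=value pair per line.
check_dhs_settings() {
  file="$1"
  status=0
  for key in mlIsHostLoadBalancer mlIsProvisionedEnvironment; do
    # Match "key=true", allowing spaces around "=" and trailing whitespace.
    if grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*true[[:space:]]*$" "$file"; then
      echo "ok: ${key}=true"
    else
      echo "missing or not true: ${key}"
      status=1
    fi
  done
  return "$status"
}
```

For example, run check_dhs_settings gradle-dhs.properties from the project root. A nonzero exit status means at least one of the two settings is absent or not set to true.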

The following configurations are the same between Data Hub projects and DHS projects:

  • Ports and load balancers for app servers
    app servers    ports    DHS load balancers
    staging        8010     curation
    final          8011     operations
    jobs           8013     analytics

    Important: Use port 8004 to deploy the Data Hub Framework core only. To deploy custom plugins (REST extensions, search options, etc.) against the FINAL database, use port 8011.

If your endpoints are private, you need a bastion host inside a virtual private cloud (VPC) that can access the MarkLogic VPC. The bastion host securely relays:
  • the requests from the outside world to MarkLogic
  • the results from MarkLogic to the requester

If your endpoints are publicly available, you can use any machine that is set up as a peer of the MarkLogic VPC. See Create the Peer Role - AWS.

Important: The Data Hub QuickStart tool cannot be used in DHS.

Before you begin

  • A Data Hub project that has been set up and tested locally
  • A provisioned MarkLogic Data Hub Service environment
    Important: You must contact Support to upgrade your DHS environment to use Data Hub v5.0.
  • For private endpoints only: A bastion host inside a virtual private cloud (VPC)
  • Information from your DHS administrator:
    • Your DHS host name (typically, the curation endpoint)
    • REST curation endpoint URL (including port number) for testing
    • The username and password of the user account associated with each of the following roles. (See Creating a DHS Instance User Account.)
      • endpointDeveloper
      • endpointUser
      • flowDeveloper
      • flowOperator

Procedure

  1. Copy your entire Data Hub project directory to the machine from which you will access the endpoints, and perform the following steps on that machine.
    Important: If your endpoints are private, this machine must be a bastion host.
  2. Open a command-line window, and navigate to your Data Hub project root directory.
  3. Set up your gradle-dhs.properties file.
    1. Download the Gradle configuration file from your Data Hub Service instance to your project root.
      Note: By default, the downloaded file is named gradle-dhs.properties. If you use a different filename,
      • The filename must be in the format gradle-{env}.properties, where {env} is any string you want to represent an environment. For example, you can store the settings for your development environment in gradle-dev.properties.
      • Remember to update the value of the -PenvironmentName parameter to {env} in the Gradle commands in the following steps.
    2. Set the values for the usernames and passwords as indicated in the configuration file.
  4. Deploy your modules and update the indexes.
    Linux/macOS: ./gradlew dhsDeploy -PenvironmentName=dhs -i
    Windows:     gradlew.bat dhsDeploy -PenvironmentName=dhs -i
  5. Run a flow with an ingestion step.

    You can use any of the following:

  6. Run a flow with a mapping step and/or a mastering step.
    Linux/macOS: ./gradlew hubRunFlow -PflowName=your-flow-name -PentityName=your-entity-name -PenvironmentName=dhs -i
    Windows:     gradlew.bat hubRunFlow -PflowName=your-flow-name -PentityName=your-entity-name -PenvironmentName=dhs -i
    Important: If the value of a Gradle parameter contains a blank space, you must enclose the value in double quotation marks. If the value does not contain a blank space, you must not enclose the value in quotation marks.
  7. Verify that your documents are in the databases.
    1. In the following URLs, replace OPERATIONS-REST-ENDPOINT-URL and CURATION-REST-ENDPOINT-URL with the appropriate endpoint URLs from your DHS administrator.
      Final database:   http://OPERATIONS-REST-ENDPOINT-URL:8011/v1/search
      Staging database: http://CURATION-REST-ENDPOINT-URL:8010/v1/search

      Example: http://internal-mlaas-xxx-xxx-xxx.us-west-2.elb.amazonaws.com:8011/v1/search

      Tip: Narrow the search to return fewer items. See MarkLogic REST API Search.
    2. In a web browser, navigate to one of the URLs.
    The result is an XML list of all your documents in the database. Each item in the list includes the document's URI, path, and other metadata, as well as a preview of the content.
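
The verification step can also be run from the command line instead of a browser. The sketch below is one possible approach, not the documented procedure. The helper names build_search_url and search_db are hypothetical, and the use of curl with --anyauth is an assumption about how your DHS endpoints authenticate. The pageLength parameter of the MarkLogic REST API limits the number of results returned.

```shell
#!/bin/sh
# Hypothetical helper: builds a /v1/search URL that returns only the
# first few results. The page length defaults to 5 if not given.
build_search_url() {
  endpoint="$1"
  port="$2"
  page_length="${3:-5}"
  echo "http://${endpoint}:${port}/v1/search?pageLength=${page_length}"
}

# Hypothetical helper: queries the endpoint as the given user.
# Assumption: the endpoint accepts an authentication scheme that curl can
# negotiate with --anyauth. curl prompts for the password.
search_db() {
  user="$1"
  curl --anyauth --user "$user" "$(build_search_url "$2" "$3" "$4")"
}
```

For example, search_db your-user OPERATIONS-REST-ENDPOINT-URL 8011 queries the final database, and search_db your-user CURATION-REST-ENDPOINT-URL 8010 queries the staging database. Replace the placeholders with the values from your DHS administrator.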

What to do next

If you update your flows after the initial project upload, you can redeploy your flow updates by running the dhsDeploy Gradle task again and then running the flows.
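
The redeploy-then-run cycle can be wrapped in a small script. The sketch below assumes a Linux or macOS shell and reuses the two Gradle tasks from the procedure. The function name redeploy_and_run and the GRADLEW variable are hypothetical conveniences. The environment name must match the {env} part of your gradle-{env}.properties file.

```shell
#!/bin/sh
# Path to the Gradle wrapper; override GRADLEW to use a different launcher.
GRADLEW="${GRADLEW:-./gradlew}"

# Hypothetical helper: redeploys the project, then runs one flow.
# Arguments: environment name, flow name, entity name.
# The flow runs only if the deployment succeeds.
redeploy_and_run() {
  env_name="$1"
  flow_name="$2"
  entity_name="$3"
  "$GRADLEW" dhsDeploy -PenvironmentName="$env_name" -i &&
  "$GRADLEW" hubRunFlow -PflowName="$flow_name" -PentityName="$entity_name" \
    -PenvironmentName="$env_name" -i
}
```

For example, redeploy_and_run dhs your-flow-name your-entity-name, run from the project root, uses the settings in gradle-dhs.properties. Because each value is passed as a quoted shell variable, a value that contains a blank space reaches Gradle as a single argument.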