Getting Started

Overview

MarkLogic Data Hub makes it easier to ingest and integrate your data in MarkLogic Server databases. You can host your data in MarkLogic Data Hub Service (AWS or Azure), your own cloud servers (AWS or Azure), or your own on-premises servers.

To prepare your data for consumption, you must perform the following processes by creating default or custom steps, adding the steps to flows, and executing the flows.

Process | Default step type | Custom step type (HC format) | Custom step type (QS format)
1. Load. Ingest your raw data. | Ingestion | Custom-Ingestion | Custom-Ingestion
2. Model. Define entity models that the data must be converted to. | - | - | -
3. Map. Convert the data to match the models that you defined. | Mapping | Custom | Custom-Mapping
4. Match. Check for potential duplicates in your data. | Matching | Custom | Custom-Mastering
5. Merge. Combine the confirmed duplicates. | Merging | Custom | Custom-Mastering
Perform other processes anywhere in the flow. | - | Custom | Custom-Other

The numbered processes in the preceding table must be performed in sequence, because each process depends on the output of the previous one. However, you can insert custom steps to perform additional processing on the data at the start of the flow, at the end of the flow, or between the numbered steps. You can also use interceptors and custom hooks to call custom modules before or after a step's core processes.
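The ordering rule can be illustrated with a minimal sketch. It is a hypothetical model written for this page, not part of any Data Hub API: the function name, the lowercase step-type labels, and the rank numbers are assumptions that mirror the numbered processes (Model has no step type, so rank 2 is unused). It also assumes that repeating a step type is acceptable, which the text does not state.

```python
# Hypothetical model of the ordering rule; none of these names come from Data Hub.
# Ranks follow the numbered processes: 1 Load, 3 Map, 4 Match, 5 Merge.
CORE_RANK = {"ingestion": 1, "mapping": 3, "matching": 4, "merging": 5}


def first_out_of_order(step_types):
    """Return the index of the first core step that runs too late, or None.

    Steps labeled "custom" are skipped because they may appear anywhere.
    An unrecognized label raises KeyError.
    """
    highest_rank_seen = 0
    for index, step_type in enumerate(step_types):
        if step_type == "custom":
            continue
        rank = CORE_RANK[step_type]
        if rank < highest_rank_seen:
            # This core step depends on output that a later-ranked step already consumed.
            return index
        highest_rank_seen = rank
    return None


valid_flow = ["custom", "ingestion", "mapping", "custom", "matching", "merging"]
invalid_flow = ["ingestion", "matching", "mapping"]
print(first_out_of_order(valid_flow))
print(first_out_of_order(invalid_flow))
```

The first flow passes because its core steps never move backward in rank. The second is flagged at the mapping step, which appears after matching.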

Choose Your Tools

You can choose from multiple tools to process your data. MarkLogic provides a graphical user interface (Hub Central), as well as Gradle tasks, Java APIs, a client JAR, REST APIs, and the MarkLogic Content Pump (MLCP).

Different tools work in different environments.

Environment | On-Premises | In Data Hub Service
Development and Test | Use Hub Central. | Use Hub Central. Flow and step configuration structures must be in the HC format.
Production | Use Gradle, other non-GUI tools, or Hub Central. | Flow and step configuration structures must be in the HC format.

Tools and Workflow for Data Hub Service Environments

The Hub Central workflow:

  1. Load your data.
  2. Create an entity model.
  3. Curate your data.
  4. Perform custom processing on your data.
  5. Explore your data.

Tools and Workflow for On-Premises Environments

The on-premises tools can be categorized into three tracks:

  • The GUI track (recommended) provides intuitive graphical interfaces to work with your data. This track is intended for beginners and non-technical business users, who only need the default functionality with minimal customization.
    • Hub Central provides much of the same functionality available in the non-GUI tools, and it supports viewing custom steps.
      Note: Hub Central does not support custom step creation.
  • The command-line track helps you to quickly automate your most common processes, bypassing the GUI. This track is intended for advanced users, who need greater flexibility with customization and orchestration.
    • An extensive set of Gradle tasks is provided to automate the creation of artifacts and the execution of flows. Gradle is ideal in a continuous integration environment, where creation, testing, and execution are performed multiple times.
    • The executable Data Hub Client JAR is an alternative for running a flow in a production environment where Gradle and the Data Hub project files are not available. It requires only a JVM.
  • The programming track provides the APIs you can use to create apps that run flows to manage and use your data. This track is intended for advanced users, who need greater flexibility with customization and orchestration.
    • The Data Hub Java API is provided for running flows in your own Java-based apps or in an external orchestration system that supports Java-based extensions.
    • MLCP and the Data Hub REST extensions provide alternative ways to ingest data into the STAGING database. You can also ingest directly into the FINAL database if you intend to serve the data without curation or other processing.
    • The REST Client API provides some record-management and job-information-retrieval capabilities.

You can switch between tracks or between tools for different tasks; however, switching from the command-line track to the GUI track might be less convenient because the GUI handles some processes automatically.

The following table organizes the tasks you can run with the tools in each track:

Note: To begin using Hub Central in an on-premises environment, see Create Project and Access Hub Central.

Task | GUI Track | Command-Line Track | Programming Track
Create Project | - | Using Gradle | -
Set Security Credentials | - | Using Gradle | -
Create Entity (required for mapping step) | - | Using Gradle | -
Create Flow | - | Using Gradle | -
Create Step | - | - | -
Create Mapping (Mapping Step) | - | Manually | -
Manage Steps in a Flow | - | Manually | -
Run Flow | - | - | -
Merge Records Outside a Flow | - | Using Gradle | Using REST Client API
Unmerge a Record Outside a Flow | - | Using Gradle | Using REST Client API
Deploy to Data Hub Service | - | - | -
Redeploy | - | Using Gradle | -