Getting Started

In this topic:

Overview
Choose Your Tools

Overview

MarkLogic Data Hub makes it easier to ingest and integrate your data in MarkLogic Server databases. You can host your data in MarkLogic Data Hub Service (AWS or Azure), your own cloud servers (AWS or Azure), or your own on-premises servers.

To prepare your data for consumption, you must perform the following processes by creating default or custom steps, adding the steps to flows, and executing the flows.


Process	Default step type	Custom step type (HC format)	Custom step type (QS format)
1. Load. Ingest your raw data.	Ingestion	Custom-Ingestion	Custom-Ingestion
2. Model. Define entity models that the data must be converted to.
3. Map. Convert the data to match the models that you defined.	Mapping	Custom	Custom-Mapping
4. Match. Check for potential duplicates in your data.	Matching	Custom	Custom-Mastering
5. Merge. Combine the confirmed duplicates.	Merging	Custom	Custom-Mastering
Perform other processes anywhere in the flow.		Custom	Custom-Other

The numbered processes in the preceding table must be performed in sequence, because each process depends on the output of the one before it. However, you can insert custom steps to perform additional processing on the data at the start of the flow, at the end of the flow, or between the numbered steps. You can also use interceptors and custom hooks to call custom modules before or after a step's core processes.
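The ordering rule can be illustrated with a small check. This is only a sketch, not part of Data Hub: the function name `validate_flow` and the lowercase step-type strings are assumptions made for illustration, loosely based on the default step types in the table. It checks only relative order, so a flow that contains a subset of the core steps still passes.

```python
# Hypothetical illustration of the ordering rule; not a Data Hub API.
# Core step types in the order the numbered processes require.
# "Model" is omitted because the table lists no step type for it.
CORE_ORDER = ["ingestion", "mapping", "matching", "merging"]


def validate_flow(step_types):
    """Return True if the core step types appear in the required order.

    Steps of type "custom" are skipped, because custom steps may be placed
    at the start, at the end, or between the numbered steps.
    Raises ValueError for a step type that is neither core nor "custom".
    """
    core = [s for s in step_types if s != "custom"]
    unknown = [s for s in core if s not in CORE_ORDER]
    if unknown:
        raise ValueError(f"unknown step types: {unknown}")
    # Map each core step to its position in the required sequence.
    positions = [CORE_ORDER.index(s) for s in core]
    # The flow is valid when those positions never decrease.
    return positions == sorted(positions)
```

For example, `validate_flow(["ingestion", "custom", "mapping"])` returns `True`, while `validate_flow(["mapping", "ingestion"])` returns `False` because mapping would run before any data is loaded.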

Choose Your Tools

You can choose from multiple tools to process your data. MarkLogic provides a graphical user interface (Hub Central), as well as Gradle tasks, Java APIs, a client JAR, REST APIs, and the MarkLogic Content Pump (MLCP).

Different tools work in different environments.


Environment	On-Premises	In Data Hub Service
Development and Test	Use Hub Central.	Use Hub Central. Flow and step configuration structures must be in the HC format.
Production	Use Gradle, other non-GUI tools, or Hub Central. Flow and step configuration structures must be in the HC format.

Tools and Workflow for Data Hub Service Environments

The Hub Central workflow:

Tools and Workflow for On-Premises Environments

The on-premises tools can be categorized into three tracks:

- The GUI track (recommended) provides intuitive graphical interfaces to work with your data. This track is intended for beginners and non-technical business users who need only the default functionality with minimal customization.
  - Hub Central provides much of the same functionality available in the non-GUI tools, and it supports viewing custom steps.
    Note: Hub Central does not support custom step creation.
- The command-line track helps you quickly automate your most common processes, bypassing the GUI. This track is intended for advanced users who need greater flexibility with customization and orchestration.
  - An extensive set of Gradle tasks is provided to automate the creation of artifacts and the execution of flows. Gradle is ideal in a continuous integration environment, where creation, testing, and execution are performed multiple times.
  - The executable Data Hub Client JAR is the ideal alternative for running a flow in a production environment where Gradle and the Data Hub project files are not available. It requires only a JVM.
- The programming track provides the APIs you can use to create apps that run flows to manage and use your data. This track is intended for advanced users who need greater flexibility with customization and orchestration.
  - The Data Hub Java API is provided for running flows in your own Java-based apps or in an external orchestration system that supports Java-based extensions.
  - MLCP and the Data Hub REST extensions provide alternative ways to ingest data into the STAGING database. You can also ingest directly into the FINAL database if you intend to serve the data without curation or other processing.
  - The REST Client API provides some record-management and job-information-retrieval capabilities.
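As an illustration of the command-line track, the sketch below assembles a Gradle invocation for running a flow from a script, for example in a continuous integration job. It assumes the project uses the Gradle wrapper (`./gradlew`) and that the Data Hub Gradle plugin provides the `hubRunFlow` task with the `-PflowName` and `-Psteps` properties; verify the task and property names against the Gradle task reference for your Data Hub version. The helper name `build_run_flow_command` and the flow name `CustomerFlow` are hypothetical.

```python
# Hypothetical helper; it only builds the argument list and runs nothing.
def build_run_flow_command(flow_name, steps=None, gradle="./gradlew"):
    """Build the argument list for running a flow through Gradle.

    flow_name: name of the flow to run (required).
    steps: optional iterable of step numbers; when omitted, no -Psteps
           property is added to the command.
    gradle: path to the Gradle executable or wrapper (assumed ./gradlew).
    """
    if not flow_name:
        raise ValueError("flow_name is required")
    cmd = [gradle, "hubRunFlow", f"-PflowName={flow_name}"]
    if steps:
        # Step numbers are passed as a comma-separated list.
        cmd.append("-Psteps=" + ",".join(str(s) for s in steps))
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project directory. Building the command as a list, rather than as a single string, avoids shell-quoting problems when a flow name contains spaces.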

You can switch between tracks or between tools for different tasks; however, switching from the command-line track to the GUI track might be less convenient because the GUI handles some processes automatically.

The following table organizes the tasks you can run with the tools in each track:

Note: To begin using Hub Central in an on-premises environment, see Create Project and Access Hub Central.


Task	GUI Track	Command-Line Track	Programming Track
Create Project	Using Gradle	Using Gradle
Set Security Credentials		Using Gradle
Create Entity (required for mapping step)	Using Hub Central	Using Gradle
Create Flow	Using Hub Central	Using Gradle
Create Step	Using Hub Central: Loading Step, Mapping Step, Matching Step, Merging Step, Custom Step	Using Gradle
Create Mapping	Mapping Step	Manually
Manage Steps in a Flow	Using Hub Central	Manually
Run Flow	Using Hub Central	Using Gradle; Using Data Hub Client JAR; Using MLCP (for ingestion only)	Using Data Hub API; Using REST (for ingestion only)
Merge Records Outside a Flow		Using Gradle	Using REST Client API
Unmerge a Record Outside a Flow		Using Gradle	Using REST Client API
Deploy	to Data Hub Service
Redeploy		Using Gradle