Getting Started
Overview
MarkLogic Data Hub makes it easier to ingest and integrate your data in MarkLogic Server databases. You can host your data in MarkLogic Data Hub Service (AWS or Azure), your own cloud servers (AWS or Azure), or your own on-premises servers.
To prepare your data for consumption, you must perform the following processes by creating default or custom steps, adding the steps to flows, and executing the flows.
Process | Default step type | Custom step type (HC format) | Custom step type (QS format) |
---|---|---|---|
1. Load. Ingest your raw data. | Ingestion | Custom-Ingestion | Custom-Ingestion |
2. Model. Define entity models that the data must be converted to. | | | |
3. Map. Convert the data to match the models that you defined. | Mapping | Custom | Custom-Mapping |
4. Match. Check for potential duplicates in your data. | Matching | Custom | Custom-Mastering |
5. Merge. Combine the confirmed duplicates. | Merging | Custom | Custom-Mastering |
Perform other processes anywhere in the flow. | | Custom | Custom-Other |
The numbered processes in the preceding table must be performed in sequence, because each process depends on the output of the previous one. However, you can insert custom steps to perform additional processing on the data at the start of the flow, at the end of the flow, or between the numbered steps. You can also use interceptors and custom hooks to call custom modules before or after a step's core processes.
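The ordering rule can be illustrated with a minimal Python sketch. It is not part of Data Hub: the function name `core_steps_in_order` and the idea of representing a flow as a plain list of step-type names are assumptions made for illustration. The sketch checks only that the default step types from the table appear in their required relative order and ignores everything else, because custom steps may appear anywhere. The Model process has no step type, so it does not appear in the list.

```python
from typing import List

# Default step types from the table, in the order they must run.
CORE_ORDER = ["Ingestion", "Mapping", "Matching", "Merging"]


def core_steps_in_order(step_types: List[str]) -> bool:
    """Return True if the core steps appear in the required relative order.

    Hypothetical helper for illustration only. Any step type that is not in
    CORE_ORDER (for example "Custom") is skipped, since custom steps may sit
    at the start, at the end, or between the core steps.
    """
    positions = [CORE_ORDER.index(s) for s in step_types if s in CORE_ORDER]
    # Each core step must not come before an earlier-ranked core step.
    # Using <= means a repeated step type is tolerated.
    return all(a <= b for a, b in zip(positions, positions[1:]))
```

For example, `core_steps_in_order(["Ingestion", "Custom", "Mapping"])` returns `True`, while `core_steps_in_order(["Mapping", "Ingestion"])` returns `False` because mapping would run before any data is loaded.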
Choose Your Tools
You can choose from multiple tools to process your data. MarkLogic provides a graphical user interface (Hub Central), as well as Gradle tasks, Java APIs, a client JAR, REST APIs, and the MarkLogic Content Pump (MLCP).
Different tools work in different environments.
Environment | On-Premises | In Data Hub Service |
---|---|---|
Development and Test | Use Hub Central. | Use Hub Central. Flow and step configuration structures must be in the HC format. |
Production | Use Gradle, other non-GUI tools, or Hub Central. Flow and step configuration structures must be in the HC format. |
Tools and Workflow for Data Hub Service Environments
The Hub Central workflow:
Tools and Workflow for On-Premises Environments
The on-premises tools can be categorized into three tracks:
- The GUI track (recommended) provides intuitive graphical interfaces to work with your data. This track is intended for beginners and non-technical business users who need only the default functionality with minimal customization.
  - Hub Central provides much of the same functionality available in the non-GUI tools, and it supports viewing custom steps.
    Note: Hub Central does not support custom step creation.
- The command-line track helps you quickly automate your most common processes, bypassing the GUI. This track is intended for advanced users who need greater flexibility with customization and orchestration.
  - An extensive set of Gradle tasks is provided to automate the creation of artifacts and the execution of flows. Gradle is ideal in a continuous integration environment, where creation, testing, and execution are performed multiple times.
  - The executable Data Hub Client JAR is an ideal alternative for running a flow in a production environment where Gradle and the Data Hub project files are not available. It requires only a JVM.
- The programming track provides the APIs you can use to create apps that run flows to manage and use your data. This track is intended for advanced users who need greater flexibility with customization and orchestration.
  - The Data Hub Java API is provided for running flows in your own Java-based apps or in an external orchestration system that supports Java-based extensions.
  - MLCP and the Data Hub REST extensions provide alternative ways to ingest data into the STAGING database. You can also ingest directly into the FINAL database if you intend to serve the data without curation or other processing.
  - The REST Client API provides some record-management and job-information-retrieval capabilities.
You can switch between tracks or between tools for different tasks; however, switching from the command-line track to the GUI track might be less convenient because the GUI handles some processes automatically.
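As a rough illustration of the programming track, the following Python sketch builds the URL for writing a single document through the MarkLogic REST Client API document endpoint (`/v1/documents`) on the app server in front of the STAGING database. This is plain REST Client API usage, not the Data Hub REST ingestion extension or MLCP mentioned above. The helper name `staging_document_url`, the example URI and collection, and the port value 8010 are assumptions; check the ports configured for your own project.

```python
from typing import Iterable
from urllib.parse import urlencode

# Assumption: the STAGING app server listens on port 8010.
# Replace this with the port configured for your project.
STAGING_PORT = 8010


def staging_document_url(
    host: str,
    uri: str,
    collections: Iterable[str] = (),
    port: int = STAGING_PORT,
) -> str:
    """Build the URL for writing one document via /v1/documents.

    Hypothetical helper for illustration only. It only constructs the URL;
    it does not send a request.
    """
    # "uri" names the document; each "collection" adds it to a collection.
    params = [("uri", uri)] + [("collection", c) for c in collections]
    return f"http://{host}:{port}/v1/documents?{urlencode(params)}"


url = staging_document_url("localhost", "/customers/1.json", ["raw-customers"])
print(url)
```

To actually load a document, you would send an HTTP PUT to this URL with the document as the request body, using whatever authentication your app server requires. Sending the request is deliberately left out so that the sketch runs without a server.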
The following table organizes the tasks you can run with the tools in each track:
Task | GUI Track | Command-Line Track | Programming Track |
---|---|---|---|
Create Project | | Using Gradle | |
Set Security Credentials | | Using Gradle | |
Create Entity (required for mapping step) | | Using Gradle | |
Create Flow | | Using Gradle | |
Create Step | | | |
Create Mapping | Mapping Step | Manually | |
Manage Steps in a Flow | | Manually | |
Run Flow | | | |
Merge Records Outside a Flow | | Using Gradle | Using REST Client API |
Unmerge a Record Outside a Flow | | Using Gradle | Using REST Client API |
Deploy | | to Data Hub Service | |
Redeploy | | Using Gradle | |