Data Hub Extensions to the REST Client API

This page provides the list of Data Hub REST Client APIs that extend the MarkLogic REST Client API.

Administration

mlHubversion (GET)

Returns the version of the Data Hub installed in your MarkLogic Server instance.

GET /v1/resources/mlHubversion
mlDebug (GET)

Returns true if debugging is currently enabled for Data Hub in the MarkLogic Server instance; otherwise, false.

GET /v1/resources/mlDebug
mlDebug (POST)

Enables or disables debugging for Data Hub in the MarkLogic Server instance.

POST /v1/resources/mlDebug?rs:enable=[true|false]
rs:enable
(Required) If true, enables debugging for Data Hub in the MarkLogic Server instance; otherwise, debugging is disabled. The default is false.
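As a rough illustration, the mlDebug POST URL can be assembled with the Python standard library. The host and port (localhost:8010, taken from the mlRunIngest example on this page) and the function name debug_url are assumptions of this sketch. It only builds the URL; sending the request and authenticating are left out.

```python
from urllib.parse import urlencode

# Assumption: host and port of the Data Hub app server; adjust for your setup.
BASE_URL = "http://localhost:8010"


def debug_url(enable: bool) -> str:
    """Build the mlDebug POST URL that turns Data Hub debugging on or off."""
    # safe=":" keeps the "rs:" prefix readable instead of encoding it as %3A.
    query = urlencode({"rs:enable": "true" if enable else "false"}, safe=":")
    return f"{BASE_URL}/v1/resources/mlDebug?{query}"


print(debug_url(True))
```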

Record Management

MarkLogic Data Hub provides a REST Client API extension which allows you to match and merge/unmerge records programmatically without running a flow.

mlSmMatch (POST)

Compares the specified record with other records and returns the list of possible matches.

POST /v1/resources/mlSmMatch?rs:uri=URIofFocusRecord&rs:flowName=YourFlowName&rs:step=1&rs:includeMatchDetails=[true|false]&rs:start=1&rs:pageLength=10
rs:uri
(Required) The URI of the record to compare with other records.
rs:flowName
(Required) The name of a flow that includes a mastering step.
rs:step
The step number of the mastering step in the specified flow. This task uses the settings in the mastering step. The default is 1, which assumes that the first step in the flow is a mastering step.
rs:includeMatchDetails
If true, additional information about each positive match is provided. The default is false.
rs:start
The index of the first match to return. The default is 1.
rs:pageLength
The number of matches to return. The default is 20.
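The mlSmMatch parameters above could be combined into a request URL along these lines. This is a minimal sketch: the base URL, the function name match_url, and the sample record URI and flow name are hypothetical, and the defaults simply mirror the ones documented above.

```python
from urllib.parse import urlencode

# Assumption: host and port of the Data Hub app server.
BASE_URL = "http://localhost:8010"


def match_url(uri: str, flow_name: str, step: int = 1,
              include_match_details: bool = False,
              start: int = 1, page_length: int = 20) -> str:
    """Build the mlSmMatch POST URL for one focus record."""
    params = {
        "rs:uri": uri,
        "rs:flowName": flow_name,
        "rs:step": step,
        "rs:includeMatchDetails": str(include_match_details).lower(),
        "rs:start": start,
        "rs:pageLength": page_length,
    }
    # safe=":/" leaves the "rs:" prefix and URI slashes unescaped.
    return f"{BASE_URL}/v1/resources/mlSmMatch?" + urlencode(params, safe=":/")


# Hypothetical record URI and flow name, for illustration only.
print(match_url("/customers/1.json", "CustomerFlow", step=2))
```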
mlSmMerge (POST)

Merges the specified records according to the settings of the specified mastering step.

POST /v1/resources/mlSmMerge?rs:uri=URI1&rs:uri=URI2&rs:uri=URIn&rs:flowName=YourFlowName&rs:step=1&rs:preview=[true|false]
rs:uri
(Required) The URI of one of the records to merge. You must specify at least two URIs.
rs:flowName
(Required) The name of a flow that includes a mastering step.
rs:step
The step number of the mastering step in the specified flow. This task uses the settings in the mastering step. The default is 1, which assumes that the first step in the flow is a mastering step.
rs:preview
If true, no changes are made to the database and a simulated merged record is returned; otherwise, the merged record is saved to the database. The default is false.
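Because rs:uri is repeated once per record, a list of key/value pairs is a convenient way to build the mlSmMerge query string. The sketch below assumes a base URL and uses a hypothetical helper name, merge_url; it checks the two-URI minimum described above and defaults to the documented values for step and preview.

```python
from urllib.parse import urlencode

# Assumption: host and port of the Data Hub app server.
BASE_URL = "http://localhost:8010"


def merge_url(uris: list, flow_name: str, step: int = 1,
              preview: bool = False) -> str:
    """Build the mlSmMerge POST URL for two or more record URIs."""
    if len(uris) < 2:
        raise ValueError("mlSmMerge needs at least two record URIs")
    # One ("rs:uri", value) pair per record, so the parameter repeats.
    params = [("rs:uri", uri) for uri in uris]
    params += [
        ("rs:flowName", flow_name),
        ("rs:step", step),
        ("rs:preview", str(preview).lower()),
    ]
    return f"{BASE_URL}/v1/resources/mlSmMerge?" + urlencode(params, safe=":/")


# Hypothetical URIs and flow name; preview=True asks for a simulated merge.
print(merge_url(["/a.json", "/b.json"], "CustomerFlow", preview=True))
```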
mlSmMerge (DELETE)

Reverses the set of merges that created the specified merged record.

DELETE /v1/resources/mlSmMerge?rs:mergeURI=URIofMergedRecord&rs:retainAuditTrail=[true|false]&rs:blockFutureMerges=[true|false]
rs:mergeURI
(Required) The URI of the record to unmerge.
rs:retainAuditTrail
If true, the merged record will be moved to an archive collection; otherwise, it will be deleted. The default is true.
rs:blockFutureMerges
If true, the component records will be blocked from being merged together again. The default is true.
Note: This task archives or deletes the specified merged record and unarchives the component records that were combined to create it. If one of the component records is itself a merged record, the component record will remain so.
mlSmNotifications (GET)

Returns the list of notifications about matches that are close to but did not exceed the merging threshold.

GET /v1/resources/mlSmNotifications?rs:start=1&rs:pageLength=10
rs:start
The index of the first notification to return. The default is 1.
rs:pageLength
The number of notifications to return. The default is 10.
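Since rs:start is the index of the first notification rather than a page number, paging through notifications takes a small calculation. The following sketch shows one way to derive rs:start for a given page; the base URL and the function name notifications_page_url are assumptions.

```python
from urllib.parse import urlencode

# Assumption: host and port of the Data Hub app server.
BASE_URL = "http://localhost:8010"


def notifications_page_url(page: int, page_length: int = 10) -> str:
    """Build the mlSmNotifications GET URL for a 1-based page number."""
    if page < 1 or page_length < 1:
        raise ValueError("page and page_length must be at least 1")
    # Page 1 starts at index 1, page 2 at page_length + 1, and so on.
    start = (page - 1) * page_length + 1
    query = urlencode({"rs:start": start, "rs:pageLength": page_length}, safe=":")
    return f"{BASE_URL}/v1/resources/mlSmNotifications?{query}"


print(notifications_page_url(3))
```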
mlSmNotifications (POST)

Returns specific values from the list of notifications about matches that are close to but did not exceed the merging threshold.

POST /v1/resources/mlSmNotifications?rs:start=1&rs:pageLength=10
rs:start
The index of the first notification to return.
rs:pageLength
The number of notifications to return.
The body of the request must contain a JSON object that specifies the notification values to return. Format:
   { "key": "fieldname" }
Returns the specified values in the format:
   { ...
    extractions: { "/path-to-uri.xml": { "key": "value-of-field" } }
  }
Example: If the notification /uri1.xml contains:
   <Person>
    <PersonFirstName>Bob</PersonFirstName>
    <PersonLastName>Smith</PersonLastName>
  </Person>
And the body of the POST request contains:
   { "firstName": "PersonFirstName" }
The results include an extractions node as follows:
   { ...
    extractions: {
      "/uri1.xml": { "firstName": "Bob" }
    }
  }
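To make the key-to-field mapping concrete, the sketch below reproduces the shape of the extractions node locally for the example above. It is only a model of the documented result, not the server's implementation; the function name extract_values is hypothetical, and looking up fields by element name is an assumption about how field names are resolved.

```python
import xml.etree.ElementTree as ET


def extract_values(docs: dict, key_to_field: dict) -> dict:
    """Model the 'extractions' node: map each URI to {key: field value}."""
    extractions = {}
    for uri, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        values = {}
        for key, field in key_to_field.items():
            # Assumption: a field name matches a descendant element name.
            node = root.find(f".//{field}")
            if node is not None:
                values[key] = node.text
        extractions[uri] = values
    return {"extractions": extractions}


person = (
    "<Person>"
    "<PersonFirstName>Bob</PersonFirstName>"
    "<PersonLastName>Smith</PersonLastName>"
    "</Person>"
)
print(extract_values({"/uri1.xml": person}, {"firstName": "PersonFirstName"}))
```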
mlSmNotifications (PUT)

Updates the status of the specified notifications.

PUT /v1/resources/mlSmNotifications?rs:uris=array-of-URIs-of-notifications-to-update&rs:status=read
rs:uris
An array of strings containing the URIs of notifications to update.
rs:status
The new status of the notifications. Valid values are read and unread.
mlSmNotifications (DELETE)

Deletes the specified notification.

DELETE /v1/resources/mlSmNotifications?rs:uri=uri-of-notification-to-delete
rs:uri
The URI of the notification to delete.
mlSmHistoryDocument (GET)

Returns the document-level history of the specified merged record.

GET /v1/resources/mlSmHistoryDocument?rs:uri=URIofMergedRecord
rs:uri
(Required) The URI of a merged record.
mlSmHistoryProperties (GET)

Returns the history of the specified property or all properties of a merged record.

GET /v1/resources/mlSmHistoryProperties?rs:uri=URIofMergedRecord&rs:property=YourPropertyName
rs:uri
(Required) The URI of a merged record.
rs:property
The name of the specific property. The default is all properties.
Note: Only document-level provenance is tracked by default. To track property-level provenance, you must set "provenanceGranularityLevel" : "fine". See Set Provenance Granularity Manually.

Job Management

mlJobs (GET)

Returns job information based on the specified parameters.

GET /v1/resources/mlJobs?rs:jobid=YourJobID&rs:status=&rs:flowNames=YourFlowName&rs:flow-name=YourFlowName
rs:jobid
A unique job ID to associate with the flow run. This option can be used if the flow run is part of a larger process (e.g., a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID will be assigned. Used to return the job document associated with the specified job ID. You can specify either jobid or status, but not both.
rs:status
The status of the job: started, finished, finished_with_errors, running, failed, stop-on-error, or canceled. Used to return the list of job documents associated with all jobs with the specified status. You can specify either jobid or status, but not both.
rs:flowNames
The name of the flow. Used to return the job ID and job information of the latest run that includes the specified flow name. To specify additional flow names, repeat the parameter.
rs:flow-name
The name of the flow. Used to return the list of job documents associated with all runs that include the specified flow name.
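The "jobid or status, but not both" rule for mlJobs can be checked on the client before a request is sent. The sketch below is one possible way to do that: the base URL and the function name jobs_url are assumptions, and requiring exactly one of the two parameters is a simplification made by this example, not a documented constraint.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: host and port of the Data Hub app server.
BASE_URL = "http://localhost:8010"

# Status values listed for rs:status above.
JOB_STATUSES = {
    "started", "finished", "finished_with_errors", "running",
    "failed", "stop-on-error", "canceled",
}


def jobs_url(jobid: Optional[str] = None, status: Optional[str] = None) -> str:
    """Build an mlJobs GET URL from either a job ID or a status."""
    if jobid is not None and status is not None:
        raise ValueError("specify either jobid or status, not both")
    if jobid is None and status is None:
        # Simplification of this sketch: insist on one of the two.
        raise ValueError("specify jobid or status")
    if status is not None and status not in JOB_STATUSES:
        raise ValueError(f"unknown job status: {status}")
    params = {"rs:jobid": jobid} if jobid is not None else {"rs:status": status}
    return f"{BASE_URL}/v1/resources/mlJobs?" + urlencode(params, safe=":")


print(jobs_url(status="failed"))
```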

Data Hub 5.5.0 lets enterprise-wide monitoring tools, such as AWS CloudWatch and Splunk, debug and analyze data pipelines across architectures. You can view the results of Data Hub pipeline and activity runs in these tools.

To do so, send a GET request to the REST-based Data Hub extension for jobs data. Identify the jobs by job ID, or by status and flow names, and optionally filter by start time range, end time range, step name, and step type.

The response contains the job data that matches the request parameters, with all the required attributes present in the jobs schema, including the duration of each step.

The default pageLength caps the number of results returned per page (default: 100).

The hubJobs REST endpoint returns job documents.

For details on the existing mlJobs REST extension, see:

https://docs.marklogic.com/datahub/5.4/tools/rest/rest-extensions.html#rest-extensions__section-job-management

Parameters for querying /v1/resources/hubJobs

rs:start
The page to start results from. (Default: 1)
rs:pageLength
The maximum number of jobs to return on a page. (Default: 100)
rs:jobId
The ID of the job.
rs:jobStatus
The status of the jobs to return. One of the following: started, finished, finished_with_errors, running, failed, stop-on-error, canceled.
rs:user
The username of the user who ran the job.
rs:flowName
The name of the flow run by the job.
rs:stepName
The name of a step that is run by a job.
rs:stepDefinitionType
The step definition type of a step run by the job. One of the following: ingestion, mapping, matching, merging, custom.
rs:startTimeBegin
The time, in ISO dateTime format, that the startTime of a step run must be greater than or equal to.
rs:startTimeEnd
The time, in ISO dateTime format, that the startTime of a step run must be less than or equal to.
rs:endTimeBegin
The time, in ISO dateTime format, that the endTime of a step run must be greater than or equal to.
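A monitoring script might assemble a hubJobs query from a handful of these filters, as in the sketch below. The base URL, the function name hub_jobs_url, and the sample flow name and date are assumptions for illustration; only parameters listed above are used, and filters that are not supplied are left out of the query string.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

# Assumption: host and port of the Data Hub app server.
BASE_URL = "http://localhost:8010"


def hub_jobs_url(flow_name: Optional[str] = None,
                 job_status: Optional[str] = None,
                 start_time_begin: Optional[datetime] = None,
                 start_time_end: Optional[datetime] = None,
                 start: int = 1, page_length: int = 100) -> str:
    """Build a hubJobs GET URL, skipping filters that were not supplied."""
    candidates = {
        "rs:start": start,
        "rs:pageLength": page_length,
        "rs:flowName": flow_name,
        "rs:jobStatus": job_status,
        "rs:startTimeBegin": start_time_begin,
        "rs:startTimeEnd": start_time_end,
    }
    params = {}
    for name, value in candidates.items():
        if value is None:
            continue
        # Datetimes are sent as ISO 8601 strings; "+" is escaped as %2B.
        params[name] = value.isoformat() if isinstance(value, datetime) else value
    return f"{BASE_URL}/v1/resources/hubJobs?" + urlencode(params, safe=":")


# Hypothetical filter: runs of one flow with steps started on or after 2023-01-01 UTC.
since = datetime(2023, 1, 1, tzinfo=timezone.utc)
print(hub_jobs_url(flow_name="CurateCustomerJSON", start_time_begin=since))
```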

mlBatches (GET)

Returns the batch documents for the specified step or batch within the specified job.

GET /v1/resources/mlBatches?rs:jobid=YourJobID&rs:step=1&rs:batchid=YourBatchID
rs:jobid
(Required) A unique job ID to associate with the flow run. This option can be used if the flow run is part of a larger process (e.g., a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID will be assigned.
rs:step
(Required) The sequence number of the step whose batch documents to return. You must specify either step or batchid, but not both.
rs:batchid
(Required) The ID of the batch whose documents to return. You must specify either step or batchid, but not both.

Transforms

mlRunIngest (POST)

Ingests your data using the specified ingestion step in the specified flow.

POST /v1/documents?transform=mlRunIngest&trans:flow-name=YourFlowName&trans:step=1&trans:job-id=YourJobID&trans:options={}
flow-name
(Optional) The name of the flow. The default is default-ingestion.
step
(Optional if flow-name is not specified) The sequence number of the ingestion step to execute. If flow-name is not specified, the default is 1.
job-id
(Optional) A unique job ID to associate with the flow run. This option can be used if the flow run is part of a larger process (e.g., a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID will be assigned.
options
(Optional) A JSON object containing additional options.

To add the CSV filename to the header of the envelope, include "inputFileType" : "csv" in the JSON object.

To override the values of step properties at runtime, include "propertyname" : "valueToUse" in the JSON object.

The step properties whose values can be overridden at runtime are:

  • outputFormat with a value of text, json, xml, or binary.
  • provenanceGranularityLevel with a value of coarse, fine, or off.
  • disableJobOutput with a value of true or false.

In most cases, you can use the default flow default-ingestion.

Note: If you use curl to call mlRunIngest, the permissions and collections properties are not set in the resulting documents. Therefore, you must set these properties using a custom step after ingestion.

Example:

   http://localhost:8010/v1/documents?transform=mlRunIngest&
    trans:flow-name=CurateCustomerJSON&
    trans:step=1&
    trans:options={"headers":{"sources":[{"name":"loadCustomersJSON"}],"createdOn":"currentDateTime","createdBy":"currentUser"},"outputFormat":"xml"}
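When a URL like this is built in code, the JSON value of trans:options generally needs to be URL-encoded. The sketch below shows one way to do that with the Python standard library; the base URL matches the example above, the function name ingest_url is hypothetical, and the options shown are a trimmed version of the example's JSON. It only builds the URL and does not send any document.

```python
import json
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Host and port as used in the example above; adjust for your setup.
BASE_URL = "http://localhost:8010"


def ingest_url(flow_name: str, step: int, options: dict,
               job_id: Optional[str] = None) -> str:
    """Build a /v1/documents URL that applies the mlRunIngest transform."""
    params = {
        "transform": "mlRunIngest",
        "trans:flow-name": flow_name,
        "trans:step": step,
    }
    if job_id is not None:
        params["trans:job-id"] = job_id
    # Serialize the options compactly; urlencode escapes braces and quotes.
    params["trans:options"] = json.dumps(options, separators=(",", ":"))
    return f"{BASE_URL}/v1/documents?" + urlencode(params, safe=":")


options = {
    "headers": {"sources": [{"name": "loadCustomersJSON"}]},
    "outputFormat": "xml",
}
url = ingest_url("CurateCustomerJSON", 1, options)
print(url)

# Round trip: decoding the query string yields the original options.
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded["trans:options"][0]) == options)
```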

See Ingestion Step Settings and Provenance and Lineage.