Data Hub Extensions to the REST Client API
This page lists the Data Hub extensions to the MarkLogic REST Client API.
Administration
Returns the version of Data Hub installed in your MarkLogic Server instance.
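For example, a request might look like the following sketch. The resource name `mlHubversion`, the host and port, and the credentials are assumptions; substitute the values for your deployment:
```
curl --digest -u admin:password -X GET \
  "http://localhost:8010/v1/resources/mlHubversion"
```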
Returns `true` if debugging is currently enabled for Data Hub in the MarkLogic Server instance; otherwise, `false`.
- rs:enable
- (Required) If `true`, enables debugging for Data Hub in the MarkLogic Server instance; otherwise, debugging is disabled. The default is `false`.
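For example, to enable debugging (a sketch; the resource name `mlDebug` and the credentials are assumptions):
```
curl --digest -u admin:password -X POST \
  "http://localhost:8010/v1/resources/mlDebug?rs:enable=true"
```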
Record Management
MarkLogic Data Hub provides REST Client API extensions that allow you to match and merge/unmerge records programmatically without running a flow.
Compares the specified record with other records and returns the list of possible matches.
- rs:uri
- (Required) The URI of the record to compare with other records.
- rs:flowName
- (Required) The name of a flow that includes a mastering step.
- rs:step
- The step number of the mastering step in the specified flow. This task uses the settings in the mastering step. The default is 1, which assumes that the first step in the flow is a mastering step.
- rs:includeMatchDetails
- If `true`, additional information about each positive match is provided. The default is `false`.
- rs:start
- The index of the first match to return. The default is 1.
- rs:pageLength
- The number of matches to return. The default is 20.
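For example (a sketch; the resource name `mlSmMatch`, the credentials, and the flow and record names are assumptions; substitute the values for your deployment):
```
curl --digest -u admin:password -X GET \
  "http://localhost:8010/v1/resources/mlSmMatch?rs:uri=/customers/cust1.json&rs:flowName=CurateCustomers&rs:step=2&rs:includeMatchDetails=true"
```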
Merges the specified records according to the settings of the specified mastering step.
- rs:uri
- (Required) The URI of one of the records to merge. You must specify at least two URIs; to specify additional URIs, repeat the parameter.
- rs:flowName
- (Required) The name of a flow that includes a mastering step.
- rs:step
- The step number of the mastering step in the specified flow. This task uses the settings in the mastering step. The default is 1, which assumes that the first step in the flow is a mastering step.
- rs:preview
- If `true`, no changes are made to the database and a simulated merged record is returned; otherwise, the merged record is saved to the database. The default is `false`.
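For example, to preview a merge of two records (a sketch; the resource name `mlSmMerge`, the credentials, and the flow and record names are assumptions):
```
curl --digest -u admin:password -X POST \
  "http://localhost:8010/v1/resources/mlSmMerge?rs:uri=/customers/cust1.json&rs:uri=/customers/cust2.json&rs:flowName=CurateCustomers&rs:step=2&rs:preview=true"
```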
Reverses the set of merges that created the specified merged record.
- rs:mergeURI
- (Required) The URI of the record to unmerge.
- rs:retainAuditTrail
- If `true`, the merged record will be moved to an archive collection; otherwise, it will be deleted. The default is `true`.
- rs:blockFutureMerges
- If `true`, the component records will be blocked from being merged together again. The default is `true`.
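For example (a sketch only: it assumes unmerging is exposed as a DELETE on the merge resource `mlSmMerge`; verify the method, resource name, and merged-record URI against your installation):
```
curl --digest -u admin:password -X DELETE \
  "http://localhost:8010/v1/resources/mlSmMerge?rs:mergeURI=/merged/abc123.json&rs:retainAuditTrail=true&rs:blockFutureMerges=true"
```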
Returns the list of notifications about matches that are close to but did not exceed the merging threshold.
- rs:start
- The index of the first notification to return. The default is 1.
- rs:pageLength
- The number of notifications to return. The default is 10.
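For example (a sketch; the resource name `mlSmNotifications` and the credentials are assumptions):
```
curl --digest -u admin:password -X GET \
  "http://localhost:8010/v1/resources/mlSmNotifications?rs:start=1&rs:pageLength=10"
```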
Returns specific values from the list of notifications about matches that are close to but did not exceed the merging threshold.
- rs:start
- The index of the first notification to return.
- rs:pageLength
- The number of notifications to return.
{ "key": "fieldname" }
{ ...
extractions: { "/path-to-uri.xml": { "key": "value-of-field" } }
}
/uri1.xml
contains:
<Person>
<PersonFirstName>Bob</PersonFirstName>
<PersonLastName>Smith</PersonLastName>
</Person>
And the body of the POST request contains:
{ "firstName": "PersonFirstName" }
The results include an extractions
node as follows:
{ ...
extractions: {
"/uri1.xml": { "firstName": "Bob" }
}
}
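A full request for this example might look like the following sketch (the resource name `mlSmNotifications` and the credentials are assumptions):
```
curl --digest -u admin:password -X POST \
  -H "Content-type: application/json" \
  -d '{"firstName": "PersonFirstName"}' \
  "http://localhost:8010/v1/resources/mlSmNotifications?rs:start=1&rs:pageLength=10"
```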
Updates the read/unread status of the specified notifications.
- rs:uris
- An array of strings containing the URIs of the notifications to update.
- rs:status
- The new status of the notifications. Valid values are `read` and `unread`.
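For example (a sketch; the resource name `mlSmNotifications`, the notification URIs, and the credentials are assumptions, and it assumes multiple `rs:uris` values are passed by repeating the parameter; verify against your installation):
```
curl --digest -u admin:password -X PUT \
  "http://localhost:8010/v1/resources/mlSmNotifications?rs:uris=/notifications/1234.xml&rs:uris=/notifications/5678.xml&rs:status=read"
```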
Deletes the specified notification.
- rs:uri
- The URI of the notification to delete.
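For example (a sketch; the resource name `mlSmNotifications`, the notification URI, and the credentials are assumptions):
```
curl --digest -u admin:password -X DELETE \
  "http://localhost:8010/v1/resources/mlSmNotifications?rs:uri=/notifications/1234.xml"
```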
Returns the document-level history of the specified merged record.
- rs:uri
- (Required) The URI of a merged record.
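For example (a sketch; the resource name `mlSmHistoryDocument`, the record URI, and the credentials are assumptions):
```
curl --digest -u admin:password -X GET \
  "http://localhost:8010/v1/resources/mlSmHistoryDocument?rs:uri=/merged/abc123.json"
```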
Returns the history of the specified property or all properties of a merged record.
- rs:uri
- (Required) The URI of a merged record.
- rs:property
- The name of the specific property. The default is all properties.
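For example, to get the history of a single property (a sketch; the resource name `mlSmHistoryProperties`, the record URI, the property name, and the credentials are assumptions):
```
curl --digest -u admin:password -X GET \
  "http://localhost:8010/v1/resources/mlSmHistoryProperties?rs:uri=/merged/abc123.json&rs:property=firstName"
```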
"provenanceGranularityLevel" : "fine"
. See Set Provenance Granularity ManuallyJob Management
Returns job information based on the specified parameters.
- rs:jobid
- The unique ID of the job. Used to return the job document associated with the specified job ID. You can specify either jobid or status, but not both.
- rs:status
- The status of the job: `started`, `finished`, `finished_with_errors`, `running`, `failed`, `stop-on-error`, or `canceled`. Used to return the list of job documents associated with all jobs with the specified status. You can specify either jobid or status, but not both.
- rs:flowNames
- The name of the flow. Used to return the job ID and job information of the latest run that includes the specified flow name. To specify additional flow names, repeat the parameter.
- rs:flow-name
- The name of the flow. Used to return the list of job documents associated with all runs that include the specified flow name.
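For example, to return all finished jobs (a sketch; the resource name `mlJobs` and the credentials are assumptions):
```
curl --digest -u admin:password -X GET \
  "http://localhost:8010/v1/resources/mlJobs?rs:status=finished"
```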
Data Hub 5.5.0 enables enterprise-wide monitoring tools, such as Amazon CloudWatch and Splunk, to debug and analyze data pipelines across architectures. Data Hub lets you surface the results of Data Hub pipeline and activity executions in these tools.
To do so, send a GET request to the hubJobs REST extension with parameters that identify the jobs of interest: either a job ID, or a combination of status, flow names, start time (range), end time (range), step name, and step type.
You receive the job data filtered by the request parameters, with all of the attributes present in the jobs schema, including the duration of each step.
The default pageLength caps the number of results returned per page (100).
The hubJobs REST endpoint returns Job documents.
For details on the existing mlJobs REST extension, see above.
Parameters for querying /v1/resources/hubJobs
| Parameter Name | Description |
| --- | --- |
| rs:start | The page to start results from. (Default: 1) |
| rs:pageLength | The maximum number of jobs to return on a page. (Default: 100) |
| rs:jobId | The ID of the job. |
| rs:jobStatus | The status of the jobs to return. One of: started, finished, finished_with_errors, running, failed, stop-on-error, canceled. |
| rs:user | The username of the user that ran the job. |
| rs:flowName | The name of the flow run by the job. |
| rs:stepName | The name of a step run by the job. |
| rs:stepDefinitionType | The step definition type of a step run by the job. One of: ingestion, mapping, matching, merging, custom. |
| rs:startTimeBegin | The time, in ISO dateTime format, that a step run's startTime must be greater than or equal to. |
| rs:startTimeEnd | The time, in ISO dateTime format, that a step run's startTime must be less than or equal to. |
| rs:endTimeBegin | The time, in ISO dateTime format, that a step run's endTime must be greater than or equal to. |
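For example, to return finished runs of a given flow that started on or after a given time (a sketch; the flow name, time, and credentials are placeholders):
```
curl --digest -u admin:password -X GET \
  "http://localhost:8010/v1/resources/hubJobs?rs:flowName=CurateCustomers&rs:jobStatus=finished&rs:startTimeBegin=2021-01-01T00:00:00Z&rs:pageLength=50"
```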
Returns the batch documents for the specified step or batch within the specified job.
- rs:jobid
- (Required) The unique ID of the job.
- rs:step
- The sequence number of the step whose batch documents to return. You must specify either step or batchid, but not both.
- rs:batchid
- The ID of the batch whose documents to return. You must specify either step or batchid, but not both.
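For example, to return the batch documents for the first step of a job (a sketch; the resource name `mlBatches`, the job ID, and the credentials are assumptions):
```
curl --digest -u admin:password -X GET \
  "http://localhost:8010/v1/resources/mlBatches?rs:jobid=abc-123&rs:step=1"
```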
Transforms
Ingests your data using the specified ingestion step in the specified flow.
- flow-name
- (Optional) The name of the flow. The default is `default-ingestion`.
- step
- (Optional if `flow-name` is not specified) The sequence number of the ingestion step to execute. If `flow-name` is not specified, the default is 1.
- job-id
- (Optional) A unique job ID to associate with the flow run. This option can be used if the flow run is part of a larger process (e.g., a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID will be assigned.
- options
- (Optional) A JSON object containing additional options. To add the CSV filename to the header of the envelope, include `"inputFileType" : "csv"` in the JSON object. To override the values of step properties at runtime, include `"propertyname" : "valueToUse"` in the JSON object. The step properties whose values can be overridden at runtime are: `outputFormat` with a value of `text`, `json`, `xml`, or `binary`; `provenanceGranularityLevel` with a value of `coarse`, `fine`, or `off`; and `disableJobOutput` with a value of `true` or `false`.
In most cases, you can use the default flow `default-ingestion`. Note that the `permissions` and `collections` properties are not set in the resulting documents; therefore, you must set these properties using a custom step after ingestion.
Example:
```
http://localhost:8010/v1/documents?transform=mlRunIngest&
trans:flow-name=CurateCustomerJSON&
trans:step=1&
trans:options={"headers":{"sources":[{"name":"loadCustomersJSON"}],"createdOn":"currentDateTime","createdBy":"currentUser"},"outputFormat":"xml"}
```
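A full ingestion call applies the transform while writing a document. The following sketch assumes the flow shown above; the document URI, payload, and credentials are placeholders:
```
curl --digest -u admin:password -X PUT \
  -H "Content-type: application/json" \
  -d '{"customerId": 1, "name": "Bob Smith"}' \
  "http://localhost:8010/v1/documents?uri=/customers/cust1.json&transform=mlRunIngest&trans:flow-name=CurateCustomerJSON&trans:step=1"
```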