Provenance and Lineage

Overview

In MarkLogic, provenance tracks the origin of the data, and lineage tracks its history. Provenance metadata is the combined set of provenance information and lineage information tracked by MarkLogic Data Hub. Provenance information is updated with every change made to a record, from ingestion through its lifetime in MarkLogic Server.

All provenance and lineage information is stored as XML documents (using the PROV XML schema) in the data-hub-JOBS database and is added to the protected collection http://marklogic.com/provenance-services/record. When provenance and lineage records are created, triples that define the relationships among the pieces of information are also generated.

You can view provenance information using the Query Console.
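As an alternative to Query Console, the provenance collection could also be listed over HTTP. The following is a minimal sketch, not a documented Data Hub procedure: it assumes the MarkLogic REST Client API search endpoint (/v1/search with its collection and database parameters) is reachable on an app server, and the host and port in the usage line are placeholders. It only builds the request URL; sending it would require credentials for a user with the ps-user role.

```python
from urllib.parse import urlencode

# Collection and database names are taken from the text above.
PROVENANCE_COLLECTION = "http://marklogic.com/provenance-services/record"
JOBS_DATABASE = "data-hub-JOBS"


def provenance_search_url(host, port, page_length=10):
    """Build a /v1/search URL that lists documents in the provenance collection.

    Assumption: host and port point at an app server exposing the
    MarkLogic REST Client API; they are supplied by the caller.
    """
    query = urlencode({
        "collection": PROVENANCE_COLLECTION,
        "database": JOBS_DATABASE,
        "pageLength": page_length,
        "format": "json",
    })
    return f"http://{host}:{port}/v1/search?{query}"


if __name__ == "__main__":
    # Placeholder host and port; substitute the values for your environment.
    print(provenance_search_url("localhost", 8013))
```

The URL is built with urlencode so that the collection URI is percent-encoded correctly instead of being concatenated by hand.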

Security

You need the following security roles to access provenance and lineage information:

Role: ps-user

Allowed to:

  • Execute processes that write provenance and lineage information.
  • Read provenance and lineage information.

The auto-generated users (data-hub-admin-user, flow-developer, and flow-operator) are assigned this role automatically.

Tip: Assign the ps-user role to a separate user or users that are intended solely for reading provenance records.

Role: ps-internal

Allowed to update provenance records. Users with admin roles have the same permissions.

Provenance Granularity

Data Hub provides three levels of granularity for provenance information: coarse (default), fine, and off.
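The level is selected with the provenanceGranularityLevel property. The sketch below shows one way to set and read that property on a step definition held as a Python dictionary. Placing the property under the step's "options" object is an assumption about the flow file layout, and the step name and helper function names are hypothetical; only the property name, its three values, and the coarse default come from this page.

```python
import json

# The three levels named in the text; "coarse" is the documented default.
VALID_LEVELS = ("coarse", "fine", "off")
DEFAULT_LEVEL = "coarse"


def set_provenance_level(step, level):
    """Return a copy of a step definition with the granularity level set.

    Assumption: the property lives in the step's "options" object.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown provenance granularity level: {level!r}")
    options = dict(step.get("options", {}))
    options["provenanceGranularityLevel"] = level
    return {**step, "options": options}


def effective_level(step):
    """Return the level a step would use, falling back to the default."""
    return step.get("options", {}).get("provenanceGranularityLevel", DEFAULT_LEVEL)


if __name__ == "__main__":
    # Hypothetical step definition used only for illustration.
    step = {"name": "map-customers", "options": {}}
    print(json.dumps(set_provenance_level(step, "fine"), indent=2))
```

Rejecting unknown values early avoids a misspelled level silently behaving like the default.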

Only a Jobs document is created. No other provenance or lineage information is tracked.

In a mapping step:

"provenanceGranularityLevel" : "coarse"
  Document-level provenance information is tracked.

"provenanceGranularityLevel" : "fine"
  Property-level provenance information is also tracked. In a mapping step, provenance information includes every entity property and the XPath of the source field mapped to it.

"provenanceGranularityLevel" : "off"
  Provenance information for the current flow or step is not stored.
  CAUTION: Do not turn off provenance unless you are certain the project will never make use of provenance information.

The set of provenance information is not customizable.

In a matching, merging, or mastering step:

"provenanceGranularityLevel" : "coarse"
  Document-level provenance information is tracked.

"provenanceGranularityLevel" : "fine"
  Property-level provenance information is also tracked. In a matching, merging, or mastering step, additional provenance information is tracked.

"provenanceGranularityLevel" : "off"
  Provenance information for the current flow or step is not stored.
  CAUTION: Do not turn off provenance unless you are certain the project will never make use of provenance information.

The set of provenance information is not customizable.

Regardless of the value of provenanceGranularityLevel:

  • Every merged record contains provenance information from all its source records. If provenanceGranularityLevel is coarse or fine, the merged record also contains the provenance information for the mastering step run.
  • The mastering summary is created as part of a mastering step or a merging step, but not a matching step.

Only a Jobs document is created. No other provenance or lineage information is tracked.

In your custom step module, you can add code to generate document-level and property-level provenance. See Editing a Custom Step Module.

"provenanceGranularityLevel" : "coarse"
  Document-level provenance information is tracked.

"provenanceGranularityLevel" : "off"
  Provenance information for the current flow or step is not stored.
  CAUTION: Do not turn off provenance unless you are certain the project will never make use of provenance information.

The set of document-level provenance information is not customizable.
The set of property-level provenance information is customizable for custom steps.

Even if provenance tracking is turned off for the flow or the step, previously collected provenance information is retained. Database administrator permissions are required to delete existing provenance information.