Provenance and Lineage

Overview

In MarkLogic, provenance identifies where data originated, and lineage records the history of that data. Provenance metadata is the combined set of provenance and lineage information tracked by MarkLogic Data Hub. It is updated with every change made to a record, from ingestion through the record's lifetime in MarkLogic Server.

All provenance and lineage information is stored as XML documents (using the PROV XML schema) in the data-hub-JOBS database and is added to the protected collection http://marklogic.com/provenance-services/record. When provenance and lineage records are created, triples that define the relationships among the pieces of information are also generated.
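To make the PROV XML storage format more concrete, the following minimal sketch reads a derivation relationship out of a PROV XML document using only the Python standard library. The sample document, its ex: identifiers, and the helper name derivations are hypothetical and chosen for illustration; the records that Data Hub actually writes may contain different elements and attributes.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C PROV-XML schema.
PROV_NS = "http://www.w3.org/ns/prov#"

# Hypothetical sample record; real Data Hub records may look different.
SAMPLE = """<prov:document xmlns:prov="http://www.w3.org/ns/prov#"
                          xmlns:ex="http://example.org/">
  <prov:entity prov:id="ex:customer-1-mapped"/>
  <prov:entity prov:id="ex:customer-1-raw"/>
  <prov:wasDerivedFrom>
    <prov:generatedEntity prov:ref="ex:customer-1-mapped"/>
    <prov:usedEntity prov:ref="ex:customer-1-raw"/>
  </prov:wasDerivedFrom>
</prov:document>"""


def prov(name):
    """Return the namespace-qualified ElementTree name for a PROV term."""
    return "{%s}%s" % (PROV_NS, name)


def derivations(xml_text):
    """Return (generated, used) identifier pairs, one per wasDerivedFrom."""
    root = ET.fromstring(xml_text)
    pairs = []
    for relation in root.iter(prov("wasDerivedFrom")):
        generated = relation.find(prov("generatedEntity"))
        used = relation.find(prov("usedEntity"))
        if generated is not None and used is not None:
            pairs.append((generated.get(prov("ref")), used.get(prov("ref"))))
    return pairs


if __name__ == "__main__":
    for generated, used in derivations(SAMPLE):
        print(generated, "was derived from", used)
```

Each pair corresponds to one lineage step: the first identifier names the entity that was produced, and the second names the entity it was produced from.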

You can view provenance information using the Query Console.
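As an alternative illustration of where the records live, the sketch below builds a MarkLogic REST API search URL that targets the provenance collection in the data-hub-JOBS database. The host, port, and page length are assumptions for a local installation, and the function name provenance_search_url is hypothetical. The code only constructs the URL; sending the request would also require credentials for a user with the appropriate role.

```python
from urllib.parse import urlencode

# Collection and database names as given in this section.
PROVENANCE_COLLECTION = "http://marklogic.com/provenance-services/record"
JOBS_DATABASE = "data-hub-JOBS"


def provenance_search_url(host="localhost", port=8000, page_length=10):
    """Build a /v1/search URL restricted to the provenance collection.

    host, port, and page_length are assumed values; adjust them to match
    the app server in your environment.
    """
    params = {
        "collection": PROVENANCE_COLLECTION,
        "database": JOBS_DATABASE,
        "format": "json",
        "pageLength": page_length,
    }
    return "http://%s:%d/v1/search?%s" % (host, port, urlencode(params))


if __name__ == "__main__":
    print(provenance_search_url())
```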

Security

You need the following security roles to access provenance and lineage information:

Role: ps-user
Description: Allowed to:

  • Execute processes that write provenance and lineage information.
  • Read provenance and lineage information.

The auto-generated users (data-hub-admin-user, flow-developer, and flow-operator) are assigned this role automatically.

Tip: Assign the ps-user role to a separate user or users that are intended solely for reading provenance records.

Role: ps-internal
Description: Allowed to update provenance records. Users with admin roles also have the same permissions.

Provenance Granularity

Data Hub provides three levels of granularity for provenance information: coarse (default), fine, and off.

When the granularity is off, only a Jobs document is created. No other provenance or lineage information is tracked.

Even if provenance tracking is turned off for the flow or the step, previously collected provenance information is retained. Database administrator permissions are required to delete existing provenance information.
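The three granularity levels can be sketched as a small validation helper that sets the level on a step's options before the options are serialized to JSON. The option name provenanceGranularityLevel, the sourceDatabase entry, and the helper name with_granularity are assumptions made for illustration; check the step configuration reference for your Data Hub version before relying on them.

```python
import json

# Levels named in this section; coarse is the default.
GRANULARITY_LEVELS = ("coarse", "fine", "off")
DEFAULT_GRANULARITY = "coarse"

# Assumed option key; verify against your Data Hub version.
GRANULARITY_OPTION = "provenanceGranularityLevel"


def with_granularity(step_options, level=DEFAULT_GRANULARITY):
    """Return a copy of step_options with the granularity level set."""
    if level not in GRANULARITY_LEVELS:
        raise ValueError(
            "unknown granularity %r; expected one of %s"
            % (level, ", ".join(GRANULARITY_LEVELS))
        )
    updated = dict(step_options)  # leave the caller's dict unchanged
    updated[GRANULARITY_OPTION] = level
    return updated


if __name__ == "__main__":
    # Hypothetical step options used only to show the shape of the result.
    options = with_granularity({"sourceDatabase": "data-hub-STAGING"}, "off")
    print(json.dumps(options, indent=2))
```

Rejecting unknown values early avoids silently running a step with a misspelled level.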