Loading TOC...
Application Developer's Guide (PDF)

Application Developer's Guide — Chapter 18

Template Driven Extraction (TDE)

Template Driven Extraction (TDE) enables you to define a relational lens over your document data, so you can query parts of your data using SQL or the Optic API. Templates let you specify which parts of documents make up rows in a view. You can also use templates to define a semantic lens, specifying which values from a document make up triples in the triple index.

TDE enables you to generate rows and triples from ingested documents based on predefined templates that describe the following:

  • The input data to match
  • The data transformations that apply to the matched data
  • The final data projections that are translated into indexed data.

TDE enables you to access the data in your documents in several ways, without changing the documents themselves. A relational lens is useful when you want to let SQL-savvy users access your data and when users want to create reports and visualizations using tools that communicate using SQL. It is also useful when you want to join entities and perform aggregates across documents. A semantic lens is useful when your documents contain some data that is naturally represented and queried as triples, using SPARQL.

TDE is applied during indexing at ingestion time and serves the following purposes:

TDE data is also used by the Optic API, as described in Optic API for Relational Operations.

The tde-admin role is required in order to insert a template into the schema database.

The main topics in this chapter are:

Security on TDE Documents

Operations on template documents are controlled by:

The http://marklogic.com/xdmp/tde collection, which is a protected collection that contains TDE template documents.

The tde-admin role, which is required to access the TDE protected collection.

The tde-view role, which is required to view documents in the TDE protected collection. Access to views can be further restricted by setting additional permissions on the template documents that define the views. Since the same view can be declared in multiple templates loaded with different permissions, the access to views should be controlled at the column level as follows:

Column level read permissions are implicit and are derived from the read permissions set on the template documents. Permissions on a column are not required to be identical and are ORed together. A user with a role that has at least one of the read permissions set on a column will be able to see the column.

If a user does not have permissions on any of the view's columns, the view itself is not visible.

For example, as shown in the illustration below:

  • Template document TD1 creates view View 1 with column C1 and C2. Template document TD1 was loaded with Read Permission 1.
  • Template document TD2 creates view View 1 with column C1 and C3. Template document TD2 was loaded with Read Permission 2.
  • Users with Permission 1 have access to columns C1 and C2 at query time.
  • Users with Permission 2 have access to columns C1 and C3 at query time.
  • Users without Permission 1 or Permission 2 will not have access to View 1 or any of its columns.

With this design:

  • Users can see columns referenced in templates they have access to.
  • Users cannot see additional columns referenced in templates they do not have access to.

If a document in a TDE protected collection makes use of Element Level Security, under most circumstances only the unprotected elements will be extracted; the protected elements will be ignored. For details on Element Level Security, see Element Level Security in the Security Guide.

Template View Elements

A template contains the following elements and their child elements:

ElementDescription
description
Optional description of the template.
collections
  collection
  collections-and
    collection
Optional collection scopes. Multiple collection scopes can be ORed or ANDed. See Collections.
directories
  directory
Optional directory scopes. Multiple directory scopes are ORed together. See Directories.
vars
  var
Optional intermediate variables extracted at the current context level. See Variables.
rows
  row
   schema-name
   view-name
   view-layout
     sparse
     identical
   columns
     column
       name
       scalar-type
       val
       nullable
       default
       invalid-values
          ignore
          reject
       reindexing
          hidden
          visible
       collation

These elements are used for template views, as described in Creating Template Views in the SQL Data Modeling Guide.

rows is a sequence of row descriptions and mappings, as described in Row in the SQL Data Modeling Guide.

columns is sequence of column descriptions and mappings, as described in Columns in the SQL Data Modeling Guide.

scalar-type is the type for the val. See Type Casting.

triples
  triple
    subject
      val
      invalid-values
    predicate
      val
      invalid-values
    object
      val
      invalid-values

These elements are used for triple-extraction templates, as described in Using a Template to Identify Triples in a Document in the Semantics Developer's Guide.

triples contains a sequence of triple extraction descriptions. Each triple description defines the data mapping for the subject, predicate and object.

An extracted triple's graph cannot be specified through the template. The graph is implicitly defined by the document's collection similar to embedded triples.

templates
  template
Optional sequence of sub-templates. For details, see Creating Views from Multiple Templates and Creating Views from Nested Templates in the SQL Data Modeling Guide.
path-namespaces
  path-namespace
Optional sequence of namespace bindings. See path-namespaces.
context
The lookup node that is used for template activation and data extraction. See Context.
enabled
A boolean that specifies whether the template is enabled (true) or disabled (false). Default value is true.

The context, vars, and columns identify XQuery elements or JSON properties by means of path expressions. Path expressions are based on XPath, which is described in XPath Quick Reference in the XQuery and XSLT Reference Guide and Traversing JSON Documents Using XPath.

Collections

A <collections> section defines the scope of the template to be confined only to documents in specific collections. The <collections> section is a top level OR of a sequence of:

  • <collection> that scope the template to a specific collection.
  • <collections-and> that contains a sequence of <collection> that are ANDed together.

The following collection logical combinations are possible:

ORed collections:

<collections>
  <collection>A</collection>
  <collection>B</collection>
</collections>

ANDed collections:

<collections>
  <collections-and>
    <collection>A</collection>
    <collection>B</collection>
  </collections-and>
</collections>

OR of ANDed collections:

<collections>
  <collection>A</collection>
  <collection>B</collection>
  <collections-and>
    <collection>C</collection>
    <collection>D</collection>
  </collections-and>
  <collections-and>
    <collection>E</collection>
    <collection>F</collection>
  </collections-and>
</collections>

Directories

A <directories> section defines the scope of the template to be confined only to documents in specific directories. The <directories> section is a top level OR of a sequence of <directory> elements that scope the template to a specific directory.

path-namespaces

A <path-namespaces> section is a top level of one or more <path-namespace> elements, which contain:

  • <prefix> the namespace prefix.
  • <namespace-uri> the namespace URI.

For example, a path namespace binding can be specified in the template as follows:

<path-namespaces>
  <path-namespace>
    <prefix>wb</prefix>
    <namespace-uri>http://marklogic.com/wb</namespace-uri>
  </path-namespace>
</path-namespaces>

The namespace prefix definitions are stored in the template documents and not in the configuration of the target database. Otherwise, templates cannot be compiled into code without knowing the target database configuration that uses them.

Context

The context tag defines the lookup node that is used for template activation and data extraction. Path expressions occurring inside vars, rows, or triples are relative to the context element of their parent template. The context defines an anchor in the XML/JSON tree where data is collected by walking up and down the tree relative to the anchor. Any indexable path expression is valid in the context element, therefore predicates are allowed. The context element of a sub-template is relative to the context element of its parent template.

For example:

<context>/Employee</context>
<context>/MedlineCitation/Article</context>

It is important to understand that context defines the node from which to extract a single row. If you want to extract multiple rows from the document, the context must be set to the parent element of those rows. For example, you have 'order' documents that are as structured as follows:

<order>
  <order-num>10663</order-num>
  <order-date>2017-01-15</order-date>
  <items>
    <item>
      <product>SpeedPro Ultimate</product>
      <price>999</price>
      <quantity>1</quantity>
    </item>
    <item>
      <product>Ladies Racer Helmet</product>
      <price>115</price>
      <quantity>1</quantity>
    </item>
  </items>
</order>

Each order document contains one or more <item> nodes. You want to create a view template that extracts the <product>, <price>, and <quantity> values from each <item> node. A context of /order and column values, such as items/item/product, will trigger a single row extraction for the entire document, so the only way this will work is if the document has only one <item> node. To extract the content of all of the <item> nodes as multiple rows, the context must be /order/items/item. In this case, if you wanted to also extract <order-num>, the column value would be ../../order-num.

The context must be a searchable path. It may contain wildcards, such as ‘*', but, for performance reasons, wildcards should not be used unless their value outweighs the performance costs.

Variables

Variables are intermediate data projections needed for data transformation and are defined under var elements. Variables can reference other variables inside their transformation section val, for the cases where several intermediate projection/transformations are needed before the last projection into the column/triple. The expression inside the val code is relative to the context element of the current template in which the var is defined. See Template Dialect and Data Transformation Functions for the types of expressions allowed in a val.

For example:

<context>/northwind/Orders/Order</context>

.......

<vars>
    <var>
      <name>OrderID</name>
      <val>./@OrderID</val>
    </var>
</vars>

.......

<column>
    <name>OrderID</name>
    <scalar-type>long</scalar-type>
    <val>$OrderID</val>
</column>

You do not type variable values in the var description. Rather, the variable value is typed in the column description.

Template Dialect and Data Transformation Functions

Templates support a dialect using a subset of XQuery with limited functionalities where only a subset of functions are available.

The template dialect supports the following types of expressions described in the Expressions section of the An XML Query Language specification:

More complex operations like looping, FLWOR statements, iterations, and XML construction are not supported within the dialect. The property axis property:: is also not supported.

The supported XQuery functions are listed in the following sections:

Type Casting

Mathematical Functions

And all the math (http://marklogic.com/xdmp/math namespace) built-in functions except aggregate functions like variance and stddev.

Validating and Inserting a Template

The tde-admin role is required in order to insert a template into the schema database.

For best performance, it is recommended that you do not configure your content database to use the default Schemas database and instead create your own schemas database for your template documents. If you create multiple content databases to hold documents to be extracted by TDE, each content database should have its own schema database. Failure to do so may result in unexpected indexing behavior on the content databases.

You should always validate your template before inserting your view into a schema database. To validate your view, use the tde:validate function as follows:

let $viewTemplate :=
<template xmlns="http://marklogic.com/xdmp/tde">
  ..... 
</template>
return tde:validate($viewTemplate)

A valid template will return the following:

<map:map xmlns:map="http://marklogic.com/xdmp/map"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <map:entry key="valid">
     <map:value xsi:type="xs:boolean">true</map:value>
  </map:entry>
</map:map>

Do not use xdmp:validate to validate your template, as this function may miss some validation steps.

After you have confirmed that the view template is valid, you can insert your view template into the schema database used by the content database holding the document data. You can use any method for inserting documents into the database to insert a view template, but you must insert the template document into the http://marklogic.com/xdmp/tde collection.

The tde:template-insert function is a convenience that validates the template, inserts the template document into the tde collection in the schema database (if executed on the content database) with the default permissions, and triggers a re-index of the database.

When a template is inserted, only those document fragments affected by the template are re-indexed.

For example, to define and insert a view template, you would enter the following:

xquery version "1.0-ml"; 
import module namespace tde = "http://marklogic.com/xdmp/tde" 
        at "/MarkLogic/tde.xqy";
let $ClinicalView := 
<template xmlns="http://marklogic.com/xdmp/tde">
  <description>populates patients' data</description>
  <context>/Citation/Article</context>
  <rows>
    <row>
      <schema-name>Medical2</schema-name>
      <view-name>Publications</view-name>
      <columns>
        <column>
          <name>ID</name>
          <scalar-type>long</scalar-type>
          <val>../ID</val>
        </column>
        <column>
          <name>ISSN</name>
          <scalar-type>string</scalar-type>
          <val>Journal/ISSN</val>
        </column>
      </columns>
    </row>
  </rows>
</template>
return tde:template-insert("/Template.xml", $ClinicalView)

If you use an alternative insert operation, you must explicitly insert the template document into the http://marklogic.com/xdmp/tde collection of the schema database used by your content database. For example:

return xdmp:document-insert(
      "/Template.xml",
      $ClinicalView,
      (),
      "http://marklogic.com/xdmp/tde")

Templates and Non-Conforming Documents

Once you have inserted a TDE template for a content database, an attempt to insert a document that does not conform to the template may generate and error.

'Doesn't conform' might mean that the template says you should have a price element at some path and the column is not nullable, and there's no default value. But the inserted document has no price element at that path, or perhaps there is a price in the document but it can't be cast to the type of the column.

If the document is already in the database and you add the template, you may not want to delete the non-conforming document, but you do want to be aware of its existence. If you set the log level to debug, then in the case where you added a template and some existing documents are non-conforming, you'll get an error in the error log for each document that doesn't get indexed. For details on setting the log level, see Understanding the Log Levels in the Administrator's Guide.

If the template is already in place and you try to insert the non-conforming document, there are two possible outcomes:

  • The insert fails with an error
  • The insert succeeds, but the row with the missing price column is skipped (it doesn't get added to the index)

You can control which of outcome happens by setting invalid-values in the template to reject (reject the non-conforming document and throw an error) or ignore (allow the document insert and ignore that row for indexing purposes).

Enabling and Disabling Templates

Templates can be enabled and disabled by modifying the <enabled> flag on the template. Set the <enabled> flag to true to enable the template or false to disable,

For example, to disable the template set the <enabled> flag to false, as follows:

<template xmlns="http://marklogic.com/xdmp/tde">
    <context>/foo/bar</context>
    <enabled>false</enabled>
       ...
</template>

Reindexing will start automatically every time a template is enabled or disabled.

Deleting Templates

Template documents can be safely deleted once they have been disabled and after enough time has elapsed to make sure that the reindexing related to the disabled template has completed.

Accidental deletion of a template can be fixed by:

  1. Reinserting the template in a disabled state.
  2. Reusing the same template document URI for a new template.
  3. Manually reindexing the database.
« Previous chapter
Next chapter »
Powered by MarkLogic Server 7.0-4.1 and rundmc | Terms of Use | Privacy Policy