Application Developer's Guide — Chapter 9

Properties Documents and Directories

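This chapter describes properties documents and directories in MarkLogic Server. It includes the following sections:

  • Properties Documents
  • Using Properties for Document Processing
  • Directories
  • Permissions On Properties and Directories
  • Example: Directory and Document Browser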

Properties Documents

A properties document is an XML document that shares the same URI with a document in a database. Every document can have a corresponding properties document, although the properties document is only created if properties are created. The properties document is typically used to store metadata related to its corresponding document, although you can store any XML data in a properties document, as long as it conforms to the properties document schema.

A document typically already exists at a given URI when you create its properties document, although it is possible to create a document and add properties to it in a single transaction, and it is also possible to create a property where no document exists. The properties document is stored in a separate fragment from its corresponding document.

This section describes properties documents and the APIs for accessing them, and includes the following subsections:

  • Properties Document Namespace and Schema
  • APIs on Properties Documents
  • XPath property Axis
  • Protected Properties
  • Creating Element Indexes on a Properties Document Element
  • Sample Properties Documents
  • Standalone Properties Documents

Properties Document Namespace and Schema

Properties documents are XML documents that must conform to the properties.xsd schema. The properties.xsd schema is copied to the <install_dir>/Config directory at installation time.

The properties schema is assigned the prop namespace prefix, which is predefined in the server:

http://marklogic.com/xdmp/property

The following listing shows the properties.xsd schema:

<xs:schema targetNamespace="http://marklogic.com/xdmp/property"
 xsi:schemaLocation="http://www.w3.org/2001/XMLSchema XMLSchema.xsd
                     http://marklogic.com/xdmp/security security.xsd"
 xmlns="http://marklogic.com/xdmp/property"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xhtml="http://www.w3.org/1999/xhtml"
 xmlns:sec="http://marklogic.com/xdmp/security">

  <xs:complexType name="properties">
    <xs:annotation>
      <xs:documentation>
        A set of document properties.
      </xs:documentation>
      <xs:appinfo>
      </xs:appinfo>
    </xs:annotation>
    <xs:choice minOccurs="1" maxOccurs="unbounded">
      <xs:any/>
    </xs:choice>
  </xs:complexType>

  <xs:element name="properties" type="properties">
    <xs:annotation>
      <xs:documentation>
        The container for properties.
      </xs:documentation>
      <xs:appinfo>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>

  <xs:simpleType name="directory">
    <xs:annotation>
      <xs:documentation>
        A directory indicator.
      </xs:documentation>
      <xs:appinfo>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:anySimpleType">
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="directory" type="directory">
    <xs:annotation>
      <xs:documentation>
        The indicator for a directory.
      </xs:documentation>
      <xs:appinfo>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>

  <xs:element name="last-modified" type="last-modified">
    <xs:annotation>
      <xs:documentation>
        The timestamp of last document modification.
      </xs:documentation>
      <xs:appinfo>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>

  <xs:simpleType name="last-modified">
    <xs:annotation>
      <xs:documentation>
        A timestamp of the last time something was modified.
      </xs:documentation>
      <xs:appinfo>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:dateTime">
    </xs:restriction>
  </xs:simpleType>

</xs:schema>

APIs on Properties Documents

The APIs for properties documents are XQuery functions that allow you to list, add, and set properties in a properties document. The properties APIs provide access to the top-level elements in properties documents. Because the properties are XML elements, you can use XPath to navigate to any children or descendants of the top-level property elements. The properties document is tied to its corresponding document and shares its URI; when you delete a document, its properties document is also deleted.

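The following APIs are available to access and manipulate properties documents:

  • xdmp:document-properties
  • xdmp:document-add-properties
  • xdmp:document-set-properties
  • xdmp:document-set-property
  • xdmp:document-remove-properties
  • xdmp:document-get-properties
  • xdmp:collection-properties
  • xdmp:directory
  • xdmp:directory-properties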

For the signatures and descriptions of these APIs, see the MarkLogic XQuery and XSLT Function Reference.
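
As a minimal sketch of how some of these functions fit together, the following statements set, read, and remove a single property using xdmp:document-set-property, xdmp:document-get-properties, and xdmp:document-remove-properties. The URI /example/doc.xml and the status property are hypothetical, and the sketch assumes a document already exists at that URI. The statements are separated by semicolons so that each one runs as its own transaction and sees the result of the previous update.

```xquery
xquery version "1.0-ml";
(: Hypothetical URI and property name, for illustration only.
   Add the property, or replace an existing property with the same name. :)
xdmp:document-set-property("/example/doc.xml", <status>draft</status>);

xquery version "1.0-ml";
(: Return only the top-level property elements named status. :)
xdmp:document-get-properties("/example/doc.xml", xs:QName("status"));

xquery version "1.0-ml";
(: Remove every top-level property named status. :)
xdmp:document-remove-properties("/example/doc.xml", xs:QName("status"))
```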

XPath property Axis

MarkLogic has extended XPath (available in both XQuery and XSLT) to include the property axis. The property axis (property::) allows you to write an XPath expression to search through items in the properties document for a given URI. These expressions allow you to perform joins across the document and property axes, which is useful when storing state information for a document in a property. For details on this approach, see Using Properties for Document Processing.

The property axis is similar to the forward and reverse axes in an XPath expression. For example, you can use the child:: forward axis to traverse to a child element in a document. For details on the XPath axes, see the XPath 2.0 specification and XPath Quick Reference in the XQuery and XSLT Reference Guide.

The property axis contains all of the children of the properties document node for a given URI.

The following example shows how you can use the property axis to access properties for a document while querying the document:

Create a test document as follows:

xdmp:document-insert("/test/123.xml",
  <test>
    <element>123</element>
  </test>)

Add a property to the properties document for the /test/123.xml document:

xdmp:document-add-properties("/test/123.xml", 
  <hello>hello there</hello>)

If you list the properties for the /test/123.xml document, you will see the property you just added:

xdmp:document-properties("/test/123.xml")
=>
<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <hello>hello there</hello>
</prop:properties>

You can now search through the property axis of the /test/123.xml document, as follows:

doc("/test/123.xml")/property::hello
=>
<hello>hello there</hello>
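
The property axis can also appear inside a predicate, which joins the document content with its properties. The following sketch reuses the /test/123.xml document and the hello property created above; it selects the element child of the test root element only if the document has a hello property with the given value.

```xquery
xquery version "1.0-ml";

(: Filter on a property value, then continue navigating the document itself. :)
doc("/test/123.xml")/test[property::hello = "hello there"]/element
```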

Protected Properties

The following properties are protected, and they can only be created or modified by the system:

  • prop:directory
  • prop:last-modified

These properties are reserved for use directly by MarkLogic Server; attempts to add or delete properties with these names fail with an exception.

Creating Element Indexes on a Properties Document Element

Because properties documents are XML documents, you can create element (range) indexes on elements within a properties document. If you use properties to store numeric or date metadata about the document to which the properties document corresponds, for example, you can create an element index to speed up queries that access the metadata.
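
One way to create such an index is with the Admin API. The following is a sketch under stated assumptions, not a definitive configuration: the database is assumed to be named Documents, and priority is a hypothetical property element (in no namespace) that holds integer values. Running it requires a user with administrative privileges.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumptions: a database named "Documents" and a hypothetical
   <priority> property element containing integers. :)
let $config := admin:get-configuration()
let $index := admin:database-range-element-index(
                "int",       (: scalar type of the indexed values :)
                "",          (: namespace URI of the element :)
                "priority",  (: local name of the element :)
                "",          (: collation; only relevant for string indexes :)
                fn:false())  (: do not index range value positions :)
let $config := admin:database-add-range-element-index(
                 $config, xdmp:database("Documents"), $index)
return admin:save-configuration($config)
```

You can create the same index in the Admin Interface instead; the scripted form is convenient when the configuration needs to be repeatable.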

Sample Properties Documents

Properties documents are XML documents that conform to the schema described in Properties Document Namespace and Schema. You can list the contents of a properties document with the xdmp:document-properties function. If there is no properties document at the specified URI, the function returns the empty sequence. A properties document for a directory has a single empty prop:directory element. For example, if there exists a directory at the URI http://myDirectory/, the xdmp:document-properties command returns a properties document as follows:

xdmp:document-properties("http://myDirectory/")
=>
<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <prop:directory/>
</prop:properties>

You can add whatever you want to a properties document (as long as it conforms to the properties schema). If you run the function xdmp:document-properties with no arguments, it returns a sequence of all the properties documents in the database.

Standalone Properties Documents

Typically, properties documents are created alongside the corresponding document that shares its URI. It is possible, however, to create a properties document at a URI with no corresponding document at that URI. Such a properties document is known as a standalone properties document. To create a standalone properties document, use the xdmp:document-add-properties or xdmp:document-set-properties APIs, and optionally add the xdmp:document-set-permissions, xdmp:document-set-collections, and/or xdmp:document-set-quality APIs to set the permissions, collections, and/or quality on the properties document.

The following example creates a properties document and sets permissions on it:

xquery version "1.0-ml";

xdmp:document-set-properties("/my-props.xml", <my-props/>),
xdmp:document-set-permissions("/my-props.xml", 
   (xdmp:permission("dls-user", "read"),
    xdmp:permission("dls-user", "update")))

If you then run xdmp:document-properties on the URI, it returns the new properties document:

xquery version "1.0-ml";

xdmp:document-properties("/my-props.xml")
(: returns: 
<?xml version="1.0" encoding="ASCII"?>
<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <my-props/>
  <prop:last-modified>2010-06-18T18:19:10-07:00</prop:last-modified>
</prop:properties>
:)

Similarly, you can pass in functions to set the collections and quality on the standalone properties document, either when you create it or after it is created.
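
For instance, the following sketch sets a collection and a quality on the /my-props.xml standalone properties document after it has been created. The collection name and the quality value are illustrative assumptions.

```xquery
xquery version "1.0-ml";

(: "my-collection" and the quality value 10 are hypothetical. :)
xdmp:document-set-collections("/my-props.xml", "my-collection"),
xdmp:document-set-quality("/my-props.xml", 10)
```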

Using Properties for Document Processing

When you need to update large numbers of documents, sometimes in multi-step processes, you often need to keep track of the current state of each document. For example, if you have a content processing application that updates millions of documents in three steps, you need to have a way of programmatically determining which documents have not been processed at all, which have completed step 1, which have completed step 2, and so on.

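This section describes how to use properties to store metadata for use in a document processing pipeline. It includes the following subsections:

  • Using the property Axis to Determine Document State
  • Document Processing Problem
  • Solution for Document Processing
  • Basic Commands for Running Modules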

Using the property Axis to Determine Document State

You can use properties documents to store state information about documents that undergo multi-step processing. Joining across properties documents can then determine which documents have been processed and which have not. The queries that perform these joins use the property:: axis (for details, see XPath property Axis).

Joins across the property axis that have predicates are optimized for performance. For example, the following returns foo root elements from documents that have a property bar:

foo[property::bar]

The following examples show the types of queries that are optimized for performance (where /a/b/c is some XPath expression):

  • Property axis in predicates:
    /a/b/c[property::bar]
  • Negation tests on property axis:
    /a/b/c[not(property::bar = "baz")]
  • Continuing path expression after the property predicate:
    /a/b/c[property::bar and bob = 5]/d/e
  • Equivalent FLWOR expressions:
    for $f in /a/b/c 
    where $f/property::bar = "baz"
    return $f

Other types of expressions will work but are not optimized for performance, including the following:

  • If you want the bar property of documents whose root elements are foo:
    /foo/property::bar

Document Processing Problem

The approach outlined in this section works well for situations such as the following:

  • You have already loaded 1 million documents and now want to update all of them. The pseudo-code for this is as follows:
    for $d in fn:doc() 
    return some-update($d) 

    These types of queries will eventually run out of tree cache memory and fail.

  • When iterative calls of the following form become progressively slow:
    for $d in fn:doc()[k to k+10000] 
    return some-update($d)

For these types of scenarios, using properties to test whether a document needs processing is an effective way of being able to batch up the updates into manageable chunks.

Solution for Document Processing

This content processing technique works in a wide variety of situations. The approach satisfies the following requirements:

  • Works with large existing datasets.
  • Does not require you to know, before you load the datasets, that you will need to process them further later.
  • Works in situations in which data is still arriving (for example, new data is added every day).
  • Can ultimately transition into a steady-state, content-processing-enabled environment.

The following are the basic steps of the document processing approach:

  1. Take an iterative strategy, but one that does not become progressively slow.
  2. Split the reprocessing activity into multiple updates.
  3. Use properties (or lack thereof) to identify the documents that (still) need processing.
  4. Repeatedly call the same module, updating each document as well as its property:
    for $d in fn:doc()/root[not(property::some-update)][1 to 10000]
    return some-update($d)
  5. If there are any documents that still need processing, invoke the module again.
  6. The pseudo-code for the module that processes documents that do not have a specific property is as follows:
    let $docs := get n documents that have no properties
    return
    for $processDoc in $docs
    return if (empty $processDoc)
           then ()
           else ( process-document($processDoc),
                  update-property($processDoc) )
    ,
    xdmp:spawn(process_module) 

    This pseudo-code does the following:

    • Gets the URIs of documents that do not have a specific property.
    • For each URI, checks whether the specific property exists.
    • If the property exists, does nothing to that document (it has already been updated).
    • If the property does not exist, updates both the document and the property.
    • Continues this for all of the URIs.
    • When all of the URIs have been processed, calls the module again to get any new documents (ones with no properties).
  7. (Optional) Automate the process by setting up a Content Processing Pipeline.
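
The steps above can be sketched as a single module that processes one batch and then re-queues itself. This is an illustration under stated assumptions rather than a definitive implementation: the module is assumed to be stored at /process-batch.xqy relative to the App Server root, the marker property is assumed to be named processed, the batch size is arbitrary, and local:process-document is a placeholder for the real update.

```xquery
xquery version "1.0-ml";

(: Hypothetical batch size; tune it to keep each transaction manageable. :)
declare variable $BATCH-SIZE as xs:integer := 1000;

(: Placeholder for the real document update. :)
declare function local:process-document($doc as node()) as empty-sequence()
{
  ()
};

(: Root elements of up to $BATCH-SIZE documents that have no
   "processed" property (hypothetical property name). :)
let $docs := fn:subsequence(
               fn:doc()/*[fn:not(property::processed)], 1, $BATCH-SIZE)
return
  if (fn:empty($docs))
  then ()  (: nothing left to process, so stop re-queuing :)
  else (
    for $doc in $docs
    return (
      local:process-document($doc),
      (: Record that this document has been handled. :)
      xdmp:document-set-property(
        xdmp:node-uri($doc),
        <processed>{fn:current-dateTime()}</processed>)
    ),
    (: Queue the next batch on the Task Server (assumed module path). :)
    xdmp:spawn("/process-batch.xqy")
  )
```

Because each run handles a bounded number of documents and selects them by the absence of the property, the work does not slow down as more documents are completed. The sketch omits error handling and does not coordinate concurrent runs.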

Basic Commands for Running Modules

The following built-in functions are needed to perform automated content processing:

  • To put a module on the Task Server queue:
    xdmp:spawn($path, $external-vars, $options)
  • To evaluate an entire module (similar to xdmp:eval, but for modules):
    xdmp:invoke($path, $external-vars)
    xdmp:invoke-in($path, $database-id, $external-vars)
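
As a brief illustration of passing external variables, the following sketch invokes a module with one variable. The module path /my-module.xqy and the variable name batch-size are hypothetical; the invoked module is assumed to declare the variable as external. External variables are passed as a sequence of alternating QNames and values.

```xquery
xquery version "1.0-ml";

(: Hypothetical module path and variable name. The invoked module is
   assumed to contain:
     declare variable $batch-size as xs:integer external;  :)
xdmp:invoke("/my-module.xqy",
            (xs:QName("batch-size"), 500))
```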

Directories

Directories have many uses, including organizing your document URIs and using them with WebDAV servers. This section includes the following items about directories:

  • Properties and Directories
  • Directories and WebDAV Servers
  • Directories Versus Collections

Properties and Directories

When you create a directory, MarkLogic Server creates a properties document with a prop:directory element. If you run the xdmp:document-properties command on the URI corresponding to a directory, the command returns a properties document with an empty prop:directory element, as shown in the following example:

xdmp:directory-create("/myDirectory/");
xdmp:document-properties("/myDirectory/")
=>
<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <prop:directory/>
</prop:properties>

You can create a directory with any unique URI, but the convention is for directory URIs to end with a forward slash (/). It is possible to create a document with the same URI as a directory, but this is not recommended; the best practice is to reserve URIs ending in slashes for directories.

Because xdmp:document-properties with no arguments returns all of the properties documents in the database, and because each directory has a prop:directory element, you can easily write a query that returns all of the directories in the database. Use the xdmp:node-uri function to accomplish this as follows:

xquery version "1.0-ml";

for $x in xdmp:document-properties()/prop:properties/prop:directory
return <directory-uri>{xdmp:node-uri($x)}</directory-uri>

Directories and WebDAV Servers

Directories are needed for use in WebDAV servers. To create a document that can be accessed from a WebDAV client, the parent directory must exist. The parent directory of a document is the directory whose URI is the prefix of the document URI (for example, the parent directory of the URI http://myserver/doc.xml is http://myserver/). When using a database with a WebDAV server, ensure that the directory creation setting on the database configuration is set to automatic (this is the default setting), which causes parent directories to be created when documents are created. For information on using directories in WebDAV servers, see WebDAV Servers in the Administrator's Guide.

Directories Versus Collections

You can use both directories and collections to organize documents in a database. The following are important differences between directories and collections:

  • Directories are hierarchical in structure (like a filesystem directory structure). Collections do not have this requirement. Because directories are hierarchical, a directory URI must contain any parent directories. Collection URIs do not need to have any relation to the URIs of the documents that belong to the collection. For example, a directory named http://marklogic.com/a/b/c/d/e/ (where http://marklogic.com/ is the root) requires the existence of the parent directories d, c, b, and a. With collections, any document (regardless of its URI) can belong to a collection with the given URI.
  • Directories are required for WebDAV clients to see documents. In other words, to see a document with URI /a/b/hello/goodbye in a WebDAV server with /a/b/ as the root, directories with the following URIs must exist in the database:
    /a/b/
    /a/b/hello/

Except for the fact that you can use both directories and collections to organize documents, directories are unrelated to collections. For details on collections, see Collections in the Search Developer's Guide. For details on WebDAV servers, see WebDAV Servers in the Administrator's Guide.
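
The difference shows up in how you query each one. In the following sketch, the directory URI and the collection URI are hypothetical: xdmp:directory selects documents by where their URIs sit in the hierarchy, while fn:collection selects documents by explicitly assigned membership.

```xquery
xquery version "1.0-ml";

(: Hypothetical directory and collection URIs. :)

(: Documents that are immediate children of /a/b/hello/ ;
   membership follows from each document's URI. :)
xdmp:directory("/a/b/hello/", "1"),

(: Documents in a collection; membership is assigned explicitly
   and is independent of each document's URI. :)
fn:collection("http://example.com/my-collection")
```

Passing "infinity" instead of "1" as the second argument to xdmp:directory returns documents at any depth below the directory.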

Permissions On Properties and Directories

Like any document in a MarkLogic Server database, a properties document can have permissions. Since a directory has a properties document (with an empty prop:directory element), directories can also have permissions. Permissions on properties documents are the same as the permissions on their corresponding documents, and you can list the permissions with the xdmp:document-get-permissions function. Similarly, you can list the permissions on a directory with the xdmp:document-get-permissions function. For details on permissions and on security, see the Security Guide.
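
For example, the following sketch lists the permissions on a directory. It assumes a directory exists at the URI /myDirectory/, as in the earlier directory example.

```xquery
xquery version "1.0-ml";

(: Assumes a directory exists at /myDirectory/. :)
xdmp:document-get-permissions("/myDirectory/")
```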

Example: Directory and Document Browser

Using properties documents, you can build a simple application that lists the documents and directories under a URI. The following sample code uses the xdmp:directory function to list the children of a directory (which correspond to the URIs of the documents in the directory), and the xdmp:directory-properties function to find the prop:directory element, indicating that a URI is a directory. This example has two parts:

  • Directory Browser Code
  • Setting Up the Directory Browser

Directory Browser Code

The following is sample code for a very simple directory browser.

xquery version "1.0-ml";
(:   directory browser  
           Place in Modules database and give execute permission :)

declare namespace prop="http://marklogic.com/xdmp/property";

(: Set the root directory of your AppServer for the 
   value of $rootdir :)
let $rootdir := (xdmp:modules-root()) 
(: take all but the last part of the request path, after the 
   initial slash :)
let $dirpath := fn:substring-after(fn:string-join(fn:tokenize(
                xdmp:get-request-path(), "/")[1 to last() - 1], 
                "/"), "/")
let $basedir := if ( $dirpath eq "" )
                then ( $rootdir )
                else fn:concat($rootdir, $dirpath, "/")
let $uri := xdmp:get-request-field("uri", $basedir)
return if (ends-with($uri, "/")) then
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <title>MarkLogic Server Directory Browser</title>
   </head>
   <body>
     <h1>Contents of {$uri}</h1>
	<h3>Documents</h3>
{
   for $d in xdmp:directory($uri, "1")
   let $u := xdmp:node-uri($d)
   (: get the last two, and take the last non-empty string :)
   let $basename :=
     tokenize($u, "/")[last(), last() - 1][not(. = "")][last()]
   order by $basename
   return element p {
     element a {

(:  The following should work for all $basedir values, as long 
    as the string represented by $basedir is unique in the 
    document URI :)
       attribute href { substring-after($u,$basedir) },
       $basename
     }
   }
}
       <h3>Directories</h3>
{
   for $d in xdmp:directory-properties($uri, "1")//prop:directory
   let $u := xdmp:node-uri($d)
   (: get the last two, and take the last non-empty string :)
   let $basename :=
     tokenize($u, "/")[last(), last() - 1][not(. = "")][last()]
   order by $basename
   return element p {
     element a {
       attribute href { concat(
                               xdmp:get-request-path(),
                               "?uri=",
                               $u) },
       concat($basename, "/")
     }
   }
}
</body>
</html>
else doc($uri)

(: browser.xqy :)

This application writes out an HTML document with links to the documents and directories in the root of the server. The application finds the documents in the root directory using the xdmp:directory function, finds the directories using the xdmp:directory-properties function, does some string manipulation to get the last part of the URI to display, and keeps the state using the application server request object built-in XQuery functions (xdmp:get-request-field and xdmp:get-request-path).

Setting Up the Directory Browser

To run this directory browser application, perform the following:

  1. Create an HTTP Server and configure it as follows:
    1. Set the Modules database to be the same database as the Documents database. For example, if the database setting is set to the database named my-database, set the modules database to my-database as well.
    2. Set the HTTP Server root to http://myDirectory/, or set the root to another value and modify the $rootdir variable in the directory browser code so it matches your HTTP Server root.
    3. Set the port to 9001, or to a port number not currently in use.
  2. Copy the sample code into a file named browser.xqy. If needed, modify the $rootdir variable to match your HTTP Server root. Using the xdmp:modules-root function, as in the sample code, will automatically get the value of the App Server root.
  3. Load the browser.xqy file into the Modules database at the top level of the HTTP Server root. For example, if the HTTP Server root is http://myDirectory/, load the browser.xqy file into the database with the URI http://myDirectory/browser.xqy. You can load the document either via a WebDAV client (if you also have a WebDAV server pointed to this root) or with the xdmp:document-load function.
  4. Make sure the browser.xqy document has execute permissions. You can check the permissions with the following function:
    xdmp:document-get-permissions("http://myDirectory/browser.xqy")

    This command returns all of the permissions on the document. It should have execute capability for a role possessed by the user running the application. If it does not, you can add the permissions with a command similar to the following:

    xdmp:document-add-permissions("http://myDirectory/browser.xqy", 
                                xdmp:permission("myRole", "execute"))

    where myRole is a role possessed by the user running the application.

  5. Load some other documents into the HTTP Server root. For example, drag and drop some documents and folders into a WebDAV client (if you also have a WebDAV server pointed to this root).
  6. Access the browser.xqy file with a web browser using the host and port number from the HTTP Server. For example, if you are running on your local machine and you have set the HTTP Server port to 9001, you can run this application from the URL http://localhost:9001/browser.xqy.

You should see links to the documents and directories you loaded into the database. If you did not load any other documents, you will just see a link to the browser.xqy file.