Content Processing Framework Guide — Chapter 5

Developing Modules to Process Content

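This chapter describes modules in the MarkLogic Server Content Processing Framework, and includes the following sections:

  • Overview of Modules
  • Loading Modules Into the Database
  • External Variables Available to Modules
  • Design Patterns and Rules
  • Using XSLT Stylesheets Instead of Action Modules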

Overview of Modules

While domains and pipelines provide a framework to build content processing applications, the actual work of transforming or enhancing content is done through XQuery modules. An XQuery module can do an arbitrary amount of work; it can be very simple, very complex, or somewhere in between.

There are two types of XQuery modules used in content processing applications:

  • Condition modules
  • Action modules

Condition modules test for a condition and return a boolean value (true or false). Action modules typically perform some processing on the document and then call cpf:success or cpf:failure to advance the state, according to the definitions in the current pipeline.

Action modules can call out to web services inside or outside of MarkLogic Server, perform transformations on the document within MarkLogic Server, or perform any other work needed in the current phase of the pipeline.

Instead of using XQuery modules for conditions and actions, you can also use XSLT modules. If you have a path to an XSLT stylesheet in a pipeline, the action or condition is run with the specified XSLT code. For details about using XSLT stylesheets in CPF actions, see Using XSLT Stylesheets Instead of Action Modules.

Loading Modules Into the Database

XQuery modules used as action and condition modules should be stored in the database. Load the XQuery modules into the database and root specified in the evaluation context for the domain configuration. For details about where to load the XQuery modules, see Domain Scope and Code Evaluation Context.

External Variables Available to Modules

Each CPF condition module and action module has the following external variables available:

  • $cpf:document-uri: The URI of the document being processed. In Server-Side JavaScript, the name of this variable is uri.
  • $cpf:transition: The name of the transition being executed. Every action should use this external variable so it can pass the value into cpf:success and cpf:failure to advance the state of the document. In Server-Side JavaScript, the name of this variable is transition.
  • $cpf:options: The options XML node from the pipeline action. You can use this to pass in options that the module uses, so you can use the same module with different pipelines and get different behavior. In Server-Side JavaScript, the name of this variable is options.

You use these external variables to get the URI of the document being processed, the name of the transition being executed, and any options that the pipeline passes through to the module. To use them in your XQuery modules, declare them as external variables in the module prolog, as in the following code snippet from an XQuery 1.0-ml prolog:

declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as element() external;
declare variable $cpf:options as element() external;

To use the external variables in your Server-Side JavaScript code, you need to require the cpf.xqy module and declare the variables in your module, as in the following example:

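declareUpdate();
var cpf = require("/MarkLogic/cpf/cpf.xqy");

// External variables supplied by the framework.
var uri;
var transition;

// Proceed only if the transition still applies to this document.
if (cpf["check-transition"](uri, transition)) {
  try {
    // Append a copyright element to the /book root element.
    xdmp.nodeInsertChild(cts.doc(uri).xpath("/book"),
       fn.head(xdmp.unquote('<copyright><year>2010</year><holder>The Publisher</holder></copyright>')).root);
    xdmp.log("add copyright ran OK");
    // Advance the document to the on-success state.
    cpf["success"](uri, transition, null);
  } catch (e) {
    // Advance the document to the on-failure state, passing the error along.
    cpf["failure"](uri, transition, e, null);
  }
}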
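
As a rough XQuery counterpart that also reads $cpf:options, the following sketch inserts a copyright element whose holder comes from the pipeline options. The holder option name, its default value, and the /book document structure are assumptions made for illustration; they are not defined by the framework. The *: wildcard is used so that the lookup does not depend on the namespace of the options element.

```xquery
xquery version "1.0-ml";
import module namespace cpf = "http://marklogic.com/cpf"
   at "/MarkLogic/cpf/cpf.xqy";

declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as node() external;
declare variable $cpf:options as element() external;

(: Hypothetical option: a <holder> element inside the pipeline action's
   options. Falls back to a default when the option is absent. :)
let $holder := ($cpf:options//*:holder/fn:string(), "The Publisher")[1]
return
  if (cpf:check-transition($cpf:document-uri, $cpf:transition)) then
    try {
      (: Assumes the document has a /book root element. :)
      xdmp:node-insert-child(
        fn:doc($cpf:document-uri)/book,
        <copyright><holder>{ $holder }</holder></copyright>)
      ,
      cpf:success($cpf:document-uri, $cpf:transition, ())
    }
    catch ($e) {
      cpf:failure($cpf:document-uri, $cpf:transition, $e, ())
    }
  else ()
```

Because the holder value comes from the options node, two pipelines could reuse this module with different option values and get different output.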

Design Patterns and Rules

The Content Processing Framework is designed with certain assumptions about what the modules called in a pipeline will do. This section describes these rules and provides XQuery design patterns to help you follow them. Not following these rules in your XQuery modules can lead to unexpected results. The following topics are included:

  • Condition Modules Must Return a Boolean
  • Condition Modules Should Not Update Documents
  • Action Modules Use try/catch With cpf:success and cpf:failure
  • Action Modules Operate On a Single Document
  • Action Modules Must Be a Single Transaction

Condition Modules Must Return a Boolean

Condition modules must return a boolean value (true or false).

In one common scenario, a condition module checks either the existence or the value of an element in the document or in its properties document. If it exists, then the module returns true and the document needs processing for the current phase of the pipeline. Another scenario is that the condition performs some specialized logic based on some part of the document contents. The logic does not even need to pertain to the document, as long as the module returns true or false.
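
A minimal sketch of the first scenario might look like the following. It assumes a document with a /book root element and treats a missing copyright child as the signal that processing is needed; both the element names and the rule are hypothetical. The module only reads the document and evaluates to a single xs:boolean.

```xquery
xquery version "1.0-ml";
import module namespace cpf = "http://marklogic.com/cpf"
   at "/MarkLogic/cpf/cpf.xqy";

declare variable $cpf:document-uri as xs:string external;

(: True only when the document has a /book root and no copyright child yet.
   No updates are performed here. :)
let $book := fn:doc($cpf:document-uri)/book
return fn:exists($book) and fn:empty($book/copyright)
```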

Condition Modules Should Not Update Documents

A condition module should only return a boolean value; it should not perform any other work. This is an assumption of the pipeline design. If you perform updates in a condition, they change the document from the state it was in when the event occurred (potentially causing an additional trigger to fire). Do not perform any document updates in a condition module; doing so can cause non-deterministic behavior.

Action Modules Use try/catch With cpf:success and cpf:failure

The mechanism for transitioning a document from one state to the next is carried out by two functions in the cpf.xqy XQuery module: cpf:success and cpf:failure. These functions handle the logic to advance a document either to the on-success or to the on-failure state specified in the pipeline.

Action modules must call either cpf:success or cpf:failure exactly once. By using these functions in a try/catch expression in your XQuery code, it is easy to either advance the document to the success state when the code runs without exceptions, or to catch any exceptions in the XQuery code and then put the document in a failure state. This ensures that exactly one of these functions is called. In the XQuery code, you place the cpf:success call after your action code, separated by the comma operator so that both are evaluated in sequence, and you call cpf:failure in the catch clause (which is run only if the try clause throws an exception).

The try/catch expression ensures that the state is advanced in the same transaction that performs the document processing. This way, if processing is interrupted, the state of the document always matches the actual state of the content processing.

The following sample code (from the link-rename-action.xqy module) shows how to use cpf:success and cpf:failure in a try/catch expression.

xquery version "1.0-ml";
import module namespace cpf = "http://marklogic.com/cpf"
   at "/MarkLogic/cpf/cpf.xqy";
import module namespace lnk = "http://marklogic.com/cpf/links"
   at "/MarkLogic/cpf/links.xqy";

declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as node() external;

try {
   lnk:propagate-rename( $cpf:document-uri )
   ,
   cpf:success( $cpf:document-uri, $cpf:transition, () )
}
catch ($e) {
   cpf:failure( $cpf:document-uri, $cpf:transition, $e, () )
}

Action Modules Operate On a Single Document

Use care if your action modules modify multiple documents. As a general rule, action modules should only modify the document being processed; they should not modify any other documents without fully understanding the implications. Creating side effects by modifying other documents within a single transaction can cause triggers to fire on updates, which can (potentially) cause multiple updates to the same document to be queued in the task server. Therefore, you should only modify other documents if you fully understand the consequences (or if you are sure there are no triggers on the other updates).

If your action modules modify multiple documents, you must design your application to handle the side effects. Each time a document that is in the scope of a domain is updated, a trigger fires, which can initiate a new set of processing. If your application must do this, make sure you carefully think through the side effects that will occur.

Action Modules Must Be a Single Transaction

An action module must execute as a single transaction; it should not update the document more than once. Avoid using xdmp:eval or xdmp:invoke to run other transactions from within an action module. The Content Processing Framework assumes that action modules will perform a single transaction.

If you do perform multiple update transactions in an action module, you should understand the implications. Transitions are initiated as transactions complete. Having multiple transactions complete in a single action module can lead to multiple transitions being initiated in parallel, leading to unpredictable results. While it is possible to do this, your application design must take this into account. For more details, see Design Patterns For Content Applications.

Using XSLT Stylesheets Instead of Action Modules

If you specify a path to an XSLT stylesheet instead of a path to an XQuery module in a pipeline, the framework invokes the stylesheet with the appropriate variables. For an action module, the stylesheet must return the new contents of the document. The framework handles the invocation and the other mechanics (the try/catch and the call to cpf:success or cpf:failure). Your stylesheet should not perform any updates directly, because the framework applies the update for you. This differs from XQuery actions, which perform the update themselves.

If your stylesheet produces multiple result documents in an action module, the first result document is the content that updates the document under CPF control. Any subsequent result documents in the stylesheet output are saved to the database at the URIs specified in the stylesheet.
