This chapter describes ways to use the MarkLogic Server Content Processing Framework to create custom applications that run your own modules. It includes the following sections:

Install Content Processing Framework in Database
Decide on the Stages and Logic for Your Processing
Create and Load Your Pipelines
Create and Load Your Modules
Design Patterns For Content Applications
Microsoft Office 2007 and Later Documents
Other CPF Applications Included With MarkLogic Server

Install Content Processing Framework in Database

You must install the Content Processing Framework in each database that contains documents you want to process. This section describes the procedure for installing the Status Change Handling Pipeline in a database, which enables you to create applications using the Content Processing Framework.

If you are using the Default Conversion Option, follow the installation steps in The Default Conversion Option instead of the steps in this section. The installation process for the Default Conversion Option includes the installation of the Status Change Handling Pipeline, so the following procedure is not needed.

Perform the following steps to install the Content Processing Framework in a database:

If it is not already installed, install MarkLogic Server.
Open the Admin Interface to the database page for the database in which you want to install the Content Processing Framework.
On the database configuration page, select a triggers database to use with your database (for example, Triggers). You can use any database for the triggers database. It can be the same database as the one you are configuring (for example, you can set the Documents database as the triggers database for the Documents database) or it can be a different database (for example, the Triggers database created as part of the installation process).
Click OK to apply the changes to the database configuration.
Click the Content Processing link under the database to which you want to install the Content Processing Framework. The Content Processing Summary page appears.
On the Content Processing Summary page, click the Install tab. The Content Processing Installation page appears.
On the Content Processing Installation page, select false for the enable conversion option and click Install.
If you have purchased the Default Conversion Option and select true for the enable conversion option, the default conversion pipelines and domain will be installed (in addition to the Content Processing Framework). If you have not purchased the Default Conversion Option, the enable conversion option will be greyed out.
Click OK to confirm the installation of content processing in your database.
When the installation is complete, the Content Processing Summary page appears. It should show content processing installed in your database.
On the Content Processing Summary page, click the default domain for your database (for example, Default Documents if you chose the Documents database).
On the Domain Configuration page, modify any of the default values as needed to define the set of documents on which your processing will operate. The default values set the document scope to include any documents whose URI begins with a / and to use modules in the configured modules database whose URIs begin with a /. For details on setting the domain values, see Creating and Modifying Domains.
Attach any relevant pipelines to the domain, as described in Attaching and Detaching Pipelines to Domains.

The Content Processing Framework is now installed for the database.

Decide on the Stages and Logic for Your Processing

The Content Processing Framework makes it easy to set up content processing applications that have multiple processing stages. The number of stages you need depends on the processing you need to do. Also, the number of passes you need to make through a document might contribute to the number of stages needed in your application because each stage can only result in a single update transaction to the document (see Action Modules Must Be a Single Transaction).

Pipeline actions are typically called based on a condition, so you need to decide on the logic for your conditions as well as your actions.
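
When planning stages, it can help to write the state transitions down before building any pipelines. The following is a minimal, language-neutral sketch in Python, not MarkLogic code: the Transition class, the state names, and the two example actions are all hypothetical and exist only to illustrate the idea of one condition and one update per stage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A document is modelled as a plain dict; "state" stands in for its processing state.
Document = Dict[str, str]


@dataclass
class Transition:
    state: str                              # state in which this stage applies
    condition: Callable[[Document], bool]   # decides whether the action runs
    action: Callable[[Document], None]      # performs the stage's single update
    on_success: str                         # state the document moves to afterwards


def normalize(doc: Document) -> None:
    """Hypothetical stage 1: collapse runs of whitespace in the text."""
    doc["text"] = " ".join(doc["text"].split())


def count_words(doc: Document) -> None:
    """Hypothetical stage 2: record a word count computed from the text."""
    doc["words"] = str(len(doc["text"].split()))


PIPELINE: List[Transition] = [
    Transition("initial", lambda d: "text" in d, normalize, "normalized"),
    Transition("normalized", lambda d: True, count_words, "done"),
]


def run_stage(doc: Document, transitions: List[Transition]) -> bool:
    """Run at most one stage matching the document's current state."""
    for t in transitions:
        if t.state == doc["state"] and t.condition(doc):
            t.action(doc)               # one update per stage
            doc["state"] = t.on_success
            return True
    return False


def process(doc: Document) -> List[str]:
    """Run stages until none applies; return the states the document visited."""
    trail = [doc["state"]]
    while run_stage(doc, PIPELINE):
        trail.append(doc["state"])
    return trail
```

In this sketch the second stage needs the output of the first, so the work is split into two stages rather than one. That mirrors the point above that the number of passes over a document can drive the number of stages.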

Create and Load Your Pipelines

You must create and load any pipelines to define your content processing. Pipelines are XML documents that describe the conditions, actions, and states for your content processing application. For details on how to create and load pipelines, see Understanding and Using Pipelines.

Create and Load Your Modules

You must develop XQuery modules for any conditions or actions referenced in your pipelines. The XQuery modules are where the work of your content processing occurs. For the rules about condition and action modules, see Developing Modules to Process Content.

You should store your modules in the modules database and document root configured in the evaluation context for the domain to which the pipeline is attached.

Design Patterns For Content Applications

The Content Processing Framework uses post-commit triggers to move content processing from one stage to another. Consequently, applications with complex sets of pipelines process certain things asynchronously. When you are designing applications with the Content Processing Framework, it is important to think through the consequences of this.

The following example scenarios show how asynchronous processing can arise in applications:

Suppose you update documents B and C while you are processing document A. Because you cannot guarantee which update will finish first, or which trigger action will execute first, document B might end up being processed before document C, or it might end up being processed after document C.
Suppose documents B and C are both being processed, and each process ends up changing a property on document A to a different value. After both transactions complete, the value of the property on document A will depend on whether document B or document C committed its processing first. Also, each update to document A will trigger an update action on document A.
Suppose a delete of document A triggers an action that cleans up links to document A. Before the action transaction occurs, however, suppose document A is created again, triggering a different action. In this case, it would be a good idea for the delete action to check that the document still does not exist before cleaning up links.

These scenarios are not necessarily bad, but they can cause unexpected behavior if you do not properly understand them at application design time. Your applications either need to avoid these types of scenarios or they need to be designed to handle them.
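
The second and third scenarios can be modelled in a few lines. The following Python sketch is an illustration only, not MarkLogic code: the function names and the in-memory sets and dictionaries are hypothetical stand-ins for a database. It shows that the final property value depends on commit order, and how a deferred delete action might check again for the document before cleaning up links.

```python
from typing import Dict, List, Set, Tuple


def apply_in_commit_order(commits: List[Tuple[str, str]]) -> str:
    """Return the final property value after applying updates in commit order.

    Each commit is (source document, new value); a later commit overwrites
    an earlier one, so the result depends entirely on the ordering.
    """
    value = ""
    for _source, new_value in commits:
        value = new_value
    return value


def cleanup_links_after_delete(
    uri: str, documents: Set[str], links: Dict[str, Set[str]]
) -> bool:
    """Deferred delete action: remove links to uri only if it is still absent.

    Returns True if links were cleaned up, or False if the document was
    created again before this action ran, in which case links are left alone.
    """
    if uri in documents:
        return False
    for targets in links.values():
        targets.discard(uri)
    return True
```

For example, applying the hypothetical commits [("B", "red"), ("C", "blue")] leaves the property set to "blue", while the reverse order leaves it set to "red". An application that cares about the outcome needs a rule that does not depend on ordering.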

Microsoft Office 2007 and Later Documents

Microsoft Office versions 2007 and later use a zip format called Office Open XML (OOXML) to package documents, and inside the zip file is XML content. MarkLogic Server includes the ability to zip and unzip documents directly from XQuery. The zip APIs are xdmp:zip-create, xdmp:zip-get, and xdmp:zip-manifest. You can use these APIs to write applications that use OOXML content in a MarkLogic Server database.
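
To make the package layout concrete, the following Python sketch builds a tiny zip in memory, lists its parts, and reads one part back. It is an illustration under stated assumptions, not MarkLogic code: it uses the Python standard library rather than the xdmp zip functions, the helper names are hypothetical, and the part contents are simplified placeholders rather than valid OOXML. Listing the parts is roughly analogous to reading a manifest, and reading one part is roughly analogous to getting a single entry from the zip.

```python
import io
import zipfile
from typing import List


def build_sample_package() -> bytes:
    """Build a small in-memory zip laid out like a .docx package.

    The part names follow the usual .docx layout; the contents are
    placeholders and would not open in Microsoft Word.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document>Hello</document>")
    return buf.getvalue()


def list_parts(package_bytes: bytes) -> List[str]:
    """Return the names of the parts in the package, in stored order."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def get_part(package_bytes: bytes, name: str) -> str:
    """Return one part of the package decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(name).decode("utf-8")
```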

There is a pipeline installed with CPF called the Office OpenXML Extract pipeline. This pipeline unzips and extracts documents with a .docx file extension and then saves the extracted XML documents in the database.

There is another pipeline called WordprocessingML Process which takes the document.xml part of the extracted *.docx documents and does some processing on it to make it more searchable. The document.xml part of the extracted OOXML document sometimes breaks words into multiple elements, and this pipeline and its associated actions put the broken words back together, which makes them easier to search.
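
The following Python sketch illustrates the broken-word problem. It is a simplified illustration, not the logic of the WordprocessingML Process pipeline: the sample markup is hypothetical, and the function only concatenates the text of each paragraph instead of rewriting the XML. In the sample, the word MarkLogic is split across two run elements, so a search of any single text element would not find it.

```python
import xml.etree.ElementTree as ET
from typing import List

# WordprocessingML main namespace used by document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment in which one word is split across two runs.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p>'
    '<w:r><w:t>Mark</w:t></w:r>'
    '<w:r><w:t>Logic</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> Server</w:t></w:r>'
    '</w:p>'
    '</w:body></w:document>'
)


def paragraph_texts(document_xml: str) -> List[str]:
    """Return the text of each paragraph with its runs joined together."""
    root = ET.fromstring(document_xml)
    texts = []
    for paragraph in root.iter(f"{{{W}}}p"):
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W}}}t")]
        texts.append("".join(pieces))
    return texts
```

Calling paragraph_texts(SAMPLE) returns a single string, "MarkLogic Server", in which the split word is whole again.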

You can create custom pipelines, similar to the OpenXML pipelines described above, to process other OOXML documents. Because OOXML documents are already XML documents, you can do many things with them.

Other CPF Applications Included With MarkLogic Server

In addition to the conversion application (see The Default Conversion Option), there are several other CPF applications that ship with MarkLogic Server. The following are the pipelines for these applications:

Alerting (for alerting applications, as described in Creating Alerting Applications in the Search Developer's Guide)
XInclude Processing (for modular documents, as described in Reusing Content With Modular Document Applications in the Application Developer's Guide)
Entity Enrichment (for finding entities in a document and enriching the XML with markup, as described in Entity Extraction and Enrichment in the Search Developer's Guide)

These applications are all designed to be used together, if you desire. To use these applications together, simply attach any of these pipelines that you want to run (along with the Status Change Handling pipeline) to a domain. They will execute in the following order: conversion, entity enrichment, modular documents, and finally alerting.

Additionally, there are some sample applications that use CPF. The sample applications are for demonstration purposes only, and are not designed to be put into production; see the samples-license.txt file in the <marklogic-dir>/Samples directory for more information.

MarkLogic Server 11.0 Product Documentation
Content Processing Framework Guide — Chapter 6

Using the Framework to Create Custom Applications