This chapter describes ways to use the MarkLogic Server Content Processing Framework to create custom applications that run your own modules.
You must install the Content Processing Framework in each database that contains documents you want to process. This section describes the procedure for installing the Status Change Handling Pipeline in a database, which enables you to create applications using the Content Processing Framework.
If you are using the Default Conversion Option, follow the installation steps in The Default Conversion Option instead of the steps in this section. The installation process for the Default Conversion Option includes the installation of the Status Change Handling Pipeline, so the following procedure is not needed.
Perform the following steps to install the Content Processing Framework in a database:
Select a triggers database to use with your database (for example, Triggers). You can use any database for the triggers database. It can be the same database as the one you are configuring (for example, you can set the Documents database as the triggers database for the Documents database) or it can be a different database (for example, the Triggers database created as part of the installation process).
Select false for the enable conversion option and click Install. If you have purchased the Default Conversion Option and select true for the enable conversion option, the default conversion pipelines and domain are installed in addition to the Content Processing Framework. If you have not purchased the Default Conversion Option, the enable conversion option is greyed out.
/ and to use modules that are in the database defined as the modules database whose URI begins with a /. For details on setting the domain values, see Creating and Modifying Domains.
The Content Processing Framework is now installed for the database.
The Content Processing Framework makes it easy to set up content processing applications that have multiple processing stages. The number of stages you need depends on the processing you need to do. Also, the number of passes you need to make through a document might contribute to the number of stages needed in your application because each stage can only result in a single update transaction to the document (see Action Modules Must Be a Single Transaction).
Pipeline actions are typically called based on a condition, so you need to decide on the logic for your conditions as well as your actions.
You must create and load any pipelines to define your content processing. Pipelines are XML documents that describe the conditions, actions, and states for your content processing application. For details on how to create and load pipelines, see Understanding and Using Pipelines.
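As a rough illustration of how conditions, actions, and states fit together, the following Python sketch models a pipeline as a list of state transitions. It is not CPF code: real pipelines are XML documents and real conditions and actions are XQuery modules. The `Transition` class, the state names, and the two sample actions are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a document together with its processing state.
Doc = Dict[str, str]


@dataclass
class Transition:
    on_state: str                     # state that makes this transition eligible
    condition: Callable[[Doc], bool]  # decides whether the action should run
    action: Callable[[Doc], None]     # performs one update to the document
    on_success: str                   # state to move to after the action runs


def run_pipeline(doc: Doc, transitions: List[Transition]) -> List[str]:
    """Apply transitions until none matches the document's current state."""
    visited = [doc["state"]]
    progressed = True
    while progressed:
        progressed = False
        for t in transitions:
            if t.on_state == doc["state"] and t.condition(doc):
                t.action(doc)
                doc["state"] = t.on_success
                visited.append(doc["state"])
                progressed = True
                break
    return visited


def normalize(doc: Doc) -> None:
    # Stage 1: collapse runs of whitespace.
    doc["text"] = " ".join(doc["text"].split())


def tag_length(doc: Doc) -> None:
    # Stage 2: record the length of the normalized text.
    doc["length"] = str(len(doc["text"]))


transitions = [
    Transition("initial", lambda d: "text" in d, normalize, "normalized"),
    Transition("normalized", lambda d: True, tag_length, "done"),
]

doc = {"state": "initial", "text": "  hello   CPF  world "}
states = run_pipeline(doc, transitions)
print(states)
```

Each action here makes exactly one change to the document, which mirrors the reason a task that needs several passes over a document is usually split into several stages.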
You must develop XQuery modules for any conditions or actions referenced in your pipelines. The XQuery modules are where the work of your content processing occurs. For the rules about condition and action modules, see Developing Modules to Process Content.
You should store your modules in the modules database and document root configured in the evaluation context for the domain to which the pipeline is attached.
The Content Processing Framework uses post-commit triggers to move content processing from one stage to another. Consequently, applications with complex sets of pipelines process certain things asynchronously. When you are designing applications with the Content Processing Framework, it is important to think through the consequences of this.
The following are example scenarios which can cause asynchronous processing to occur in applications:
These scenarios are not necessarily bad, but they can cause unexpected behavior if you do not properly understand them at application design time. Your applications either need to avoid these types of scenarios or they need to be designed to handle them.
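To make the timing issue concrete, here is a deliberately simplified Python model of deferred, post-commit processing. It is only a sketch under stated assumptions: the `TinyStore` class, its queue, and the state names are hypothetical and do not reflect how MarkLogic Server schedules trigger work internally.

```python
from collections import deque


class TinyStore:
    """Hypothetical in-memory store whose post-commit work runs later."""

    def __init__(self):
        self.docs = {}
        self.pending = deque()  # queued trigger tasks, one per commit

    def commit(self, uri, content):
        # The update commits first; the follow-up processing is only queued.
        self.docs[uri] = {"content": content, "state": "initial"}
        self.pending.append(uri)

    def run_pending(self):
        # Each queued task re-checks the document state before acting,
        # because the document may have changed since the task was queued.
        while self.pending:
            uri = self.pending.popleft()
            doc = self.docs.get(uri)
            if doc is not None and doc["state"] == "initial":
                doc["content"] = doc["content"].upper()
                doc["state"] = "processed"


store = TinyStore()
store.commit("/a.xml", "draft")
state_right_after_commit = store.docs["/a.xml"]["state"]
store.commit("/a.xml", "final")  # a second update arrives before processing
store.run_pending()
print(state_right_after_commit, store.docs["/a.xml"])
```

In this model a reader that looks at the document immediately after the commit still sees the unprocessed state, and the second queued task finds nothing left to do because it checks the state first. An application design needs to tolerate both effects.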
Microsoft Office 2007 and later package documents in a zip-based format called Office Open XML (OOXML); the parts inside the zip file are XML. MarkLogic Server includes the ability to zip and unzip documents directly from XQuery. The zip APIs are xdmp:zip-create, xdmp:zip-get, and xdmp:zip-manifest. You can use these APIs to write applications that work with OOXML content in a MarkLogic Server database.
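The package structure can be explored outside the server as well. The following Python sketch uses only the standard zipfile module (not the xdmp:zip-* functions) to build a tiny zip in memory, list its parts, and read one part. The sample package is a hypothetical stand-in with placeholder XML, not a valid Word document.

```python
import io
import zipfile


def build_sample_package() -> bytes:
    """Create a small in-memory zip laid out like an OOXML package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document><body>Hello</body></document>")
    return buf.getvalue()


def list_parts(package: bytes) -> list:
    """Return the names of the parts in the package, in stored order."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


def read_part(package: bytes, name: str) -> str:
    """Extract a single part and decode it as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read(name).decode("utf-8")


package = build_sample_package()
parts = list_parts(package)
body = read_part(package, "word/document.xml")
print(parts)
print(body)
```

Listing the parts and reading a single part correspond loosely to what a manifest call and a get call do for a zip stored in the database.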
There is a pipeline installed with CPF called the Office OpenXML Extract pipeline. This pipeline unzips and extracts documents with a .docx file extension and then saves the extracted XML documents in the database.
There is another pipeline called WordprocessingML Process, which takes the document.xml part of the extracted .docx documents and processes it to make it more searchable. The document.xml part of an extracted OOXML document sometimes breaks words across multiple elements. This pipeline and its associated actions put the broken words back together, which makes them easier to search.
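The split-word problem is easy to see in a small example. The Python sketch below is not the pipeline's implementation: it only extracts the text of each paragraph by concatenating its text elements, whereas the pipeline rewrites the markup itself. The sample XML is hypothetical; the namespace URI is the standard WordprocessingML main namespace.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical fragment in which "MarkLogic" is split across two runs.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>Mark</w:t></w:r><w:r><w:t>Logic</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> Server</w:t></w:r></w:p>'
    "<w:p><w:r><w:t>CPF</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def paragraph_texts(document_xml: str) -> list:
    """Join the text of all w:t elements within each w:p paragraph."""
    root = ET.fromstring(document_xml)
    texts = []
    for para in root.iterfind(".//w:p", NS):
        texts.append("".join(t.text or "" for t in para.iterfind(".//w:t", NS)))
    return texts


paragraphs = paragraph_texts(SAMPLE)
print(paragraphs)
```

A search for the word "MarkLogic" would miss the first paragraph if each run were indexed separately, which is the kind of problem the reassembly step addresses.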
You can create custom pipelines that process other kinds of OOXML documents in a similar way to the OpenXML pipelines described above. Because the parts of an OOXML document are already XML, you can do many things with them.
In addition to the conversion application (see The Default Conversion Option), there are several other CPF applications that ship with MarkLogic Server. The following are the pipelines for these applications:
Alerting (for alerting applications, as described in Creating Alerting Applications in the Search Developer's Guide)
XInclude Processing (for modular documents, as described in Reusing Content With Modular Document Applications in the Application Developer's Guide)
Entity Enrichment (for finding entities in a document and enriching the XML with markup, as described in Entity Extraction and Enrichment in the Search Developer's Guide)
These applications are designed to work together. To use them together, attach the pipelines that you want to run (along with the Status Change Handling pipeline) to a domain. They execute in the following order: conversion, entity enrichment, modular documents, and finally alerting.
Additionally, there are some sample applications that use CPF. The sample applications are for demonstration purposes only, and are not designed to be put into production; see the samples-license.txt file in the <marklogic-dir>/Samples directory for more information.