You must install the Content Processing Framework in each database that contains documents you want to process. This section describes the procedure for installing the Status Change Handling Pipeline in a database, which enables you to create applications using the Content Processing Framework.
If you are using the Default Conversion Option, follow the installation steps in The Default Conversion Option instead of the steps in this section. The installation process for the Default Conversion Option includes the installation of the Status Change Handling Pipeline, so the following procedure is not needed.
Select a triggers database to use with your database (for example, Triggers). You can use any database as the triggers database: either the database you are configuring (for example, you can set the Documents database as the triggers database for the Documents database) or a different database (for example, the Triggers database created as part of the installation process).
Set the enable conversion option and click Install.
If you have purchased the Default Conversion Option and select true for the enable conversion option, the default conversion pipelines and domain will be installed (in addition to the Content Processing Framework). If you have not purchased the Default Conversion Option, the enable conversion option will be greyed out.
/ and to use modules that are in the database defined as the modules database whose URI begins with a /. For details on setting the domain values, see Creating and Modifying Domains.
The Content Processing Framework makes it easy to set up content processing applications that have multiple processing stages. The number of stages you need depends on the processing you need to do. Also, the number of passes you need to make through a document might contribute to the number of stages needed in your application because each stage can only result in a single update transaction to the document (see Action Modules Must Be a Single Transaction).
You must create and load the pipelines that define your content processing. Pipelines are XML documents that describe the conditions, actions, and states for your content processing application. For details on how to create and load pipelines, see Understanding and Using Pipelines.
You must develop XQuery modules for any conditions or actions referenced in your pipelines. The XQuery modules are where the work of your content processing occurs. For the rules about condition and action modules, see Developing Modules to Process Content.
The Content Processing Framework uses post-commit triggers to move content processing from one stage to another. Consequently, applications with complex sets of pipelines perform some processing asynchronously. When you design applications with the Content Processing Framework, think through the consequences of this asynchronous processing.
These scenarios are not necessarily bad, but they can cause unexpected behavior if you do not properly understand them at application design time. Your applications either need to avoid these types of scenarios or they need to be designed to handle them.
Microsoft Office 2007 uses a zip format to package up documents, and inside the zip file is XML content. MarkLogic Server includes the ability to zip and unzip documents directly from XQuery. The zip APIs are xdmp:zip-create, xdmp:zip-get, and xdmp:zip-manifest. You can use these APIs to write applications that use Word 2007 (or other Office 2007) content in a MarkLogic Server database.
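To make the packaging concrete, here is a minimal sketch, written in Python with only the standard library, of the idea that an Office 2007 document is a zip archive whose parts are XML. It does not use the MarkLogic zip APIs. The helper names (build_package, list_parts, read_part) are hypothetical, and the package it builds is a one-part stand-in, not a complete Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:document and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative stand-in for the main part of a Word 2007 package.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:t>Hello</w:t></w:r>"
    "</w:p></w:body></w:document>"
)


def build_package(parts):
    """Zip a mapping of part name -> XML text into an in-memory package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, xml_text in parts.items():
            package.writestr(name, xml_text)
    return buffer.getvalue()


def list_parts(package_bytes):
    """Return the names of the parts stored in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def read_part(package_bytes, name):
    """Extract one part and parse it as XML."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return ET.fromstring(package.read(name))


package_bytes = build_package({"word/document.xml": DOCUMENT_XML})
print(list_parts(package_bytes))
print(read_part(package_bytes, "word/document.xml").tag)
```

The three helpers correspond loosely to creating a zip, listing its contents, and extracting one part. Inside MarkLogic Server you would do this work with the xdmp zip functions named above instead.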
There is a pipeline installed with CPF called the Office OpenXML Extract pipeline. This pipeline unzips and extracts Word 2007 documents (with a .docx file extension) and then saves the extracted XML documents in the database.
There is another pipeline called WordprocessingML Process, which takes the document.xml part of the extracted Word 2007 documents and processes it to make it more searchable. The document.xml part of an extracted Word 2007 document sometimes breaks words into multiple elements, and this pipeline and its associated actions put the broken words back together, which makes them easier to search.
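The following Python sketch illustrates the broken-word problem. It is not the pipeline's actual implementation. It assumes a paragraph in which the text "MarkLogic Server" is split across three runs (w:r elements), and it rejoins the text by collapsing the runs into one. The function names are hypothetical, and the merge ignores run formatting.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# A paragraph whose text is split across three runs, as Word sometimes does.
SPLIT_PARAGRAPH = (
    f'<w:p xmlns:w="{W_NS}">'
    "<w:r><w:t>Mark</w:t></w:r>"
    "<w:r><w:t>Logic</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> Server</w:t></w:r>'
    "</w:p>"
)


def paragraph_text(paragraph):
    """Concatenate the text of every w:t element in the paragraph's runs."""
    return "".join(t.text or "" for t in paragraph.findall("w:r/w:t", NS))


def merge_runs(paragraph):
    """Replace all runs with a single run holding the joined text.

    Simplification: run formatting is discarded, so this only suits cases
    where searchable text matters more than presentation.
    """
    text = paragraph_text(paragraph)
    for run in paragraph.findall("w:r", NS):
        paragraph.remove(run)
    merged_run = ET.SubElement(paragraph, f"{{{W_NS}}}r")
    ET.SubElement(merged_run, f"{{{W_NS}}}t").text = text
    return paragraph


paragraph = ET.fromstring(SPLIT_PARAGRAPH)
print(paragraph_text(paragraph))  # text is intact only when the runs are read together
merge_runs(paragraph)
print(len(paragraph.findall("w:r", NS)))  # one run after merging
```

Before the merge, a search for the element containing "MarkLogic" finds nothing, because no single w:t element holds the whole word. After the merge, one element does.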
You can create custom pipelines that process other Office 2007 documents and perform actions similar to those of the OpenXML pipelines. Because Office 2007 documents are already XML documents, you can do many things with them.
In addition to the conversion application (see The Default Conversion Option), there are several other CPF applications that ship with MarkLogic Server. The following are the pipelines for these applications:
Alerting (for alerting applications, as described in Creating Alerting Applications in the Search Developer's Guide)
XInclude Processing (for modular documents, as described in Reusing Content With Modular Document Applications in the Application Developer's Guide)
Entity Enrichment (for finding entities in a document and enriching the XML with markup, as described in Marking Up Documents With Entity Enrichment in the Search Developer's Guide)
These applications are all designed to be used together, if you desire. To use these applications together, attach whichever of these pipelines you want to run (along with the Status Change Handling pipeline) to a domain. They will execute in the following order: conversion, entity enrichment, modular documents, and finally alerting.
Additionally, there are some sample applications that use CPF. The sample applications are for demonstration purposes only and are not designed to be put into production; see the samples-license.txt file in the >/Samples directory for more information.