Content Processing Framework Guide — Chapter 1

Overview of the Content Processing Framework

Content processing applications often require multi-step processing, in which each step performs a particular task or set of tasks. The Content Processing Framework in MarkLogic Server supports these multi-step processes. This chapter provides an overview of the framework and includes the following topics:

Making Content More Useful
Access Internal and External Web Services
Components of the Content Processing Framework
Default Conversion Option

For instructions on installing the Default Conversion Option, which converts HTML, Adobe PDF, and Microsoft Office documents to XML, see Installing the Conversion Pipelines and Framework.

Making Content More Useful

Content is information stored in a variety of forms. It could be stored on paper, in various proprietary electronic formats, or in XML format. For many businesses, content is their most valuable asset. In many cases, however, it is very difficult to extract information from that content, for a variety of reasons, including:

The content is difficult to access.
It is difficult to compare with other content.
You might not know the content exists.

In short, it is difficult to get the most use out of the content that already exists. Content processing is a way to programmatically add value to your content by helping to alleviate some of these difficulties.

The overall goal of content processing is to make your content more useful. Sometimes that means adding value to it by combining it with other content, and sometimes it means making it more accessible to more people. By moving the content through a series of content processing steps, you can add value to it at each step of the process.

Getting Your Content Into XML Format

In order to efficiently add value to your content, the first step is to get the content into a marked up XML format. In many cases the content is already in XML format, but sometimes it needs to undergo a conversion process to become XML. For example, if your content is in HTML, Microsoft Word, or Adobe PDF format, you can convert it to XML using the Default Conversion Option, as described in The Default Conversion Option.

Striving For Clean, Well-Structured XML

One of the important requirements for building rich content applications is to have content that is in a well-structured XML format and is free from errors. Having well-structured XML allows you to easily build applications that can combine content from many documents to produce new and unique documents that are based on some dynamic criteria. In order to get well-structured XML, however, you typically have to perform some processing on the source content to transform it to XML, and then transform the XML to the structure you need.

The requirements for content conversion and transformation are often very specific to an organization, and can require a custom solution. MarkLogic Server allows you to build custom pipelines that exactly match your specific requirements. The Content Processing Framework provides the tools needed to build your own custom pipeline.

Enriching Content With Semantic Tagging, Metadata, etc.

One way to make your content more useful is to add information to it. For example, you can add semantic tagging, which identifies and tags meaningful concepts within the content. The markup from the semantic tagging creates a more uniform target for search and retrieval applications.

There are many web services designed to add semantic tagging to XML documents. For example, a web service might perform entity extraction on an XML document, identifying the people, places, and other entities mentioned in its text. You can write custom applications to add this type of functionality, or you can use one of many third-party products and services, often available as web services, to supplement your existing content.

You can also add metadata to the content. Metadata is any kind of information about the document. It can be simple, such as the date the document was created, or it can be more involved, such as information on how the document came into being. For example, MarkLogic Server includes options to maintain metadata (such as the last modified date) on documents, and the Default Conversion Option maintains metadata for all of the content it processes.

The Content Processing Framework makes it easy to add these services as steps in your content processing application.

Access Internal and External Web Services

You can access web services, both within an intranet and anywhere across the internet, with the XQuery-level HTTP functions built into MarkLogic Server. The HTTP functions allow you to perform HTTP operations such as GET, PUT, POST, and DELETE. Because you call these functions directly from XQuery, you can post content to or get content from any HTTP server, including web services. The web services that you communicate with can perform external processing on your content, such as entity extraction, language translation, or some other custom processing. Combined with the conversion and HTML Tidy functions, the HTTP functions make it very easy to process, within MarkLogic Server, any content you can reach on the web.

The XQuery-level HTTP functions can also be used directly with xdmp:document-load, xdmp:document-get, and all of the conversion functions. You can then, for example, take content retrieved via HTTP from the web and clean it up with HTML Tidy (xdmp:tidy), load it into the database, or do anything else you need to do with content available via HTTP.

MarkLogic Server requires all data to be in UTF-8 format. Therefore, any web services or HTTP services you access must return data in UTF-8 format; otherwise, an error occurs.
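
The following is a minimal sketch of the fetch, tidy, and load flow described above, not a definitive implementation. The URL and the target document URI are hypothetical, and the sketch assumes the remote server returns UTF-8 HTML that MarkLogic Server treats as text.

```xquery
xquery version "1.0-ml";

(: Hypothetical source URL and target database URI; substitute your own. :)
let $url := "http://www.example.com/index.html"
let $target-uri := "/web/example-index.xhtml"

(: xdmp:http-get returns a response element followed by the response body. :)
let $result := xdmp:http-get($url)
let $response := $result[1]
let $body := $result[2]

return
  if (xs:integer($response/*:code) eq 200)
  then
    (: xdmp:tidy takes the HTML as a string and returns a status element
       followed by the cleaned-up XHTML; [2] selects the XHTML. :)
    let $xhtml := xdmp:tidy(fn:string($body))[2]
    return xdmp:document-insert($target-uri, $xhtml)
  else
    fn:error(xs:QName("HTTP-FETCH-FAILED"),
      fn:concat("GET ", $url, " returned ", fn:string($response/*:code)))
```

Checking the response code before processing keeps an error page from being tidied and stored as if it were the requested content.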

Components of the Content Processing Framework

This section describes the components of the Content Processing Framework and includes the following sections:

Domains
Pipelines
XQuery Functions and Modules
Pre-Commit and Post-Commit Triggers
Creating Custom Applications With the Content Processing Framework

Domains

A domain defines a scope of documents to process. With domains, you can organize your content so that some documents are processed in one way and others are processed in another way. Domains provide flexibility in your content processing applications, making it easy to demarcate sets of documents to which you want to apply the same processing. For more details about domains, see Understanding and Using Domains.
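
As an illustration, a domain covering every document under one directory might be created as sketched below. The domain name, description, and directory are hypothetical. The sketch assumes that the Status Change Handling pipeline is already installed and that the query is evaluated against the triggers database.

```xquery
xquery version "1.0-ml";

import module namespace dom = "http://marklogic.com/cpf/domains"
  at "/MarkLogic/cpf/domains.xqy";
import module namespace p = "http://marklogic.com/cpf/pipelines"
  at "/MarkLogic/cpf/pipelines.xqy";

(: Hypothetical domain: all documents under /press-releases/, at any depth.
   Run this against the triggers database of the content database. :)
dom:create(
  "Press Releases",
  "Documents under /press-releases/",
  dom:domain-scope("directory", "/press-releases/", "infinity"),
  (: Modules for this domain are resolved from the root of the Modules database. :)
  dom:evaluation-context(xdmp:database("Modules"), "/"),
  (: Attach an already-installed pipeline by its ID (assumed to exist). :)
  p:get("Status Change Handling")/p:pipeline-id,
  (: No additional permissions. :)
  ()
)
```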

Pipelines

A pipeline is an XML document that describes a set of content processing steps. A pipeline defines the steps that occur during the processing of documents and defines actions that occur at each step. After each step, the document being processed is committed to the database, and the Content Processing Framework then catches this document change event with a trigger, which in turn can execute some more processing. This process continues for as many steps as you have defined in the pipeline. There are XQuery functions to help you define and install pipelines. For more details on pipelines, see Understanding and Using Pipelines.
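
To make the shape of a pipeline document concrete, here is a hedged sketch of a one-step pipeline and its installation. The pipeline name and the action module /example/enrich.xqy are hypothetical; the state URIs and the success and failure action modules are assumed to be the ones shipped with the framework. The query is assumed to run against the triggers database.

```xquery
xquery version "1.0-ml";

import module namespace p = "http://marklogic.com/cpf/pipelines"
  at "/MarkLogic/cpf/pipelines.xqy";

(: One state transition: a document in the "initial" state runs the
   (hypothetical) enrich.xqy action, then moves to "done" on success
   or to "error" on failure. :)
p:insert(
  <pipeline xmlns="http://marklogic.com/cpf/pipelines">
    <pipeline-name>Example Enrichment</pipeline-name>
    <pipeline-description>One-step pipeline sketch</pipeline-description>
    <success-action>
      <module>/MarkLogic/cpf/actions/success-action.xqy</module>
    </success-action>
    <failure-action>
      <module>/MarkLogic/cpf/actions/failure-action.xqy</module>
    </failure-action>
    <state-transition>
      <annotation>Enrich newly created documents.</annotation>
      <state>http://marklogic.com/states/initial</state>
      <on-success>http://marklogic.com/states/done</on-success>
      <on-failure>http://marklogic.com/states/error</on-failure>
      <execute>
        <action>
          <module>/example/enrich.xqy</module>
        </action>
      </execute>
    </state-transition>
  </pipeline>
)
```

A multi-step pipeline adds further state-transition elements, each starting from the state that the previous step moves the document to on success.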

XQuery Functions and Modules

As part of the Content Processing Framework, MarkLogic Server includes many XQuery functions and supporting XQuery modules. Built-in functions, including the HTTP functions and xdmp:tidy described earlier in this chapter, support document conversion and other applications you can build with the Content Processing Framework.

Additionally, there are XQuery modules to support the Content Processing Framework. Each module has XQuery functions designed for use in content processing applications.

The following modules support the Content Processing Framework:

Triggers
Content Processing Framework
Links
Domains
Pipelines

These XQuery modules include the XQuery source code, so you can analyze them and use their functions in your own applications. The XQuery modules are installed into the following directory:

<install_dir>/Modules/MarkLogic

For details on these functions, see the MarkLogic XQuery and XSLT Function Reference.

The convert functions and the conversion XQuery modules require the Default Conversion Option to run.

Pre-Commit and Post-Commit Triggers

MarkLogic Server and the Content Processing Framework use triggers to automate processes that are described by a pipeline. Triggers allow you to capture document events (create, update, delete, or property change) and system events (database online), and then perform some tasks after the event occurs. The tasks you perform are defined by an XQuery module, and can therefore be anything you can do in XQuery. The triggers capture the document change events for documents under a domain. Some of the triggers used by the Content Processing Framework are pre-commit triggers (that is, they execute before the transaction completes) and some are post-commit triggers (that is, they execute after the transaction commits). When you construct pipelines for processing your content, triggers automate the transitions between stages of the pipeline. For more details about triggers, see the chapter on triggers in the Application Developer's Guide.
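
The Content Processing Framework manages its own triggers, so you do not normally create them by hand. As a standalone illustration of the mechanism, the sketch below creates a post-commit trigger on document creation. The trigger name, the /incoming/ directory, and the module log-create.xqy are hypothetical, and the query is assumed to run against the triggers database.

```xquery
xquery version "1.0-ml";

import module namespace trgr = "http://marklogic.com/xdmp/triggers"
  at "/MarkLogic/triggers.xqy";

(: Hypothetical trigger: after a document is created anywhere under
   /incoming/, run /example/log-create.xqy from the Modules database. :)
trgr:create-trigger(
  "log-new-documents",
  "Runs a module after a document is created under /incoming/",
  trgr:trigger-data-event(
    trgr:directory-scope("/incoming/", "infinity"),
    trgr:document-content("create"),
    (: Post-commit: the module runs after the creating transaction commits. :)
    trgr:post-commit()),
  trgr:trigger-module(xdmp:database("Modules"), "/example/", "log-create.xqy"),
  fn:true(),                   (: the trigger is enabled :)
  xdmp:default-permissions()
)
```

Replacing trgr:post-commit() with trgr:pre-commit() would make the module run as part of the original transaction instead.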

Creating Custom Applications With the Content Processing Framework

The Content Processing Framework is designed for creating your own content processing applications, with your own content processing code, and with pipelines that follow your own logical and business processes.

For details on creating custom applications with the Content Processing Framework, see Using the Framework to Create Custom Applications.

Default Conversion Option

The Default Conversion Option includes features to perform the following types of document conversion:

Microsoft Office and Adobe PDF Conversion
HTML Conversion and Enrichment

For details on the Default Conversion Option, see The Default Conversion Option.

Microsoft Office and Adobe PDF Conversion

Included with the Default Conversion Option is the ability to convert Microsoft Word, Excel, and PowerPoint documents, as well as Adobe PDF documents, to XML. You can access this functionality directly through XQuery. The Default Conversion Option, which is written in XQuery, uses the conversion functions to do the initial conversion to XML from Adobe PDF and Microsoft Office documents.

The Default Conversion Option requires Microsoft Office documents to be in Office 97 to Office 2003 format; it does not convert documents in Office 95 or older formats. To convert documents written in Microsoft Office 2007 or later format (Office Open XML), follow the steps in Microsoft Office 2007 and Later Documents.
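
Calling a conversion function directly might look like the sketch below. It assumes that the Default Conversion Option is installed and that a PDF has already been loaded as a binary document at the hypothetical URI /docs/report.pdf.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI of a PDF already stored as a binary document. :)
let $source-uri := "/docs/report.pdf"

(: xdmp:pdf-convert takes the binary node and a file name, and returns
   several nodes: a manifest of the generated parts first, then the parts. :)
let $results := xdmp:pdf-convert(fn:doc($source-uri), "report.pdf")

(: Return only the manifest, to see which parts were produced. :)
return $results[1]
```

A fuller application would walk the manifest and insert each returned part into the database under a URI of its choosing.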

HTML Conversion and Enrichment

HTML documents contain HTML markup, but are not required to be well-formed and are not XML. XHTML is an XML version of HTML that requires well-formed tags, is case-sensitive for element and attribute names, and is stricter in its structural requirements than HTML.

MarkLogic Server includes an implementation of HTML Tidy, an open source project (http://tidy.sourceforge.net/) that allows you to take any HTML document and convert it to well-formed XHTML. The implementation provides XQuery-level access to Tidy, with the processing done natively within MarkLogic Server.

In addition to simply converting the HTML documents to XHTML, the XHTML is significantly enriched, improving its structure. The enriched structure makes it easier to create robust applications with the content.
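
A minimal sketch of the basic HTML-to-XHTML step, using xdmp:tidy on a deliberately malformed string, is shown below. The input markup is invented for illustration, and the sketch covers only the Tidy conversion, not the additional enrichment performed by the Default Conversion Option.

```xquery
xquery version "1.0-ml";

(: Invented sample: the p element is never closed and br is not self-closed. :)
let $html := "<html><body><p>Unclosed paragraph<br>Second line</body></html>"

(: xdmp:tidy returns a status element followed by the XHTML result. :)
let $result := xdmp:tidy($html)

return (
  $result[1],   (: status information reported by Tidy :)
  $result[2]    (: the well-formed XHTML :)
)
```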
