Loading Content Into MarkLogic Server — Chapter 9

Modifying Content During Loading

Content can go through many stages before it is ready to use in an application. These stages might include modifying the content so that it is well-formed XML, transforming one XML structure to another, or combining the content with other content or information. The process of content going from one stage to another is called content processing.

Content processing can be very simple or extremely complex. You might decide to add a timestamp to a document and define a content processing stage to add the timestamp. You might have a process that translates the text from one language to another. Often, many of these stages combine to form the overall set of content processing work that a document requires.
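
The idea of chaining stages can be sketched in a few lines of Python. This is only an illustration of the concept, not the MarkLogic Content Processing Framework API. The dictionary-based document model and the names add_timestamp, normalize_whitespace, and run_stages are assumptions made for this example.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Hypothetical document model: a plain dict with "content" and "metadata" keys.
Document = Dict[str, object]
Stage = Callable[[Document], Document]


def add_timestamp(doc: Document) -> Document:
    """Stage that records when the document was processed (UTC, ISO 8601)."""
    metadata = dict(doc.get("metadata", {}))
    metadata["processed-at"] = datetime.now(timezone.utc).isoformat()
    return {**doc, "metadata": metadata}


def normalize_whitespace(doc: Document) -> Document:
    """Stage that collapses runs of whitespace in the content."""
    content = " ".join(str(doc.get("content", "")).split())
    return {**doc, "content": content}


def run_stages(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order; the output of one is the input of the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    result = run_stages(
        {"content": "Hello   world\n"},
        [normalize_whitespace, add_timestamp],
    )
    print(result["content"])
    print(result["metadata"]["processed-at"])
```

Because each stage takes a document and returns a new one, stages can be added, removed, or reordered without changing the others, which is the kind of flexibility an evolving application needs.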

While the range of problems that content processing can address is virtually unlimited, a few core capabilities cover most of them:

  • The ability to change the content from one form to another.
  • The ability to tie together different pieces of content processing.
  • The ability to separate different documents for different types of processing.
  • The ability to automate the entire procedure so documents can move through complex processing phases automatically.
  • The ability to integrate manual steps or long-running, asynchronous operations in applications.
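
As a rough illustration of separating documents for different types of processing, the following Python sketch routes a document URI to a handler based on its file extension. The handler names and the extension table are hypothetical and exist only for this example; they do not correspond to MarkLogic functions.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict


# Hypothetical handlers; each returns a label describing what it would do.
def convert_office(uri: str) -> str:
    return f"convert-office:{uri}"


def convert_pdf(uri: str) -> str:
    return f"convert-pdf:{uri}"


def pass_through(uri: str) -> str:
    return f"no-op:{uri}"


# Assumed routing table: file extension -> handler.
ROUTES: Dict[str, Callable[[str], str]] = {
    ".doc": convert_office,
    ".xls": convert_office,
    ".pdf": convert_pdf,
}


def route(uri: str) -> str:
    """Pick a handler by (case-insensitive) extension; default to pass_through."""
    suffix = PurePosixPath(uri).suffix.lower()
    return ROUTES.get(suffix, pass_through)(uri)


if __name__ == "__main__":
    for uri in ["/docs/report.PDF", "/docs/budget.xls", "/docs/notes.txt"]:
        print(route(uri))
```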

Flexibility is important in content processing, as both the starting points of documents and their end results can vary significantly. Also, application requirements can evolve over time, forcing the content processing application to change with the requirements. It is therefore necessary to have a content processing environment that can allow for such change.

MarkLogic Server provides capabilities to modify content with workflows and pipelines. An example of a content processing application is the Default Conversion Option, which uses components of the MarkLogic Content Processing Framework and XQuery modules to create a unified conversion process. That process converts Microsoft Office, Adobe PDF, and HTML files to well-structured XHTML and simplified DocBook format XML documents.

Converting Microsoft Office and Adobe PDF Into XML

The Default Conversion Option of the Content Processing Framework converts Microsoft Office, Adobe PDF, and HTML files to XHTML and DocBook. It converts only documents from Microsoft Office 97 and newer; it cannot convert documents from Microsoft Office 95 or earlier.

Converting to XHTML

MarkLogic provides facilities for converting documents to XHTML as follows:

  • xdmp:tidy converts HTML to XHTML
  • The Default Conversion Option of the Content Processing Framework converts Microsoft Office, PDF, and HTML files to XHTML
  • xdmp:pdf-convert converts a PDF file to XHTML
  • xdmp:excel-convert converts a Microsoft Excel document to XHTML

Automating Metadata Extraction

MarkLogic provides facilities to extract metadata from binary documents and associate it with those documents, as follows:

  • xdmp:document-filter, a built-in XQuery function
  • MarkLogic Content Pump (mlcp), which provides commands to include or exclude metadata during copy, export, and import

Transforming XML Structures

A common XML task is transforming one structure into another. A design pattern that uses the XQuery typeswitch expression to transform XML to XHTML or XSL-FO is described in "Transforming XML Structures With a Recursive typeswitch Expression" in the Application Developer's Guide.
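
The shape of such a recursive transformation can be sketched in Python. This is an analog of the pattern, not the XQuery typeswitch code from the Application Developer's Guide: it dispatches on element names through a lookup table rather than on node types, and the source vocabulary (article, title, para, emph) and the TAG_MAP table are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element names to XHTML element names.
TAG_MAP = {"article": "div", "title": "h1", "para": "p", "emph": "em"}


def transform(node: ET.Element) -> ET.Element:
    """Rebuild the node under its mapped name, then recurse into its children."""
    # Unknown elements fall back to <span> so no text content is lost.
    out = ET.Element(TAG_MAP.get(node.tag, "span"))
    out.text = node.text
    out.tail = node.tail
    for child in node:
        out.append(transform(child))
    return out


if __name__ == "__main__":
    source = ET.fromstring(
        "<article><title>Hi</title>"
        "<para>Some <emph>bold</emph> text.</para></article>"
    )
    print(ET.tostring(transform(source), encoding="unicode"))
    # prints: <div><h1>Hi</h1><p>Some <em>bold</em> text.</p></div>
```

Each call handles one node and delegates its children to the same function, so supporting a new element only requires a new mapping entry rather than a change to the traversal.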
