The Default Conversion Option installation installs the Content Processing Framework for your database, sets up the domain for the pipeline, loads the needed triggers into the triggers database, and performs other pipeline initialization tasks. You need to install the Default Conversion Option for each database in which you plan on using conversion.
triggers database to use with your database (for example, Triggers). You can use any database for the triggers database. It can be the same database as the one you are configuring (for example, you can set the Documents database as the triggers database for the Documents database) or it can be a different database (for example, the Triggers database created as part of the installation process).
enable conversion option and click Install. Make sure enable conversion is set to true. If this is set to false, then you will install only the Content Processing Framework, not the Default Conversion Option.
The Default Conversion Option is now installed for the database. The default domain determines which documents are processed, and by default it has a document scope that applies to any document in the database with a URI starting with a slash (/). You can modify the domain settings if you want the Default Conversion Option to apply to a different set of documents. To modify the domain settings, click the default domain for your database (for example, Default Documents if you chose the Documents database) on the Content Processing Summary pages and make the needed modifications. For details on domains, see Understanding and Using Domains.
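Before changing the domain, it can help to look at its current definition. The following is a minimal sketch, assuming the domain is named Default Documents (the example name used above) and that the query is evaluated against the triggers database for your content database, which is where domain configuration is stored. It uses dom:get from the domains library module to return the domain definition, including its document scope.

```xquery
xquery version "1.0-ml";

import module namespace dom = "http://marklogic.com/cpf/domains"
  at "/MarkLogic/cpf/domains.xqy";

(: Assumption: "Default Documents" is the name of your default domain.
   Evaluate this query against the triggers database, not the content
   database. :)
dom:get("Default Documents")
```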
To try out the pipeline, you need to load some Adobe PDF, Microsoft Office, and/or HTML files into the database. You can load the documents using any method you like. This section describes an easy way to load documents using a WebDAV server and client. You can then use this configuration to test document conversion with the Default Conversion Option. For more information on WebDAV servers, see 'WebDAV Servers' in the Administrator's Guide.
/ that accesses the database in which you installed the Content Processing Pipeline.
/ for the root.
The converted documents, as well as the original documents and any parts generated as part of the conversion, will appear in the WebDAV folder. If you have large documents or if you load many documents into the database, the processing might continue for several minutes or longer.
*_parts) containing various parts generated as part of the conversion process. The parts are typically any images that were in the original document, a cascading style sheet document (conv.css), and a document containing an analysis of the stylesheet (css.xml). PDF documents also include toc.xml, which is an analysis of the table of contents structure.
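If you prefer to check the results from a query rather than from the WebDAV client, you can list the URIs under the WebDAV root. This is a sketch only, assuming the root directory is / as in the steps above. It walks every document under the root, so it is best suited to a small test database.

```xquery
xquery version "1.0-ml";

(: Assumption: "/" is the WebDAV root directory used in the steps above.
   Lists the URI of every document at any depth under that directory,
   which includes originals, converted documents, and generated parts. :)
for $doc in xdmp:directory("/", "infinity")
let $uri := xdmp:node-uri($doc)
order by $uri
return $uri
```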
The Default Conversion Option combines the components of the Content Processing Framework with converters that create XML documents from Microsoft Office and Adobe PDF files. The result is a unified conversion process that converts Microsoft Office, Adobe PDF, and HTML files to well-structured XHTML and simplified DocBook format XML documents. This section provides some background on how the default conversion process works, and includes the following sections:
For details on these functions, see the MarkLogic XQuery and XSLT Function Reference.
The Default Conversion Option only converts Microsoft Office 97 and newer documents; it cannot convert Microsoft Office 95 and earlier documents. If you try to convert Microsoft Word 95 or older documents (or other Microsoft Office 95 documents), the conversion will fail, putting the document in the http://marklogic.com/states/error state. If this happens, you can do the following:
There are other types of errors you might get with Microsoft Office documents. For example, if a document is password protected, the conversion will fail because it needs the password to open the document. In general, you can address these types of issues by opening the document in the appropriate Microsoft Office application, correcting the cause of the error (for example, removing the password protection), re-saving the document, and reloading the document into the conversion domain.
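To confirm that a particular document failed conversion, you can check its Content Processing Framework state. The following is a minimal sketch, assuming a hypothetical document URI and that the query is evaluated against the content database. It uses cpf:document-get-state and cpf:document-get-error from the CPF library module.

```xquery
xquery version "1.0-ml";

import module namespace cpf = "http://marklogic.com/cpf"
  at "/MarkLogic/cpf/cpf.xqy";

(: Assumption: hypothetical URI of a document that was loaded for conversion. :)
let $uri := "/webdav/root/old-report.doc"
let $state := cpf:document-get-state($uri)
return
  if ($state eq xs:anyURI("http://marklogic.com/states/error"))
  then cpf:document-get-error($uri)  (: the recorded error, if any :)
  else $state                        (: otherwise, the current state :)
```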
When you add documents to the database for conversion, the user who adds the documents must have the needed permissions to add and modify documents. If you are using a WebDAV server to drag-and-drop documents into the database, the root directory of the WebDAV server must also have the needed permissions.
webdav, and the root directory has the URI /webdav/root/, run a query (as a privileged user) similar to the following:
xdmp:document-set-permissions("/webdav/root/", (
    xdmp:permission("webdav", "read"),
    xdmp:permission("webdav", "insert"),
    xdmp:permission("webdav", "update") ) )
webdav in the example above) to the user who accesses the WebDAV server.
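You can grant the role in the Admin Interface or from a query. The following is a sketch of the query approach, assuming a hypothetical user named webdav-user and the webdav role from the example above. It uses sec:user-add-roles from the security library module, which must be evaluated against the Security database by a user with sufficient security privileges.

```xquery
xquery version "1.0-ml";

import module namespace sec = "http://marklogic.com/xdmp/security"
  at "/MarkLogic/security.xqy";

(: Assumptions: "webdav-user" is a hypothetical user name, and "webdav" is
   the role from the example above. Evaluate this query against the
   Security database. :)
sec:user-add-roles("webdav-user", "webdav")
```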
If you are using a collection in the domain to specify which documents to convert, the new documents created by the conversion process must be created as part of the collection specified in the domain. You can do this in the following ways:
inherit collections option at the database level to true and make sure the parent directory belongs to the collection.
Otherwise, only the first phase of conversion will occur, because documents created during the conversion process will not be part of the collection specified in the domain. Similarly, you must either assign the appropriate default permissions to the user (or to a role to which the user is assigned) or set the permissions to inherit at the database level.
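One way to check what newly created documents will receive is to look at the current user's defaults. The following is a minimal sketch that assumes you run it as the user who loads the documents. It reports the default collections and default permissions that apply to documents that user creates, so you can confirm that the collection named in the domain is among them.

```xquery
xquery version "1.0-ml";

(: Run this as the user who loads the documents for conversion.
   The wrapper element names are illustrative only. :)
<defaults>
  <collections>{
    for $c in xdmp:default-collections()
    return <collection>{ $c }</collection>
  }</collections>
  <permissions>{ xdmp:default-permissions() }</permissions>
</defaults>
```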
If you need debugging capabilities, you can set trace events on the server for the Content Processing Framework. For details, see Debugging and Recovering from Error Conditions.
If you have special error handling needs, you can always extend the Default Conversion Option application by adding your own custom error handling pipeline. For details on pipelines and creating custom code, see Understanding and Using Pipelines and Using the Framework to Create Custom Applications.
You can create your own pipelines by copying and modifying the Default Conversion Option code to suit your needs. Make sure you understand domains, pipelines, the concepts of the Content Processing Framework, and the rules for XQuery modules in content processing applications before modifying the pipelines. For information on these topics, see the rest of this document.
The modification possibilities are endless. You can add phases to the pipeline to do your own processing, add email notification to your application, add entity extraction from a semantic tagging service, and so on. For information on creating custom applications, see Using the Framework to Create Custom Applications.
There are several alternate PDF pipelines available to attach to a domain instead of the default PDF pipeline. The Default Conversion Option is designed to have only one PDF pipeline attached to a domain at a time; do not attach several alternate PDF pipelines to the same domain. The following table lists the PDF pipelines with a description of each (choose the one that best matches your needs).
|PDF Conversion||This is the default pipeline. It is set up to extract the most out of the text in PDF files, concentrating more on structure than page layout. It is not optimized for page layout fidelity.|
|PDF Conversion (Page Layout)||This pipeline preserves the page layout fidelity as closely as possible. It converts the PDF file on a page-by-page basis, producing a new element for each page. This can make it difficult to render logical parts of the document together, unless they correspond to page breaks, but makes it easy to process the document one page at a time. You cannot do the DocBook processing (with the DocBook Conversion pipeline) when using this pipeline.|
|PDF Conversion (Page Layout with Reblocking)||This pipeline attempts to preserve more exact rendering of content than the default PDF pipeline, while making it more feasible to extract and render logical subunits instead of just pages. It is possible to lose some page-layout fidelity with this pipeline compared with the Page Layout pipeline. This pipeline requires more processing time than the others.|
|PDF Conversion (Page Layout, Image Batching)||This pipeline preserves the page layout fidelity as closely as possible, and also batch-processes images. It converts the PDF file on a page-by-page basis, producing a new element for each page. This can make it difficult to render logical parts of the document together, unless they correspond to page breaks, but makes it easy to process the document one page at a time. You cannot do the DocBook processing (with the DocBook Conversion pipeline) when using this pipeline. This pipeline is intended to minimize the memory requirements when converting PDF documents containing many large images (for example, scanned PDF files). The overall processing time will increase, however, because of the additional image extraction steps. You can adjust the |
|PDF Conversion (Image Batching)||This pipeline performs the extraction of the default pipeline and also extracts images in batches, increasing the overall processing time but decreasing the memory usage. You can adjust the |
|PDF Conversion (Paged Text, No Rendering)||This pipeline produces a very simple XML structure that is suitable for word searches within pages, but not rendering. Paragraph and section structure is lost. By default images are not extracted. This variant emphasizes speed of extraction and word search. To make best use of the output of this pipeline, define the elements |
The Default Conversion Option uses the built-in functions xdmp:excel-convert, xdmp:pdf-convert, xdmp:powerpoint-convert, xdmp:tidy, and xdmp:word-convert. The pipelines reference various XQuery modules that call these functions. Each of these functions takes an options node to control its behavior. The options are set to somewhat generic defaults that work well with a large variety of documents. Your own documents might have some more specific needs, however, and the pipelines are designed with the ability to pass in options nodes which specify conversion options.
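To see what one of these functions produces outside of a pipeline, you can call it directly. The following is a sketch only, assuming a hypothetical URI for a PDF document that is already loaded into the database and that the converters are installed. It calls xdmp:pdf-convert with default options and returns the first node of the result.

```xquery
xquery version "1.0-ml";

(: Assumption: "/webdav/root/example.pdf" is a hypothetical URI of a PDF
   document already in the database. :)
let $uri := "/webdav/root/example.pdf"
(: The second argument is the file name used to name the generated parts. :)
let $results := xdmp:pdf-convert(fn:doc($uri), "example.pdf")
(: The first node returned is expected to be the manifest of generated
   parts; the converted document and its parts follow it. :)
return $results[1]
```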
Each condition and action step in the MarkLogic pipelines has an options node. The options node is defined in a namespace with a URI corresponding to the module path invoked by that step. In these option nodes, you can enter options from any of the Document Conversion functions (xdmp:excel-convert, xdmp:pdf-convert, xdmp:powerpoint-convert, xdmp:tidy, xdmp:word-convert) in the namespace corresponding to that conversion function.
<action>
  <module>/MarkLogic/conversion/actions/convert-pdf-action.xqy</module>
  <options xmlns="/MarkLogic/conversion/actions/convert-pdf-action.xqy">
    <destination-root/>
    <wrap xmlns="xdmp:tidy">0</wrap>
    <tidy-mark xmlns="xdmp:tidy">false</tidy-mark>
    <show-warnings xmlns="xdmp:tidy">false</show-warnings>
  </options>
</action>
In this options node, the destination-root element is an option for this pipeline step. The wrap element is an option passed into the xdmp:tidy built-in function (which the PDF pipeline uses to clean the generated XHTML), and the tidy-mark and show-warnings elements are also options passed into xdmp:tidy.
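To see how the xdmp:tidy options behave on their own, you can pass the same three options directly to xdmp:tidy. This is a minimal sketch with a made-up HTML string as input; in a direct call, the options go in an options element in the xdmp:tidy namespace rather than in the namespace of a pipeline step.

```xquery
xquery version "1.0-ml";

(: The input string is an arbitrary, deliberately untidy HTML sample. :)
let $results :=
  xdmp:tidy(
    "<p>Unclosed paragraph<br>second line",
    <options xmlns="xdmp:tidy">
      <wrap>0</wrap>
      <tidy-mark>false</tidy-mark>
      <show-warnings>false</show-warnings>
    </options>)
(: xdmp:tidy returns a status node followed by the cleaned document;
   return only the cleaned document. :)
return $results[2]
```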
All of the pipelines have options nodes, and you can pass in any option to each pipeline. You can change the default options or add other options that make sense for your content. See the MarkLogic XQuery and XSLT Function Reference for the options for each of the Document Conversion functions. Also, the format conversion steps in a pipeline have the following options:
The destination-root option specifies an alternate directory URI where the output of the conversion processing is saved. The default-language option is used only on the Microsoft Office conversion pipelines, and it specifies the value of an xml:lang attribute to put on the root node of the converted Office documents.
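As an illustration, the PDF conversion step shown earlier could set destination-root as follows. This is a sketch written as an XQuery element constructor; the directory URI /converted/ is a hypothetical value, not a default. A default-language option would be set the same way in the options node of a Microsoft Office conversion step.

```xquery
xquery version "1.0-ml";

(: Assumption: "/converted/" is a hypothetical directory URI chosen for
   this example. The module path and the tidy option are taken from the
   PDF conversion step shown earlier. :)
<action>
  <module>/MarkLogic/conversion/actions/convert-pdf-action.xqy</module>
  <options xmlns="/MarkLogic/conversion/actions/convert-pdf-action.xqy">
    <destination-root>/converted/</destination-root>
    <wrap xmlns="xdmp:tidy">0</wrap>
  </options>
</action>
```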