This chapter describes the Default Conversion Option, which is designed to convert Microsoft Office, Adobe PDF, and HTML files to XHTML and DocBook.
Installing the Default Conversion Option installs the Content Processing Framework for your database, sets up the domain for the pipeline, loads the needed triggers into the triggers database, and performs other pipeline initialization tasks. You must install the Default Conversion Option for each database in which you plan to use conversion.
Complete the following steps to install the Default Conversion Option into a database.
MarkLogic recommends creating a new database to use when testing the Default Conversion Option.
Select a triggers database to use with your database (for example, Triggers). You can use any database for the triggers database. It can be the same database as the one you are configuring (for example, you can set the Documents database as the triggers database for the Documents database) or it can be a different database (for example, the Triggers database created as part of the installation process).
Select true for the enable conversion option and click Install. Make sure enable conversion is set to true. If it is set to false, you install only the Content Processing Framework, not the Default Conversion Option.
The Default Conversion Option is now installed for the database. The default domain determines which documents are processed, and by default it has a document scope that applies to any document in the database with a URI starting with a slash (/).
You can modify the domain settings if you want the Default Conversion Option to apply to a different set of documents. To modify the domain settings, click the default domain for your database (for example, Default Documents if you chose the Documents database) on the Content Processing Summary page and make the needed modifications. For details on domains, see Understanding and Using Domains.
To try out the pipeline, you need to load some Adobe PDF, Microsoft Office, and/or HTML files into the database. You can load the documents using any method you like. This section describes an easy way to load documents using a WebDAV server and client. You can then use this configuration to test document conversion with the Default Conversion Option. For more information on WebDAV servers, see WebDAV Servers in the Administrator's Guide.
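Because any loading method works, one alternative to WebDAV is a short XQuery load. The following is a minimal sketch, assuming a file at the hypothetical filesystem path /tmp/samples/word.doc and a hypothetical target URI /conversion/word.doc; the URI starts with a slash so that it falls inside the default domain scope.

```xquery
xquery version "1.0-ml";

(: Hypothetical source path and target URI; adjust both for your system.
   Run this against the database where the Default Conversion Option
   is installed. :)
xdmp:document-load("/tmp/samples/word.doc",
  <options xmlns="xdmp:document-load">
    <uri>/conversion/word.doc</uri>
  </options>)
```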
Complete the following steps to load and process documents in a database.
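Create a WebDAV server with a root of / that accesses the database in which you installed the Content Processing Pipeline. For the server settings, enter a server name (for example, CPF), enter / for the root, enter a port (for example, 9999), and select the database (for example, Documents).
Point your WebDAV client at the server (for example, http://localhost:9999).
Copy the documents you want to convert into a folder on the WebDAV server (for example, conversion).
The converted documents, as well as the original documents and any parts generated as part of the conversion, will appear in the WebDAV folder. If you have large documents or if you load many documents into the database, the processing might continue for several minutes or longer.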
After the conversion process is finished, for each HTML, Word, Excel, PowerPoint, and PDF document you loaded, the Default Conversion Option produces the following:
An XHTML document (*.xhtml).
A simplified DocBook XML document (*.xml).
A directory (*_parts) containing various parts generated as part of the conversion process. The parts are typically any images that were in the original document, a cascading style sheet document (conv.css), and a document containing an analysis of the stylesheet (css.xml). PDF documents also include toc.xml, which is an analysis of the table of contents structure.
The generated XHTML and XML documents have a URI that includes the suffix of the original document. For example, a document called word.doc produces word_doc.xml and word_doc.xhtml.
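The naming rule in the word.doc example can be sketched as a small XQuery function. This is an illustration of the pattern only, not the code the pipelines use; local:converted-uris is a hypothetical helper name, and the /conversion/ directory is an assumed location.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: replaces the final "." before the file suffix
   with "_" and appends the two output suffixes described above. :)
declare function local:converted-uris($uri as xs:string) as xs:string+
{
  let $base := fn:replace($uri, "\.([^./]+)$", "_$1")
  return (fn:concat($base, ".xhtml"), fn:concat($base, ".xml"))
};

(: Returns "/conversion/word_doc.xhtml" and "/conversion/word_doc.xml" :)
local:converted-uris("/conversion/word.doc")
```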
The Default Conversion Option combines the components of the Content Processing Framework with converters that create XML documents from Microsoft Office and Adobe PDF files. Together they form a unified conversion process that converts Microsoft Office, Adobe PDF, and HTML files to well-structured XHTML and simplified DocBook XML documents. This section provides background on how the default conversion process works.
The Default Conversion Option does not support documents written in the Microsoft Office 2007 or later format (Office Open XML). To convert these files, follow the steps in Microsoft Office 2007 and Later Documents.
The MarkLogic Converters package may generate temporary files. These temporary files are not supported by encryption at rest.
The Default Conversion Option includes the following components:
There are also supporting XQuery modules for the Default Conversion Option for the following:
These XQuery modules include the XQuery source code, so you can analyze them and use their functions in your own applications. The XQuery modules are installed into the following directory:
<install_dir>/Modules/MarkLogic/conversion
For details on these functions, see the MarkLogic XQuery and XSLT Function Reference.
The steps in the conversion process differ for the different document formats (Microsoft Office, Adobe PDF, and HTML). The steps are defined in the following pipelines:
Generally, the conversion process performs the following tasks:
The conversion states are defined in the pipelines and are stored in the properties document for each document. The conversion process includes the following states:
This section describes error and troubleshooting situations you might encounter with the Default Conversion Option.
The Default Conversion Option only converts documents written in Microsoft Office 97 through Microsoft Office 2003; it cannot convert Microsoft Office 95 and earlier documents. If you try to convert Microsoft Word 95 or older documents (or other Microsoft Office 95 documents), the conversion fails and the document is put in the http://marklogic.com/states/error state. If this happens, you can do the following:
Once you reload the documents into a database with content processing installed and configured, the new documents will be converted.
There are other types of errors you might get with Microsoft Office documents. For example, if a document is password protected, the conversion will fail because it needs the password to open the document. In general, you can address these types of issues by opening the document in the appropriate Microsoft Office application, changing the cause of the error (for example, removing the password protection), re-saving the document, and reloading the document into the conversion domain.
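To see whether a particular document ended up in the error state, you can read its state from the properties document. The following is a minimal sketch, assuming the state is recorded in a cpf:state property element in the http://marklogic.com/cpf namespace and using a hypothetical document URI.

```xquery
xquery version "1.0-ml";

(: Assumption: CPF records the state in a cpf:state property. :)
declare namespace cpf = "http://marklogic.com/cpf";

(: Hypothetical URI of a document loaded for conversion. :)
let $uri := "/conversion/word.doc"
let $state := xdmp:document-properties($uri)//cpf:state/fn:string()
return
  if ($state eq "http://marklogic.com/states/error")
  then fn:concat($uri, " failed conversion")
  else fn:concat($uri, " is in state: ", $state)
```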
When you add documents to the database for conversion, the user who adds the documents must have the needed permissions to add and modify documents. If you are using a WebDAV server to drag-and-drop documents into the database, the root directory of the WebDAV server must also have the needed permissions.
One simple way to meet these security requirements is to do the following:
For example, if the role is named webdav and the root directory has the URI /webdav/root/, run a query (as a privileged user) similar to the following:
xdmp:document-set-permissions("/webdav/root/",
  (xdmp:permission("webdav", "read"),
   xdmp:permission("webdav", "insert"),
   xdmp:permission("webdav", "update")))
You can check the permissions with the following query:
xdmp:document-get-permissions("/webdav/root/")
Grant the role (webdav in the example above) to the user who accesses the WebDAV server.
If you are using a collection in the domain to specify which documents to convert, the new documents created by the conversion process must be created as part of the collection specified in the domain. You can do this in the following ways:
Set the inherit collections option at the database level to true and make sure the parent directory belongs to the collection.
Otherwise only the first phase of conversion will occur (because documents created during the conversion process will not be part of the collection specified in the domain). Similarly, you must either assign the appropriate default permissions to the user (or to a role to which the user is assigned) or set permissions to inherit at the database level.
For information on inherited collections and inherited permissions, see the Administrator's Guide. For information on permissions, see the Security Guide.
If you need debugging capabilities, you can set trace events on the server for the Content Processing Framework. For details, see Debugging and Recovering from Error Conditions.
If you have special error handling needs, you can always extend the Default Conversion Option application by adding your own custom error handling pipeline. For details on pipelines and creating custom code, see Understanding and Using Pipelines and Using the Framework to Create Custom Applications.
This section describes ways to modify the Default Conversion Option.
All of the XQuery code and all of the pipelines for the Default Conversion Option are installed with MarkLogic Server. The pipelines are installed in the following directory:
<install_dir>/Installer
The XQuery modules are installed under the Modules directory in the following location:
<install_dir>/Modules/MarkLogic/conversion/actions
You can create your own pipelines by copying and modifying the Default Conversion Option code to suit your needs. Make sure you understand domains, pipelines, the concepts of the Content Processing Framework, and the rules for XQuery modules in content processing applications before modifying the pipelines. For information on these topics, see the rest of this document.
You can modify the pipelines in many ways. For example, you can add phases to the pipeline to do your own processing, add email notification to your application, or add entity extraction from a semantic tagging service. For information on creating custom applications, see Using the Framework to Create Custom Applications.
There are several alternate PDF pipelines available to attach to a domain instead of the default PDF pipeline. The Default Conversion Option is designed to have only one PDF pipeline attached to a domain at a time; do not attach several alternate PDF pipelines to the same domain. The following table lists the PDF pipelines with a description of each (choose the one that best matches your needs).
The Default Conversion Option uses the built-in functions xdmp:excel-convert, xdmp:pdf-convert, xdmp:powerpoint-convert, xdmp:tidy, and xdmp:word-convert. The pipelines reference various XQuery modules that call these functions. Each of these functions takes an options node to control its behavior. The options are set to generic defaults that work well with a large variety of documents. Your own documents might have more specific needs, however, so the pipelines are designed to accept options nodes that specify conversion options.
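To illustrate how one of these functions takes an options node, the following sketch calls xdmp:pdf-convert directly, outside of any pipeline. It assumes a PDF already stored at the hypothetical URI /conversion/manual.pdf and that the converters are installed; the password value is a placeholder.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI of a PDF already loaded into the database. :)
let $uri := "/conversion/manual.pdf"

(: Options for this function live in the xdmp:pdf-convert namespace. :)
let $options :=
  <options xmlns="xdmp:pdf-convert">
    <password>your_password</password>
  </options>

(: Returns the nodes produced by the conversion. :)
return xdmp:pdf-convert(fn:doc($uri), "manual.pdf", $options)
```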
Each condition and action step in the MarkLogic pipelines has an options node. The options node is defined in a namespace with a URI corresponding to the module path invoked by that step. In these options nodes, you can enter options from any of the Document Conversion functions (xdmp:excel-convert, xdmp:pdf-convert, xdmp:powerpoint-convert, xdmp:tidy, xdmp:word-convert) in the namespace corresponding to that conversion function.
For example, the action for the default PDF Pipeline (pdf-pipeline.xml) has the following options node:
<action>
  <module>/MarkLogic/conversion/actions/convert-pdf-action.xqy</module>
  <options xmlns="/MarkLogic/conversion/actions/convert-pdf-action.xqy">
    <destination-root/>
    <wrap xmlns="xdmp:tidy">0</wrap>
    <tidy-mark xmlns="xdmp:tidy">false</tidy-mark>
    <show-warnings xmlns="xdmp:tidy">false</show-warnings>
  </options>
</action>
In this options node, the destination-root element is an option for this pipeline step. The wrap element is an option passed into the xdmp:tidy built-in function (which the PDF Pipeline uses to clean the generated XHTML), and the tidy-mark and show-warnings elements are also options passed into tidy.
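To experiment with these tidy options outside of a pipeline, you can pass the same values to xdmp:tidy directly. The following is a rough sketch; the input string is made up for illustration, and it assumes that the second item returned by xdmp:tidy is the cleaned document.

```xquery
xquery version "1.0-ml";

(: Made-up input with an unclosed element, for illustration only. :)
let $html := "<p>Some <b>unclosed markup</p>"

(: The same option values that the PDF pipeline step passes to tidy. :)
let $result :=
  xdmp:tidy($html,
    <options xmlns="xdmp:tidy">
      <wrap>0</wrap>
      <tidy-mark>false</tidy-mark>
      <show-warnings>false</show-warnings>
    </options>)

(: Assumption: item 1 is status information, item 2 the cleaned XHTML. :)
return $result[2]
```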
Suppose the PDF documents you want to convert are all password protected and all have the same password. You can then add the following to the options node to specify a password:
<password xmlns="xdmp:pdf-convert">your_password</password>
Notice that the namespace of the password option is xdmp:pdf-convert, and this option will be passed in when the pipeline processing calls xdmp:pdf-convert.
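Putting the two pieces together, the action for the PDF pipeline step would look something like the following once the password option is added. This is a sketch assembled from the options node and the password element shown above; your_password is a placeholder.

```xml
<action>
  <module>/MarkLogic/conversion/actions/convert-pdf-action.xqy</module>
  <options xmlns="/MarkLogic/conversion/actions/convert-pdf-action.xqy">
    <destination-root/>
    <!-- Passed through to xdmp:tidy -->
    <wrap xmlns="xdmp:tidy">0</wrap>
    <tidy-mark xmlns="xdmp:tidy">false</tidy-mark>
    <show-warnings xmlns="xdmp:tidy">false</show-warnings>
    <!-- Passed through to xdmp:pdf-convert -->
    <password xmlns="xdmp:pdf-convert">your_password</password>
  </options>
</action>
```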
All of the pipelines have options nodes, and you can pass in any option to each pipeline. You can change the default options or add other options that make sense for your content. See the MarkLogic XQuery and XSLT Function Reference for the options for each of the Document Conversion functions. Also, the format conversion steps in a pipeline have the following options:
The destination-root option specifies an alternate directory URI where the output of the conversion processing is saved. The default-language option is only used on the Microsoft Office conversion pipelines, and it specifies the value of an xml:lang attribute to put on the root node of the converted Office documents.
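As an illustration, the options node for the PDF conversion step might set destination-root as follows. The directory URI /converted/ is a hypothetical value chosen for this sketch, and whether a trailing slash is expected is an assumption to verify against your installation.

```xml
<options xmlns="/MarkLogic/conversion/actions/convert-pdf-action.xqy">
  <!-- Hypothetical alternate directory URI for conversion output -->
  <destination-root>/converted/</destination-root>
  <wrap xmlns="xdmp:tidy">0</wrap>
  <tidy-mark xmlns="xdmp:tidy">false</tidy-mark>
  <show-warnings xmlns="xdmp:tidy">false</show-warnings>
</options>
```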