MarkLogic 10 Product Documentation
xdmp:pdf-convert

xdmp:pdf-convert(
   $doc as node(),
   $filename as xs:string,
   [$options as (element()|map:map)?]
) as node()*

Summary

Converts a PDF file to XHTML. Returns several nodes, including a parts node, the converted document XML node, and any other document parts (for example, CSS files and images). The first node is the parts node, which contains a manifest of all of the parts generated as a result of the conversion.

Parameters
doc PDF document to convert to HTML, as a binary node().
filename The root for the name of the converted files and directories. If the specified filename includes an extension, then the extension is appended to the root with an underscore. The directory for other parts of the conversion (images, for example) has the string "_parts" appended to the root. For example, if you specify a filename of "myFile.pdf", the generated names will be "myFile_pdf.xhtml" for the XML node and "myFile_pdf_parts" for the directory containing any other parts generated by the conversion (images, CSS files, and so on).
options The options element for this conversion. The node for the options must be in the xdmp:pdf-convert namespace. The default value is (). In addition to the options shown below, you can specify xdmp:tidy options by entering the tidy option elements in the xdmp:tidy namespace.

Options include:

<tidy>

Default value: true

Specify true to run tidy on the document and false not to run tidy. If you run tidy, you can also specify any xdmp:tidy options. Any tidy option elements must be in the xdmp:tidy namespace.

<config>

The configuration file for the conversion. You can specify an absolute path or a relative path. The relative path is relative to the <install_dir>/Converters/cvtpdf directory. The default configuration file is named PDFtoHTML.cfg; it produces a single reflowed XHTML document with CSS styling. Setting this parameter may override the remaining options.

<page-by-page>

Default value: false

Specify true to select a different default configuration file that produces one XHTML document per page with absolute positioning. The default paged configuration file is named PDFtoXHTML_pages.cfg. If a specific configuration file is selected with the config option, the page-by-page option has no effect.
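
As an illustration of the option above, the following is a minimal sketch that requests page-by-page conversion and returns only the manifest, so that you can inspect which parts were generated. The file name "myFile.pdf" is reused from the example in this page; no assumption is made here about how the per-page documents are named.

```xquery
xquery version "1.0-ml";

(: Sketch only: "myFile.pdf" is a placeholder path on the server's filesystem. :)
let $options :=
  <options xmlns="xdmp:pdf-convert">
    <page-by-page>true</page-by-page>
  </options>
let $results :=
  xdmp:pdf-convert(xdmp:document-get("myFile.pdf"), "myFile.pdf", $options)
return
  (: the first returned node is the manifest listing every generated part :)
  $results[1]
```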

<page-start-id>

Default value: 0

The index of the first page to convert. Page indices start at zero.

<page-end-id>

Default value: -1

The index of the last page to convert. Page indices start at zero. The default is -1, meaning to convert through the last page of the document.
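
The two page-index options can be combined to convert a range of pages. The following sketch converts page indices 0 through 4, which covers the first five pages if the end index is inclusive, as the description above suggests. The file name is a placeholder.

```xquery
xquery version "1.0-ml";

(: Sketch only: convert pages with zero-based indices 0 through 4. :)
let $options :=
  <options xmlns="xdmp:pdf-convert">
    <page-start-id>0</page-start-id>
    <page-end-id>4</page-end-id>
  </options>
return
  xdmp:pdf-convert(xdmp:document-get("myFile.pdf"), "myFile.pdf", $options)
```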

<synth-bookmarks>

Default value: true

Enable/disable the converter's internal font-based TOC inferences.

<image-output>

Default value: true

Enable/disable extraction and conversion of images.

<text-output>

Default value: true

Enable/disable extraction of text.
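
For example, a text-only conversion might disable image output and leave text output at its default. This is a sketch under the assumption that only the text of the PDF is of interest; the file name is a placeholder.

```xquery
xquery version "1.0-ml";

(: Sketch only: skip image extraction; text-output is left at its default of true. :)
let $options :=
  <options xmlns="xdmp:pdf-convert">
    <image-output>false</image-output>
  </options>
return
  xdmp:pdf-convert(xdmp:document-get("myFile.pdf"), "myFile.pdf", $options)
```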

<zones>

Default value: false

Enable/disable zone controls. Using true produces better results when the PDF is annotated; using false produces better results in non-annotated tables.

<ignore-text>

Default value: true

Enable/disable extraction of text from images. Documents consisting of scanned pages can only have text extracted if this parameter is set to true; however, diagrams with embedded text labels may then convert less cleanly. In page-by-page conversion, the reflowing of text and graphical elements within a diagram is less of a problem, so a value of false will probably be the better choice.

<remove-overprint>

Default value: false

Enable/disable removal of text overlays. Setting this parameter to true can sometimes clean up messy results stemming from reflowing of text that was not visible in the original PDF because it was covered by something else.

<illustrations>

Default value: true

Enable/disable extraction of illustrations. Setting this parameter to false can sometimes clean up messy results stemming from minor and unnecessary graphical ornaments.

<image-quality>

Default value: 75

Determines the quality of extracted and converted images: smaller values mean smaller image sizes (in bytes) but lossier rendering. The maximum is 100.

<page-start>

Default value: none

Boilerplate text inserted at the start of every page. Any XML markup must be escaped. For example: &lt;p&gt;PAGE START&lt;/p&gt;
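
The following sketch shows one way to supply escaped markup from XQuery. Inside a direct element constructor, the entity references &lt; and &gt; produce literal angle brackets in the option's text content, so the option carries the markup as a string rather than as child elements. The file name is a placeholder.

```xquery
xquery version "1.0-ml";

(: Sketch only: the boilerplate is passed as escaped text, not as a child element. :)
let $options :=
  <options xmlns="xdmp:pdf-convert">
    <page-start>&lt;p&gt;PAGE START&lt;/p&gt;</page-start>
  </options>
return
  xdmp:pdf-convert(xdmp:document-get("myFile.pdf"), "myFile.pdf", $options)
```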

<page-end>

Default value: none

Boilerplate text inserted at the end of every page. XML markup must be escaped.

<document-start>

Default value: none

Boilerplate text inserted at the start of every document. XML markup must be escaped.

<document-end>

Default value: none

Boilerplate text inserted at the end of every document. XML markup must be escaped.

<password>

Default value: none

The password required to open a password-protected PDF.
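
A minimal sketch of supplying the password without hard-coding it in the query follows. The external variable $pdf-password and the file name "protected.pdf" are hypothetical names chosen for illustration; the caller is assumed to bind the variable when invoking the query.

```xquery
xquery version "1.0-ml";

(: Hypothetical external variable; the caller supplies the real password. :)
declare variable $pdf-password as xs:string external;

let $options :=
  <options xmlns="xdmp:pdf-convert">
    <password>{$pdf-password}</password>
  </options>
return
  xdmp:pdf-convert(xdmp:document-get("protected.pdf"), "protected.pdf", $options)
```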

Sample Options Node:

The following is a sample options node that runs tidy to clean the generated HTML, enables the tidy "clean" option, and names a particular configuration file to use for the conversion:
<options xmlns="xdmp:pdf-convert"
         xmlns:tidy="xdmp:tidy">
  <tidy>true</tidy>
  <tidy:clean>yes</tidy:clean>
  <config>c:\myConfigFile.cfg</config>
</options>
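
An options node such as this is passed as the third argument to xdmp:pdf-convert. The sketch below reuses the tidy settings from the sample but omits the config element, so the default configuration file is used; the file name is a placeholder.

```xquery
xquery version "1.0-ml";

(: Sketch only: tidy options live in the xdmp:tidy namespace, bound here to the tidy prefix. :)
let $options :=
  <options xmlns="xdmp:pdf-convert"
           xmlns:tidy="xdmp:tidy">
    <tidy>true</tidy>
    <tidy:clean>yes</tidy:clean>
  </options>
return
  xdmp:pdf-convert(xdmp:document-get("myFile.pdf"), "myFile.pdf", $options)
```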

Usage Notes

This function is part of a separate package which may generate temporary files. These temporary files are not supported by encryption at rest.

Starting with release 9.0-4, this function requires a separate converter installation package. For details, see MarkLogic Converters Installation Changes Starting at Release 9.0-4 in the Installation Guide for All Platforms.

The convert functions return several nodes. The first node is a manifest containing the various parts of the conversion. Typically there will be an xml part, a css part, and some image parts. Each part is returned as a separate node in the order shown in the manifest.

Therefore, given the following manifest:

<parts>
  <part>myFile_pdf.xhtml</part>
  <part>myFile_pdf_parts/conv.css</part>
  <part>myFile_pdf_parts/toc.xml</part>
</parts>

the first node returned is the manifest, the second is the "myFile_pdf.xhtml" node, the third is the "myFile_pdf_parts/conv.css" node, and the fourth is the "myFile_pdf_parts/toc.xml" node.
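
Because the parts are returned in manifest order, each part name can be paired with its node by position. The following sketch stores every converted part in the database under a URI built from its manifest entry. The "/converted/" URI prefix is a hypothetical choice, the wildcard name test *:part is used so that the sketch does not depend on the namespace of the manifest, and the query is assumed to run with privileges that allow document inserts.

```xquery
xquery version "1.0-ml";

let $results  := xdmp:pdf-convert(xdmp:document-get("myFile.pdf"), "myFile.pdf")
let $manifest := $results[1]
(: the nodes after the manifest, in the same order as the manifest entries :)
let $parts    := $results[2 to last()]
for $name at $i in $manifest//*:part/fn:string(.)
return
  (: "/converted/" is a hypothetical URI prefix chosen for this sketch :)
  xdmp:document-insert(fn:concat("/converted/", $name), $parts[$i])
```

Pairing by position rather than by inspecting each node keeps the sketch independent of the part types (XHTML, CSS, images) that a given conversion produces.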

This function is not supported on Mac OS X. For details, see Supported Platforms in the Installation Guide for All Platforms.

Example

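(: read the PDF from the filesystem and convert it :)
let $results := xdmp:pdf-convert(
                         xdmp:document-get("myFile.pdf"),
                         "myFile.pdf" ),
    (: the first node returned is the manifest :)
    $manifest := $results[1]
return
(: the remaining nodes are the converted parts, in manifest order :)
$results[2 to last()]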

=> all of the converted nodes