xdmp.pdfConvert

xdmp.pdfConvert(
   doc as Node,
   filename as String,
   [options as Object?]
) as Sequence

Summary

Converts a PDF file to XHTML. Returns several nodes, including a parts node, the converted document xml node, and any other document parts (for example, css files and images). The first node is the parts node, which contains a manifest of all of the parts generated as a result of the conversion.

Parameters

doc PDF document to convert to XHTML, as a binary node().

filename The root for the name of the converted files and directories. If the specified filename includes an extension, then the extension is appended to the root with an underscore. The directory for other parts of the conversion (images, for example) has the string "_parts" appended to the root. For example, if you specify a filename of "myFile.pdf", the generated names will be "myFile_pdf.xhtml" for the xml node and "myFile_pdf_parts" for the directory containing any other parts generated by the conversion (images, css files, and so on).
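The naming rule can be modeled in a few lines. This is only a sketch of the rule as described above, not the converter's own code: the helper name convertedNames is hypothetical, and its handling of names with no extension or with several dots is an assumption.

```javascript
// Hypothetical helper (not part of the xdmp API): models the naming
// rule described for the filename parameter.
function convertedNames(filename) {
  // Assumption: the extension is whatever follows the last dot.
  const dot = filename.lastIndexOf(".");
  // Assumption: a name without an extension is used as the root unchanged.
  const root = dot > 0
    ? filename.slice(0, dot) + "_" + filename.slice(dot + 1)
    : filename;
  return { xhtml: root + ".xhtml", partsDir: root + "_parts" };
}

const names = convertedNames("myFile.pdf");
// names.xhtml    is "myFile_pdf.xhtml"
// names.partsDir is "myFile_pdf_parts"
```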


options

The options object for this conversion. The default value is null. In addition to the options shown below, you can add xdmp.tidy options directly.

Options include:

tidy

Default value: true

Specify true to run tidy on the document and false not to run tidy. If you run tidy, you can also specify any xdmp.tidy options.

config

The configuration file for the conversion. You can specify an absolute path or a relative path. The relative path is relative to the <install_dir>/Converters/cvtpdf directory. The default configuration file is named PDFtoHTML.cfg; it produces a single reflowed XHTML document with CSS styling. Setting this parameter may override the remaining options.

pageByPage

Default value: false

Specify true to select a different default configuration file that produces one XHTML document per page with absolute positioning. The default paged configuration file is named PDFtoXHTML_pages.cfg. If a specific configuration file is selected with the config option, the pageByPage option has no effect.

pageStartId

Default value: 0

The index of the first page to convert. Page indices start at zero.

pageEndId

Default value: -1

The index of the last page to convert. Page indices start at zero. The default is -1, meaning to convert through the last page of the document.

synthBookmarks

Default value: true

Enable/disable the converter's internal font-based inference of a table of contents (TOC).

imageOutput

Default value: true

Enable/disable extraction and conversion of images.

textOutput

Default value: true

Enable/disable extraction of text.

zones

Default value: false

Enable/disable zone controls. Using true produces better results when the PDF is annotated; using false produces better results in non-annotated tables.

ignoreText

Default value: true

Enable/disable extraction of text from images. Documents consisting of scanned pages can only have text extracted if this parameter is set to true; however, diagrams with embedded text labels may then convert less cleanly, because text and graphical elements within a diagram are reflowed. For page-by-page conversion, this reflowing is much less of a problem, and the value of false will probably be the better choice.

removeOverprint

Default value: false

Enable/disable removal of text overlays. Setting this parameter to true can sometimes clean up messy results stemming from reflowing of text that was not visible in the original PDF because it was covered by something else.

illustrations

Default value: true

Enable/disable extraction of illustrations. Setting this parameter to false can sometimes clean up messy results stemming from minor and unnecessary graphical ornaments.

imageQuality

Default value: 75

Determines the quality of extracted and converted images: smaller values mean smaller image sizes (in bytes) but lossier rendering. The maximum is 100.

pageStart

Default value: none

Boilerplate text inserted at the start of every page. Any XML markup must be escaped. For example: &lt;p&gt;PAGE START&lt;/p&gt;

pageEnd

Default value: none

Boilerplate text inserted at the end of every page. XML markup must be escaped.

documentStart

Default value: none

Boilerplate text inserted at the start of every document. XML markup must be escaped.

documentEnd

Default value: none

Boilerplate text inserted at the end of every document. XML markup must be escaped.

password

Default value: none

The password required to open a password-protected PDF.
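As an illustration of the page-range and boilerplate options, the following sketch builds an options object from 1-based page numbers and unescaped markup. The helpers escapeXml and pageRangeOptions are hypothetical and not part of the xdmp API; the sketch also assumes that "escaped" means entity-escaping the characters &, <, and >.

```javascript
// Hypothetical helper: entity-escape XML markup for the pageStart,
// pageEnd, documentStart, and documentEnd options.
// Assumption: escaping &, <, and > is what the converter expects.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: turn 1-based page numbers into the zero-based
// pageStartId and pageEndId options.
function pageRangeOptions(firstPage, lastPage) {
  return {
    pageStartId: firstPage - 1,
    // -1 is the documented value for "through the last page".
    pageEndId: lastPage === undefined ? -1 : lastPage - 1
  };
}

// Convert pages 2 through 5 and mark the start of each page.
const options = Object.assign(
  { pageStart: escapeXml("<p>PAGE START</p>") },
  pageRangeOptions(2, 5)
);
```

Keeping the arithmetic in one place avoids off-by-one mistakes, since both page indices are zero-based and pageEndId names the last page converted.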

Sample Options Object:
The following sample options object runs tidy to clean the generated XHTML, sets the tidy "clean" option, and names a particular configuration file to use for the conversion. The backslash in the Windows path is doubled because a single backslash starts an escape sequence in a JavaScript string literal:
{
  'tidy': true,
  'clean': 'yes',
  'config': "c:\\myConfigFile.cfg"
}
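A page-by-page conversion might combine several of the options above, as in this sketch. The option values are illustrative choices, the document path is the one used in the Example section, and the typeof guard is only there so the snippet can be read and run outside MarkLogic, where the xdmp global does not exist.

```javascript
// Illustrative option values, not recommended defaults.
const pagedOptions = {
  pageByPage: true,  // one XHTML document per page, absolute positioning
  ignoreText: false, // the value its description suggests for paged output
  pageStartId: 0,    // first page (indices are zero-based)
  pageEndId: 4       // fifth page, the last one converted
};

// Convert only when running inside MarkLogic, where xdmp is defined.
let pagedResults = null;
if (typeof xdmp !== "undefined") {
  pagedResults = xdmp.pdfConvert(
    xdmp.documentGet("/space/Hello.pdf"),
    "Hello.pdf",
    pagedOptions);
}
```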

Usage Notes

This function is part of a separate package which may generate temporary files. These temporary files are not supported by encryption at rest.

Starting with release 9.0-4, this function requires a separate converter installation package. For details, see MarkLogic Converters Installation Changes Starting at Release 9.0-4 in the Installation Guide for All Platforms.

The convert functions return several nodes. The first node is a manifest containing the various parts of the conversion. Typically there will be an xml part, a css part, and some image parts. Each part is returned as a separate node in the order shown in the manifest.

Therefore, given the following manifest:

<parts>
  <part>myFile_pdf.xhtml</part>
  <part>myFile_pdf_parts/conv.css</part>
  <part>myFile_pdf_parts/toc.xml</part>
</parts>

the first node returned is the manifest, the second is the "myFile_pdf.xhtml" node, the third is the "myFile_pdf_parts/conv.css" node, and the fourth is the "myFile_pdf_parts/toc.xml" node.
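Because the nodes follow the manifest order, each part name can be matched to its node by position. The sketch below shows one way to do that. The helper pairParts is hypothetical, and it assumes the part names have already been read out of the manifest into an array of strings (how to read the manifest node is not covered here).

```javascript
// Hypothetical helper: pair each part name from the manifest with the
// node returned in the same position. `results` is any iterable whose
// first item is the manifest, such as the Sequence from xdmp.pdfConvert.
function pairParts(partNames, results) {
  const it = results[Symbol.iterator]();
  it.next(); // skip the manifest, which is the first item
  const pairs = [];
  for (const name of partNames) {
    const step = it.next();
    if (step.done) break; // fewer nodes than manifest entries
    pairs.push({ name: name, node: step.value });
  }
  return pairs;
}
```

Each resulting pair holds the name under which a part could be stored and the node to store, for example "myFile_pdf_parts/conv.css" together with the css node.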

This function is not supported on Mac OS X. For details, see Supported Platforms in the Installation Guide for All Platforms.

Example

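// Convert the PDF; the second argument is the root name for the generated parts.
var results = xdmp.pdfConvert(
                xdmp.documentGet("/space/Hello.pdf"),
                "Hello.pdf");
// The returned Sequence is iterable; the manifest comes first.
var it = results[Symbol.iterator]();
var manifest = it.next().value;
// The second item is the converted XHTML document.
var pdfAsXHTML = it.next().value;
pdfAsXHTML;

=> The PDF document converted to XHTML. The results variable
   is a Sequence, where the first item is the manifest and the
   remaining items are the converted nodes.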
