MarkLogic 10 Product Documentation
xdmp:pdf-convert
xdmp:pdf-convert(
$doc as node(),
$filename as xs:string,
[$options as (element()|map:map)?]
) as node()*
Summary
Converts a PDF file to XHTML. Returns several nodes,
including a parts node, the converted document xml node, and any
other document parts (for example, css files and images). The first
node is the parts node, which contains a manifest of all of the parts
generated as a result of the conversion.
Parameters
doc
PDF document to convert to HTML, as a binary node().
filename
The root for the name of the converted files and directories. If the
specified filename includes an extension, then the extension is appended
to the root with an underscore. The directory for other parts of the
conversion (images, for example) has the string "_parts" appended to the
root. For example, if you specify a filename of "myFile.pdf", the
generated names will be "myFile_pdf.xhtml" for the xml node and
"myFile_pdf_parts" for the directory containing any other parts
generated by the conversion (images, css files, and so on).
options
The options element for this conversion. The node for the options must
be in the xdmp:pdf-convert namespace. The default value is ().
In addition to the options shown below, you can
specify xdmp:tidy options by entering the
tidy option elements in the xdmp:tidy namespace.
Options include:
<tidy >
- Default value:
true
Specify true to run tidy on the document and
false not to run tidy.
If you run tidy, you can also specify any xdmp:tidy options.
Any tidy option elements must be in the xdmp:tidy namespace.
<config >
- The configuration file for the conversion. You can specify an
absolute path or a relative path. The relative path is relative
to the <install_dir>/Converters/cvtpdf directory.
The default configuration file is named PDFtoHTML.cfg;
it produces a single reflowed XHTML document with CSS styling. Setting
this parameter may override the remaining options.
<page-by-page >
- Default value:
false
Specify true to select a different default configuration
file that produces one XHTML document per page with absolute positioning.
The default paged configuration file is named PDFtoXHTML_pages.cfg.
If a specific configuration file is selected with the config
option, the page-by-page option has no effect.
<page-start-id >
- Default value:
0
The index of the first page to convert. Page indices start at zero.
<page-end-id >
- Default value:
-1
The index of the last page to convert. Page indices start at zero.
The default is -1, meaning to convert through the last page of the
document.
<synth-bookmarks >
- Default value:
true
Enable/disable the converter's internal font-based table-of-contents (TOC) inference.
<image-output >
- Default value:
true
Enable/disable extraction and conversion of images.
<text-output >
- Default value:
true
Enable/disable extraction of text.
<zones >
- Default value:
false
Enable/disable zone controls. Using true produces better
results when the PDF is annotated; using false produces
better results in non-annotated tables.
<ignore-text >
- Default value:
true
Enable/disable extraction of text from images. Documents consisting of
scanned pages can only have text extracted if this parameter is set to
true; however, diagrams with embedded text labels may then convert less
cleanly. For page-by-page conversion, reflowing text and graphical
elements within a diagram is less of a problem, so false will probably
be the better choice.
<remove-overprint >
- Default value:
false
Enable/disable removal of text overlays. Setting this parameter to
true can sometimes clean up messy results stemming from
reflowing of text that was not visible in the original PDF because it
was covered by something else.
<illustrations >
- Default value:
true
Enable/disable extraction of illustrations. Setting this parameter to
false can sometimes clean up messy results stemming from
minor and unnecessary graphical ornaments.
<image-quality >
- Default value:
75
Determines the quality of extracted and converted images: smaller values
mean smaller image sizes (in bytes) but lossier rendering. The maximum is
100.
<page-start >
- Default value: none
Boilerplate text inserted at the start of every page. Any XML markup
must be escaped. For example: &lt;p&gt;PAGE START&lt;/p&gt;
<page-end >
- Default value: none
Boilerplate text inserted at the end of every page. XML markup must be
escaped.
<document-start >
- Default value: none
Boilerplate text inserted at the start of every document. XML markup
must be escaped.
<document-end >
- Default value: none
Boilerplate text inserted at the end of every document. XML markup must
be escaped.
<password >
- Default value: none
The password required to open a password-protected PDF.
Sample Options Node:
- The following sample options node runs tidy on the generated HTML,
enables the tidy "clean" option, and names a particular configuration
file to use for the conversion:
<options xmlns="xdmp:pdf-convert"
         xmlns:tidy="xdmp:tidy">
  <tidy>true</tidy>
  <tidy:clean>yes</tidy:clean>
  <config>c:\myConfigFile.cfg</config>
</options>
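As a minimal sketch of how the options above are passed to the function, the following call converts only the first five pages (indices 0 through 4) and skips image extraction and tidy. The file name "myFile.pdf" is illustrative, and the sketch assumes the file is readable from the server's filesystem.

```xquery
xquery version "1.0-ml";

(: Options element in the xdmp:pdf-convert namespace, as required above. :)
let $options :=
  <options xmlns="xdmp:pdf-convert">
    <page-start-id>0</page-start-id>       <!-- first page; indices start at zero -->
    <page-end-id>4</page-end-id>           <!-- last page to convert (the fifth page) -->
    <image-output>false</image-output>     <!-- do not extract or convert images -->
    <tidy>false</tidy>                     <!-- do not run tidy on the result -->
  </options>
return
  xdmp:pdf-convert(
    xdmp:document-get("myFile.pdf"),       (: assumed to exist on the server :)
    "myFile.pdf",
    $options)
```

If a config option is also supplied, the description of that option above notes that it may override the other settings.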
Usage Notes
This function is part of a separate package that may generate temporary files.
These temporary files are not protected by encryption at rest.
This function is not available on Mac OS X.
This function requires a separate converter installation package starting with release 9.0-4; see
MarkLogic Converters Installation Changes Starting at Release 9.0-4 in the Installation Guide for All Platforms.
The convert functions return several nodes. The first node is a manifest
containing the various parts of the conversion. Typically there will be
an xml part, a css part, and some image parts. Each part is returned as
a separate node in the order shown in the manifest.
Therefore, given the following manifest:
<parts>
  <part>myFile_pdf.xhtml</part>
  <part>myFile_pdf_parts/conv.css</part>
  <part>myFile_pdf_parts/toc.xml</part>
</parts>
the first node of the returned sequence is the manifest, the second is the
"myFile_pdf.xhtml" node, the third is the "myFile_pdf_parts/conv.css" node,
and the fourth is the "myFile_pdf_parts/toc.xml" node.
This function is not supported on Mac OS X. For details, see
Supported Platforms in the Installation Guide for All Platforms.
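Because the parts are returned in manifest order, one way to store a conversion is to pair each part name in the manifest with the node at the same position in the rest of the sequence. The sketch below assumes that pairing holds as described, uses a hypothetical "/converted/" URI prefix, and matches part elements with a namespace wildcard because the manifest's namespace is not stated here.

```xquery
xquery version "1.0-ml";

let $results  := xdmp:pdf-convert(
                   xdmp:document-get("myFile.pdf"),  (: illustrative file name :)
                   "myFile.pdf")
let $manifest := $results[1]                 (: first node: the parts manifest :)
let $parts    := fn:subsequence($results, 2) (: remaining nodes, in manifest order :)
for $name at $i in $manifest//*:part
return
  (: "/converted/" is a hypothetical URI prefix, not something the converter sets. :)
  xdmp:document-insert(
    fn:concat("/converted/", fn:string($name)),
    $parts[$i])
```

With the manifest shown above, this would insert three documents, with URIs ending in myFile_pdf.xhtml, myFile_pdf_parts/conv.css, and myFile_pdf_parts/toc.xml.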
Example
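(: Convert the PDF; the first returned node is the manifest. :)
let $results := xdmp:pdf-convert(
                  xdmp:document-get("myFile.pdf"),
                  "myFile.pdf"),
    $manifest := $results[1]
return
  (: Skip the manifest and return only the converted parts. :)
  $results[2 to last()]

=> all of the converted nodes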
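The naming rule described for the filename parameter can be sketched as a small helper that predicts the generated names. The function local:converted-names is hypothetical and is not part of MarkLogic. It assumes that only the last "." separates the root from the extension, which the description above does not spell out.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: predicts the XHTML name and the parts directory name. :)
declare function local:converted-names($filename as xs:string) as xs:string+
{
  (: Assumption: replace only the final "." with "_"; names without an
     extension are left unchanged. :)
  let $root := fn:replace($filename, "\.([^.]*)$", "_$1")
  return (
    fn:concat($root, ".xhtml"),   (: name of the converted xml node :)
    fn:concat($root, "_parts")    (: directory for css files, images, and so on :)
  )
};

local:converted-names("myFile.pdf")
```

For "myFile.pdf" this yields "myFile_pdf.xhtml" and "myFile_pdf_parts", matching the names given in the filename parameter description. The names the converter actually generated are listed in the manifest, which remains the authoritative source.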