
xdmp:document-filter

xdmp:document-filter(
   $doc as node(),
   [$options as (element()|map:map)?]
) as node()

Summary

Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.

Parameters
doc Document to filter, as binary node().
options Options element for this extraction. The options element must be in the xdmp:document-filter namespace. The default value is ().

Options include:

<excelmode>

Default value: csv

A value of csv (the default) specifies inclusion of all strings, dates, and numbers, and preserves row-by-row ordering. A value of text specifies text only.

<emailmode>

Default value: VisibleHeaders

A value of VisibleHeaders (the default) specifies inclusion of only commonly displayed email headers. A value of AllHeaders specifies inclusion of all email headers.

<pdfxmpmeta>

Default value: true

A value of true (the default) specifies inclusion of XMP metadata. A value of false suppresses inclusion of XMP metadata.

<pdfbookmarks>

Default value: true

A value of true (the default) specifies inclusion of PDF bookmarks. A value of false suppresses inclusion of PDF bookmarks.

<pdfannotations>

Default value: true

A value of true (the default) specifies inclusion of PDF annotations. A value of false suppresses inclusion of PDF annotations.

<pdfwordorder>

Default value: Reading

A value of Reading (the default) specifies extraction of text in an order as close as possible to that which would be read on a page. A value of Document specifies extraction of text in the order in which it is stored in the document.

<pdfdehyphenate>

Default value: false

A value of true specifies removal of hyphens from the ends of lines so that line-broken words (for example, in a PDF file) are expressed as a single word.
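Several of the options above can be combined in a single options element. The following is a minimal sketch, not a definitive recipe: the URI /docs/report.pdf is hypothetical and assumes a binary PDF document has already been loaded at that path.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI; assumes a binary PDF exists at this path. :)
let $pdf := fn:doc("/docs/report.pdf")

(: The options element must be in the xdmp:document-filter namespace. :)
let $options :=
  <options xmlns="xdmp:document-filter">
    <!-- Extract text in stored order rather than reading order. -->
    <pdfwordorder>Document</pdfwordorder>
    <!-- Rejoin words that were hyphenated across line breaks. -->
    <pdfdehyphenate>true</pdfdehyphenate>
    <!-- Leave PDF annotations out of the output. -->
    <pdfannotations>false</pdfannotations>
  </options>

return xdmp:document-filter($pdf, $options)
```

Options that are omitted keep the default values listed above.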

Usage Notes

This function requires a separate converter installation package in MarkLogic 8 releases starting with 8.0-8. See MarkLogic Converters Installation Changes in Version 8 Releases Starting at 8.0-8 in the Installation Guide for All Platforms.

Document metadata is returned in XHTML meta elements. The document title is in the title element. The format of the document is returned as a MIME media type in a meta element with the name "content-type". Metadata values with recognized date formats are converted to ISO 8601.

If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
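The metadata layout described above can be read back with ordinary XPath over the returned XHTML. This is a sketch under stated assumptions: the URI /docs/report.pdf is hypothetical, and the xhtml prefix is declared here only for convenience.

```xquery
xquery version "1.0-ml";

declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Hypothetical URI; assumes a binary document exists at this path. :)
let $filtered := xdmp:document-filter(fn:doc("/docs/report.pdf"))

return (
  (: MIME media type reported in the "content-type" meta element. :)
  $filtered//xhtml:meta[@name = "content-type"]/@content/fn:string(),

  (: Document title, if the filter found one. :)
  $filtered//xhtml:title/fn:string(),

  (: False for metadata-only documents such as audio or video. :)
  fn:exists($filtered//xhtml:body)
)
```

Checking for the body element before processing text avoids treating a metadata-only result as an empty document.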

If Microsoft Office documents (for example, xlsx) are password-protected or password-encrypted, they cannot be successfully filtered.

Example

The following is a sample options node which specifies that PDF
bookmarks are not to appear in the text output:

  <options xmlns="xdmp:document-filter">
    <pdfbookmarks>false</pdfbookmarks>
  </options>

Example

xdmp:document-filter(doc("wordperfect.wpd"))

=> Filters the wordperfect.wpd document to XHTML.

Example

xquery version "1.0-ml";

xdmp:document-filter(
 xdmp:http-get("http://www.marklogic.com/images/logo.gif")[2])
=>
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="content-type" content="image/gif"/>
    <meta name="filter-capabilities" content="none"/>
    <meta name="size" content="2199"/>
  </head>
</html>
