xdmp:document-filter

xdmp:document-filter(
   $doc as node(),
   [$options as (element()|map:map)?]
) as node()

Summary

Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.

Parameters
doc Document to filter, as binary node().
options Options for this extraction, as an element or a map:map. An options element must be in the xdmp:document-filter namespace. The default value is ().

Options include:

<excelmode>

Default value: csv

A value of csv (the default) specifies inclusion of all strings, dates, and numbers, and preserves row-by-row ordering. A value of text specifies text only.

<emailmode>

Default value: VisibleHeaders

A value of VisibleHeaders (the default) specifies inclusion of only commonly displayed email headers. A value of AllHeaders specifies inclusion of all email headers.

<pdfxmpmeta>

Default value: true

A value of true (the default) specifies inclusion of XMP metadata. A value of false suppresses inclusion of XMP metadata.

<pdfbookmarks>

Default value: true

A value of true (the default) specifies inclusion of PDF bookmarks. A value of false suppresses inclusion of PDF bookmarks.

<pdfannotations>

Default value: true

A value of true (the default) specifies inclusion of PDF annotations. A value of false suppresses inclusion of PDF annotations.

<pdfwordorder>

Default value: Reading

A value of Reading (the default) specifies extraction of text in an order as close as possible to that which would be read on a page. A value of Document specifies extraction of text in the order in which it is stored in the document.

<pdfdehyphenate>

Default value: false

A value of true specifies removal of hyphens from the ends of lines so that line-broken words (for example, in a PDF file) are expressed as a single word.

<extension>

Default value: ""

A string value specifying the file extension (for example, xlsx) to associate with the binary node.
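As an illustration of combining several of the options above in a single call, the following sketch sets three PDF options in one options element. The URI some.pdf is a placeholder and is assumed to refer to a PDF stored in the database.

```xquery
xquery version "1.0-ml";

(: Sketch only: some.pdf is a placeholder URI for a PDF in the database. :)
xdmp:document-filter(
  fn:doc("some.pdf"),
  <options xmlns="xdmp:document-filter">
    <!-- Extract text in stored order rather than reading order -->
    <pdfwordorder>Document</pdfwordorder>
    <!-- Rejoin words that were hyphenated across line breaks -->
    <pdfdehyphenate>true</pdfdehyphenate>
    <!-- Leave PDF annotations out of the output -->
    <pdfannotations>false</pdfannotations>
  </options>
)
```

Options that are not listed keep their default values.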

Usage Notes

This function is part of a separate package, which may generate temporary files. These temporary files are not covered by encryption at rest.

Starting with release 9.0-4, this function requires a separate converter installation package. See MarkLogic Converters Installation Changes Starting at Release 9.0-4 in the Installation Guide for All Platforms.

Document metadata is returned in XHTML meta elements. The document title is in the title element. The format of the document is returned as a MIME media type in a meta element with the name "content-type". Metadata values with recognized date formats are converted to ISO8601.
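Since the metadata arrives as ordinary XHTML elements, it can be read back with path expressions. The following is a minimal sketch, assuming some.pdf is a placeholder URI for a document in the database; the xhtml prefix is declared here only for convenience.

```xquery
xquery version "1.0-ml";

declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Sketch only: some.pdf is a placeholder URI. :)
let $filtered := xdmp:document-filter(fn:doc("some.pdf"))
return (
  (: Document title, if the filter found one :)
  $filtered//xhtml:title/fn:string(),
  (: MIME media type reported by the filter :)
  $filtered//xhtml:meta[@name = "content-type"]/@content/fn:string()
)
```

Other metadata values can be read the same way by matching on the name attribute of the meta elements.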

If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
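Code that consumes the filtered output may therefore want to check for a body element before extracting text. The following sketch shows one way to do that; recording.mp3 is a hypothetical URI, and the fallback string is purely illustrative.

```xquery
xquery version "1.0-ml";

declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Sketch only: recording.mp3 is a hypothetical URI for a binary document. :)
let $filtered := xdmp:document-filter(fn:doc("recording.mp3"))
return
  if (fn:exists($filtered//xhtml:body))
  then
    (: Collapse the extracted text into a single normalized string :)
    fn:normalize-space(fn:string-join($filtered//xhtml:body//text(), " "))
  else
    "metadata only: no extractable text"
```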

Microsoft Office documents (for example, xlsx) that are password-protected or password-encrypted cannot be successfully filtered.
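When a batch may include documents that cannot be filtered, it can be useful to guard the call. The sketch below assumes that a filtering failure surfaces as an exception, which this page does not state explicitly. The URI protected.xlsx and the log message are hypothetical.

```xquery
xquery version "1.0-ml";

declare namespace error = "http://marklogic.com/xdmp/error";

(: Sketch only: assumes a filtering failure raises an exception.
 : protected.xlsx is a hypothetical URI. :)
try {
  xdmp:document-filter(fn:doc("protected.xlsx"))
} catch ($e) {
  (: $e holds the error element; log its code and return the empty sequence :)
  xdmp:log(fn:concat("Filtering failed: ", $e/error:code/fn:string())),
  ()
}
```

Returning the empty sequence lets a caller skip the document and continue with the rest of the batch.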

Example

xdmp:document-filter(doc("wordperfect.wpd"))

(: Filters the wordperfect.wpd document to XHTML. :)

Example

(: Including an options node in the call :)
xquery version "1.0-ml";

xdmp:document-filter(fn:doc('some.pdf'),
  <options xmlns="xdmp:document-filter">
    <pdfbookmarks>false</pdfbookmarks>
  </options>
)

(: Produces filtering output with PDF bookmarks excluded. :)

Example

(: Including an options map in the call :)
xquery version "1.0-ml";

xdmp:document-filter(
  fn:doc('some.pdf'),
  map:map() => map:with("pdfbookmarks", fn:false())
)

(: Produces filtering output with PDF bookmarks excluded. :)

Example

xquery version "1.0-ml";

xdmp:document-filter(
 xdmp:http-get("http://www.marklogic.com/images/logo.gif")[2])

(: Produces output similar to the following, derived from data in
 : the response to an HTTP GET request.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="content-type" content="image/gif"/>
    <meta name="filter-capabilities" content="none"/>
    <meta name="size" content="2199"/>
  </head>
</html>
:)

Example

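(: The document is a binary node, filtered with the extension option set to xlsx. :)
(: The hex string is a placeholder, since real content is too long to list here. :)
(: Its X characters are not valid hexadecimal; substitute real data before running. :)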

xquery version "1.0-ml";
 
let $doc := "D0CF11E0A1B11AE1000XXXXXXX"
 
let $bin := binary{xs:hexBinary($doc)}
 
return 
xdmp:document-filter($bin, 
  <options xmlns="xdmp:document-filter">
    <extension>xlsx</extension>
  </options>)
    