MarkLogic 12 Product Documentation
xdmp.documentFilter

xdmp.documentFilter(
   doc as Node,
   [options as Object?]
) as Node

Summary

Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.

Parameters

doc Document to filter, as binary node().

options

Options object for this extraction. The default value is null.

Options include:

excelmode

Default value: csv

A value of csv (the default) specifies inclusion of all strings, dates, and numbers, and preserves row-by-row ordering. A value of text specifies text only.

emailmode

Default value: VisibleHeaders

A value of VisibleHeaders (the default) specifies inclusion of only commonly displayed email headers. A value of AllHeaders specifies inclusion of all email headers.

pdfxmpmeta

Default value: true

A value of true (the default) specifies inclusion of XMP metadata. A value of false suppresses inclusion of XMP metadata.

pdfbookmarks

Default value: true

A value of true (the default) specifies inclusion of PDF bookmarks. A value of false suppresses inclusion of PDF bookmarks.

pdfannotations

Default value: true

A value of true (the default) specifies inclusion of PDF annotations. A value of false suppresses inclusion of PDF annotations.

pdfwordorder

Default value: Reading

A value of Reading (the default) specifies extraction of text in an order as close as possible to that which would be read on a page. A value of Document specifies extraction of text in the order in which it is stored in the document.

pdfdehyphenate

Default value: false

A value of true specifies removal of hyphens from the ends of lines so that line-broken words (for example, in a PDF file) are expressed as a single word.

extension

Default value: ""

A string value indicating the file extension paired with a binary node.
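
Because the options are plain names and values, a typo in an option name is easy to miss. The following is a minimal sketch of a pre-flight check built only from the option names and values listed above. The helper name `checkFilterOptions` and the `ALLOWED` table are hypothetical and not part of the MarkLogic API, and the sketch assumes the boolean options are passed as JavaScript booleans, as in the `pdfbookmarks` example on this page.

```javascript
// Hypothetical helper, not part of the MarkLogic API.
// Option names and values are copied from the parameter list above.
// Assumption: boolean options are given as JavaScript booleans.
const ALLOWED = {
  excelmode: ['csv', 'text'],
  emailmode: ['VisibleHeaders', 'AllHeaders'],
  pdfxmpmeta: [true, false],
  pdfbookmarks: [true, false],
  pdfannotations: [true, false],
  pdfwordorder: ['Reading', 'Document'],
  pdfdehyphenate: [true, false]
};

// Returns a list of human-readable problems; an empty list means the
// options object uses only documented names and values.
function checkFilterOptions(options) {
  const problems = [];
  for (const [name, value] of Object.entries(options || {})) {
    if (name === 'extension') {
      // extension is a free-form string such as "xlsx"
      if (typeof value !== 'string') {
        problems.push('extension must be a string');
      }
    } else if (!Object.prototype.hasOwnProperty.call(ALLOWED, name)) {
      problems.push('unknown option: ' + name);
    } else if (!ALLOWED[name].includes(value)) {
      problems.push('unsupported value for ' + name + ': ' + String(value));
    }
  }
  return problems;
}
```

A caller might run `checkFilterOptions(options)` and pass the options to `xdmp.documentFilter` only when the returned list is empty.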

Usage Notes

This function is part of a separate package which may generate temporary files. These temporary files are not supported by encryption at rest.

Starting with release 9.0-4, this function requires a separate converter installation package. See MarkLogic Converters Installation Changes Starting at Release 9.0-4 in the Installation Guide for All Platforms.

Document metadata is returned in XHTML meta elements. The document title is in the title element. The format of the document is returned as a MIME media type in a meta element with the name "content-type". Metadata values with recognized date formats are converted to ISO 8601.

If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
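
As a rough illustration of the output shape described above, the sketch below reads the meta elements out of a serialized copy of the filter result. It is plain JavaScript working on a string, and it assumes each meta element lists name before content, as in the sample output on this page. The function names `metaFromXhtml` and `hasBody` are hypothetical. Inside MarkLogic, an XPath query over the returned node would likely be the more robust choice.

```javascript
// Sketch only: collects <meta name="..." content="..."/> pairs from
// serialized filter output. Assumes name precedes content in each element.
// Entity references in attribute values are not decoded.
function metaFromXhtml(xhtml) {
  const meta = {};
  const pattern = /<meta\s+name="([^"]*)"\s+content="([^"]*)"\s*\/?>/g;
  let match;
  while ((match = pattern.exec(xhtml)) !== null) {
    meta[match[1]] = match[2];
  }
  return meta;
}

// A document with metadata but no text has a head element and no body element.
function hasBody(xhtml) {
  return /<body[\s>\/]/.test(xhtml);
}

// Sample text taken from the GIF example output on this page.
const sample = [
  '<html xmlns="http://www.w3.org/1999/xhtml">',
  '  <head>',
  '    <meta name="content-type" content="image/gif"/>',
  '    <meta name="filter-capabilities" content="none"/>',
  '    <meta name="size" content="2199"/>',
  '  </head>',
  '</html>'
].join('\n');

const meta = metaFromXhtml(sample);
// meta['content-type'] holds the MIME media type of the filtered document.
```

For the sample above, `hasBody(sample)` is false, which matches the note about documents that have metadata but no text.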

If Microsoft Office documents (for example, xlsx) are password-protected or password-encrypted, they cannot be successfully filtered.

Example

// Basic example
xdmp.documentFilter(cts.doc('some.pdf'));

// Filters the PDF file stored in the database with URI 'some.pdf'

Example

// Including an options object in the call
xdmp.documentFilter(cts.doc('some.pdf'), {pdfbookmarks: false});

// Produces filtering output with PDF bookmarks excluded.

Example

xdmp.documentFilter(
 xdmp.httpGet("http://www.marklogic.com/images/logo.gif").toArray()[1])

// Produces output similar to the following, derived from data in
// the response to an HTTP GET request.
//
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <meta name="content-type" content="image/gif"/>
//     <meta name="filter-capabilities" content="none"/>
//     <meta name="size" content="2199"/>
//   </head>
// </html>

Example

// The input is a binary node; the extension option indicates that it is
// paired with the xlsx file extension.
// The hex string is a placeholder, since real content is too long to list here.
var doc = "D0CF11E0A1B11AE1000XXXXXXX";
var node = new NodeBuilder();
var bin = node.addBinary(doc).toNode();
var options = {"extension": "xlsx"};
xdmp.documentFilter(bin, options);
