

xdmp.documentFilter(
   doc as Node,
   [options as Object?]
) as Node


Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.

doc Document to filter, as binary node().
options Options element for this extraction. The default value is null.

Options include:


Default value: csv

A value of csv (the default) specifies inclusion of all strings, dates, and numbers, and preserves row-by-row ordering. A value of text specifies text only.


Default value: VisibleHeaders

A value of VisibleHeaders (the default) specifies inclusion of only commonly displayed email headers. A value of AllHeaders specifies inclusion of all email headers.


Default value: true

A value of true (the default) specifies inclusion of XMP metadata. A value of false suppresses inclusion of XMP metadata.


Default value: true

A value of true (the default) specifies inclusion of PDF bookmarks. A value of false suppresses inclusion of PDF bookmarks.


Default value: true

A value of true (the default) specifies inclusion of PDF annotations. A value of false suppresses inclusion of PDF annotations.


Default value: Reading

A value of Reading (the default) specifies extraction of text in an order as close as possible to that which would be read on a page. A value of Document specifies extraction of text in the order in which it is stored in the document.


Default value: false

A value of true specifies removal of hyphens from the ends of lines so that line-broken words (for example, in a PDF file) are expressed as a single word.


Default value: ""

A string value indicating the file extension paired with a binary node.

Usage Notes

This function is part of a separate package that may generate temporary files. These temporary files are not covered by encryption at rest.

Starting with release 9.0-4, this function requires a separate converter installation package; see MarkLogic Converters Installation Changes Starting at Release 9.0-4 in the Installation Guide for All Platforms.

Document metadata is returned in XHTML meta elements. The document title is in the title element. The format of the document is returned as a MIME media type in a meta element with the name "content-type". Metadata values with recognized date formats are converted to ISO8601.
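
As a rough illustration of reading that metadata, the sketch below collects name/content pairs from meta elements. It assumes the filter output has already been serialized to a string, which is not what the function returns (it returns a node), and the helper name extractMeta is hypothetical, not part of the MarkLogic API. It does not decode XML entities in attribute values.

```javascript
// Hypothetical helper (not a MarkLogic API): collects the name/content
// pairs of <meta> elements from filter output serialized to a string.
function extractMeta(xhtml) {
  const meta = {};
  // Find each <meta ...> tag, then read its attributes in either order.
  for (const tag of xhtml.match(/<meta\b[^>]*>/g) || []) {
    const name = /\bname="([^"]*)"/.exec(tag);
    const content = /\bcontent="([^"]*)"/.exec(tag);
    if (name && content) {
      meta[name[1]] = content[1];
    }
  }
  return meta;
}

// Sample input modeled on the image/gif output shown in the examples.
const sampleOutput =
  '<html xmlns="http://www.w3.org/1999/xhtml"><head>' +
  '<meta name="content-type" content="image/gif"/>' +
  '<meta name="size" content="2199"/>' +
  '</head></html>';

const sampleMeta = extractMeta(sampleOutput);
console.log(sampleMeta['content-type']); // image/gif
```

Inside MarkLogic itself, navigating the returned node directly is likely a better fit than string matching; the string form is used here only to keep the sketch self-contained.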

If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
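
A caller may want to detect this metadata-only case before attempting any text processing. The following is a minimal sketch under the same assumption as above, namely that the output is available as a serialized string; the helper name isMetadataOnly is hypothetical.

```javascript
// Hypothetical helper (not a MarkLogic API): reports whether serialized
// filter output has a head element but no body element.
function isMetadataOnly(xhtml) {
  // Require whitespace, '>' or '/' after the tag name so that, for
  // example, <header> is not mistaken for <head>.
  const hasHead = /<head[\s>\/]/.test(xhtml);
  const hasBody = /<body[\s>\/]/.test(xhtml);
  return hasHead && !hasBody;
}

const headOnly = '<html><head><meta name="size" content="2199"/></head></html>';
const withText = '<html><head><title>t</title></head><body><p>text</p></body></html>';

console.log(isMetadataOnly(headOnly)); // true
console.log(isMetadataOnly(withText)); // false
```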

If Microsoft Office documents (for example, xlsx) are password-protected or password-encrypted, they cannot be successfully filtered.


// Basic example
xdmp.documentFilter(cts.doc('some.pdf'));

// Filters the PDF file stored in the database with URI 'some.pdf'


// Including an options object in the call
xdmp.documentFilter(cts.doc('some.pdf'), {pdfbookmarks: false});

// Produces filtering output with PDF bookmarks excluded.



// Produces output similar to the following, derived from data in
// the response to an HTTP GET request.
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <meta name="content-type" content="image/gif"/>
//     <meta name="filter-capabilities" content="none"/>
//     <meta name="size" content="2199"/>
//   </head>
// </html>


// The doc is a binary node that is filtered as an xlsx file, as
// indicated by the extension option.
// The binary is a placeholder since a real one is too long to list here.
var doc = "D0CF11E0A1B11AE1000XXXXXXX";
var node = new NodeBuilder();
var bin = node.addBinary(doc).toNode();
var options = {"extension": "xlsx"};
xdmp.documentFilter(bin, options);
