xdmp.documentFilter( doc as Node, [options as Object?] ) as Node
Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.
This function is part of a separate package which may generate temporary files. These temporary files are not supported by encryption at rest.
This function requires a separate converter installation package starting with release 9.0-4; see MarkLogic Converters Installation Changes Starting at Release 9.0-4 in the Installation Guide for All Platforms.
Document metadata is returned in XHTML meta elements.
The document title is in the title element.
The format of the document is returned as a MIME media type in a meta element with the name "content-type".
Metadata values with recognized date formats are converted to ISO8601.
If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
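The output structure described above can be inspected with a small helper. The following is a minimal sketch in plain JavaScript, not part of the MarkLogic API: it assumes the filter result has already been serialized to a string (for example with xdmp.quote), and that each meta element lists its name attribute before its content attribute, as in the sample output on this page. The function name summarizeFilterOutput is hypothetical, and attribute values are returned as-is, without decoding XML entities.

```javascript
// Sketch: summarize the serialized XHTML produced by the filter.
// Assumption: the input is a string, with meta attributes in name/content order.
function summarizeFilterOutput(xhtml) {
  const meta = {};
  const metaPattern = /<meta\s+name="([^"]*)"\s+content="([^"]*)"\s*\/?>/g;
  let match;
  // Collect every name/content pair found in meta elements.
  while ((match = metaPattern.exec(xhtml)) !== null) {
    meta[match[1]] = match[2];
  }
  const titleMatch = /<title>([\s\S]*?)<\/title>/.exec(xhtml);
  return {
    title: titleMatch ? titleMatch[1].trim() : null,
    // The document format is reported in the "content-type" meta element.
    contentType: meta["content-type"] || null,
    // Metadata-only documents (audio, video, images) have no body element.
    hasBody: /<body[\s>\/]/.test(xhtml),
    meta,
  };
}

// Sample input modeled on the GIF output shown in the examples on this page.
const sample = [
  '<html xmlns="http://www.w3.org/1999/xhtml">',
  '<head>',
  '<meta name="content-type" content="image/gif"/>',
  '<meta name="filter-capabilities" content="none"/>',
  '<meta name="size" content="2199"/>',
  '</head>',
  '</html>',
].join("\n");

const summary = summarizeFilterOutput(sample);
console.log(summary.contentType, summary.hasBody); // prints "image/gif false"
```

Inside MarkLogic, working on the returned node directly is likely preferable to pattern matching on text; the string-based approach is used here only so the sketch runs anywhere.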
If Microsoft Office documents (for example, xlsx) are password-protected or password-encrypted, they cannot be successfully filtered.
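When filtering many documents, one that cannot be filtered should not necessarily stop the rest. The following is a minimal sketch of that pattern. It assumes that a document which cannot be filtered surfaces as a thrown error, which this page does not state explicitly. The names filterAll and fakeFilter are hypothetical; the filter function is passed in as a parameter so that the sketch can run outside MarkLogic with a stand-in.

```javascript
// Sketch: filter a batch without letting one bad document abort it.
// `filterFn` stands in for xdmp.documentFilter; passing it as a parameter
// is a choice made for this illustration, not part of the API.
function filterAll(docs, filterFn) {
  const filtered = [];
  const failed = [];
  for (const { uri, node } of docs) {
    try {
      filtered.push({ uri, xhtml: filterFn(node) });
    } catch (err) {
      // Assumption: a document that cannot be filtered raises an error.
      const reason = err && err.message ? err.message : String(err);
      failed.push({ uri, reason });
    }
  }
  return { filtered, failed };
}

// Stand-in filter for demonstration: rejects anything flagged as protected.
function fakeFilter(node) {
  if (node.protected) {
    throw new Error("cannot filter protected document");
  }
  return "<html/>";
}

const report = filterAll(
  [
    { uri: "/a.xlsx", node: { protected: false } },
    { uri: "/b.xlsx", node: { protected: true } },
  ],
  fakeFilter
);
console.log(report.filtered.length, report.failed.length); // prints "1 1"
```

In a MarkLogic module, the second argument would be xdmp.documentFilter itself, and each node would come from a call such as cts.doc(uri).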
// Basic example
xdmp.documentFilter(cts.doc('some.pdf'));
// Filters the PDF file stored in the database with URI 'some.pdf'

// Including an options object in the call
xdmp.documentFilter(cts.doc('some.pdf'), {pdfbookmarks: false});
// Produces filtering output with PDF bookmarks excluded.

xdmp.documentFilter(
  xdmp.httpGet("http://www.marklogic.com/images/logo.gif").toArray()[1]);
// Produces output similar to the following, derived from data in
// the response to an HTTP GET request.
//
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <meta name="content-type" content="image/gif"/>
//     <meta name="filter-capabilities" content="none"/>
//     <meta name="size" content="2199"/>
//   </head>
// </html>

// The doc is a binary node; the "extension" option tells the filter
// to treat it as an xlsx file.
// The binary content here is a placeholder, since a real one is too long to list.
var doc = "D0CF11E0A1B11AE1000XXXXXXX";
var node = new NodeBuilder();
var bin = node.addBinary(doc).toNode();
var options = {"extension": "xlsx"};
xdmp.documentFilter(bin, options);