xdmp:document-filter( $doc as node(), [$options as (element()|map:map)?] ) as node()
Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.
This function requires a separate converter installation package in MarkLogic 8 releases starting with 8.0-8. See MarkLogic Converters Installation Changes in Version 8 Releases Starting at 8.0-8 in the Installation Guide for All Platforms.
Document metadata is returned in XHTML meta elements.
The document title is in the title element.
The format of the document is returned as a MIME media type in a meta element with the name "content-type".
Metadata values with recognized date formats are converted to ISO 8601.
If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
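The following is a minimal sketch of how the metadata described above might be read from the filtered result. The document URI "/docs/sample.doc" and the summary wrapper element are hypothetical and used only for illustration. The XHTML namespace is the one shown in the sample output on this page.

```xquery
xquery version "1.0-ml";

declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: "/docs/sample.doc" is a hypothetical URI; substitute a document in your database :)
let $filtered := xdmp:document-filter(fn:doc("/docs/sample.doc"))
return
  <summary>
    {
      (: the document title, if the filter found one :)
      <title>{ fn:string(($filtered//xhtml:title)[1]) }</title>,
      (: the MIME media type reported in the "content-type" meta element :)
      <content-type>{
        fn:string(($filtered//xhtml:meta[@name = "content-type"])[1]/@content)
      }</content-type>,
      (: false for documents with metadata but no text, such as audio or video :)
      <has-body>{ fn:exists($filtered//xhtml:body) }</has-body>
    }
  </summary>
```

Checking for the body element before processing the text is one way to skip documents that carry only metadata.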
If Microsoft Office documents (for example, xlsx) are password-protected or password-encrypted, they cannot be successfully filtered.
The following is a sample options node which specifies that PDF bookmarks are not to appear in the text output:

  <options xmlns="xdmp:document-filter">
    <pdfbookmarks>false</pdfbookmarks>
  </options>
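As a sketch, the sample options node can be passed as the second argument of the function, following the signature at the top of this page. The document URI "/docs/report.pdf" is hypothetical.

```xquery
xquery version "1.0-ml";

(: "/docs/report.pdf" is a hypothetical URI; substitute a PDF in your database :)
xdmp:document-filter(
  fn:doc("/docs/report.pdf"),
  (: options node from this page: keep PDF bookmarks out of the text output :)
  <options xmlns="xdmp:document-filter">
    <pdfbookmarks>false</pdfbookmarks>
  </options>
)
```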
xdmp:document-filter(doc("wordperfect.wpd"))
=> Filters the wordperfect.wpd document to XHTML.
xquery version "1.0-ml";
xdmp:document-filter(
  xdmp:http-get("http://www.marklogic.com/images/logo.gif")[2])
=>
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="content-type" content="image/gif"/>
    <meta name="filter-capabilities" content="none"/>
    <meta name="size" content="2199"/>
  </head>
</html>