xdmp:document-filter( $doc as node(), [$options as (element()|map:map)?] ) as node()
Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.
This function is part of a separate package which may generate temporary files. These temporary files are not supported by encryption at rest.
Starting with release 9.0-4, this function requires a separate converter installation package. See MarkLogic Converters Installation Changes Starting at Release 9.0-4 in the Installation Guide for All Platforms.
Document metadata is returned in XHTML meta elements.
The document title is in the title element.
The format of the document is returned as a MIME media type in a meta element with the name "content-type".
Metadata values with recognized date formats are converted to ISO 8601.
If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
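The notes above describe where the filtered output places the title, the content type, and the text. A minimal sketch of reading those values back might look like the following. The URI "some.pdf" and the summary wrapper element are placeholders chosen for illustration, and the sketch assumes the output uses the XHTML namespace, as in the sample output shown on this page.

```xquery
xquery version "1.0-ml";
declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: "some.pdf" is a placeholder URI; substitute a document in your database. :)
let $filtered := xdmp:document-filter(fn:doc("some.pdf"))
let $head := $filtered//xhtml:head
return
  (: <summary> is an illustrative wrapper, not part of the filter output. :)
  <summary>
    <title>{ fn:string($head/xhtml:title) }</title>
    <content-type>{
      fn:string(($head/xhtml:meta[@name = "content-type"]/@content)[1])
    }</content-type>
    <has-text>{ fn:exists($filtered//xhtml:body) }</has-text>
  </summary>
```

For a document with metadata but no text, such as an audio file, has-text would be false because no body element is present.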
Microsoft Office documents (for example, xlsx) that are password-protected or password-encrypted cannot be successfully filtered.
xdmp:document-filter(doc("wordperfect.wpd"))
(: Filters the wordperfect.wpd document to XHTML. :)
(: Including an options node in the call :)
xquery version "1.0-ml";
xdmp:document-filter(
  fn:doc('some.pdf'),
  <options xmlns="xdmp:document-filter">
    <pdfbookmarks>false</pdfbookmarks>
  </options>
)
(: Produces filtering output with PDF bookmarks excluded. :)
(: Including an options map in the call :)
xquery version "1.0-ml";
xdmp:document-filter(
  fn:doc('some.pdf'),
  map:map() => map:with("pdfbookmarks", fn:false())
)
(: Produces filtering output with PDF bookmarks excluded. :)
xquery version "1.0-ml";
xdmp:document-filter(
  xdmp:http-get("http://www.marklogic.com/images/logo.gif")[2])
(: Produces output similar to the following, derived from data in
 : the response to an HTTP GET request.
 :
 : <?xml version="1.0" encoding="UTF-8"?>
 : <html xmlns="http://www.w3.org/1999/xhtml">
 :   <head>
 :     <meta name="content-type" content="image/gif"/>
 :     <meta name="filter-capabilities" content="none"/>
 :     <meta name="size" content="2199"/>
 :   </head>
 : </html>
 :)
(: The document is a binary node that is filtered as an xlsx file. :)
(: The hex string is a fake, truncated placeholder, since the real one is too long to list here. :)
xquery version "1.0-ml";
let $doc := "D0CF11E0A1B11AE1000XXXXXXX"
let $bin := binary { xs:hexBinary($doc) }
return
  xdmp:document-filter(
    $bin,
    <options xmlns="xdmp:document-filter">
      <extension>xlsx</extension>
    </options>
  )