xdmp:document-filter

xdmp:document-filter(
   $doc as node(),
   [$options as (element()|map:map)?]
) as node()

Summary

Filters a wide variety of document formats, extracting metadata and text, and returning XHTML. The extracted text has very little formatting, and is typically used for search, classification, or other text processing.

Parameters
$doc Document to filter, as a binary node().
$options Options element for this extraction. The options element must be in the xdmp:document-filter namespace. The default value is ().

Options include:

<excelmode>

Default value: csv

A value of csv (the default) specifies inclusion of all strings, dates, and numbers, and preserves row-by-row ordering. A value of text specifies text only.

<emailmode>

Default value: VisibleHeaders

A value of VisibleHeaders (the default) specifies inclusion of only commonly displayed email headers. A value of AllHeaders specifies inclusion of all email headers.

<pdfxmpmeta>

Default value: true

A value of true (the default) specifies inclusion of XMP metadata. A value of false suppresses inclusion of XMP metadata.

<pdfbookmarks>

Default value: true

A value of true (the default) specifies inclusion of PDF bookmarks. A value of false suppresses inclusion of PDF bookmarks.

<pdfannotations>

Default value: true

A value of true (the default) specifies inclusion of PDF annotations. A value of false suppresses inclusion of PDF annotations.

<pdfwordorder>

Default value: Reading

A value of Reading (the default) specifies extraction of text in an order as close as possible to that which would be read on a page. A value of Document specifies extraction of text in the order in which it is stored in the document.

<pdfdehyphenate>

Default value: false

A value of true specifies removal of hyphens from the ends of lines so that line-broken words (for example, in a PDF file) are expressed as a single word.

Usage Notes

Document metadata is returned in XHTML meta elements. The document title is in the title element. The format of the document is returned as a MIME media type in a meta element with the name "content-type". Metadata values with recognized date formats are converted to ISO 8601.
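As a minimal sketch of reading that metadata back, the following lists each meta element of the filtered output as a name/content pair. The URI "/docs/report.pdf" is a hypothetical document assumed to exist in the database; the XHTML namespace is the one shown in the sample output on this page.

```xquery
xquery version "1.0-ml";
declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: "/docs/report.pdf" is a hypothetical URI used only for illustration :)
let $filtered := xdmp:document-filter(fn:doc("/docs/report.pdf"))
(: one string per meta element, for example the "content-type" entry :)
for $meta in $filtered//xhtml:meta
return fn:concat($meta/@name, ": ", $meta/@content)
```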

If the document has metadata but no text, like an audio or video document, the XHTML will have a head element but no body element.
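A caller can guard against the missing body element before using the extracted text. The sketch below assumes a hypothetical audio document at "/media/clip.mp3" and returns a placeholder string when no body is present.

```xquery
xquery version "1.0-ml";
declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: "/media/clip.mp3" is a hypothetical URI used only for illustration :)
let $filtered := xdmp:document-filter(fn:doc("/media/clip.mp3"))
let $body := ($filtered//xhtml:body)[1]
return
  if (fn:exists($body))
  then fn:string($body)             (: extracted text :)
  else "no extractable text"        (: metadata-only result: head but no body :)
```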

If Microsoft Office documents (for example, xlsx) are password-protected or password-encrypted, they cannot be filtered successfully.

Example

The following is a sample options node which specifies that PDF
bookmarks are not to appear in the text output:

  <options xmlns="xdmp:document-filter">
    <pdfbookmarks>false</pdfbookmarks>
  </options>
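An options node like this one is passed as the second argument of the function. The following sketch assumes a hypothetical PDF at "/docs/report.pdf" and also turns on pdfdehyphenate, one of the options described above.

```xquery
xquery version "1.0-ml";

(: "/docs/report.pdf" is a hypothetical URI used only for illustration :)
xdmp:document-filter(
  fn:doc("/docs/report.pdf"),
  <options xmlns="xdmp:document-filter">
    <pdfbookmarks>false</pdfbookmarks>
    <pdfdehyphenate>true</pdfdehyphenate>
  </options>)
```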

Example

xdmp:document-filter(doc("wordperfect.wpd"))

=> Filters the wordperfect.wpd document to XHTML.

Example

xquery version "1.0-ml";

xdmp:document-filter(
 xdmp:http-get("http://www.marklogic.com/images/logo.gif")[2])
=>
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="content-type" content="image/gif"/>
    <meta name="filter-capabilities" content="none"/>
    <meta name="size" content="2199"/>
  </head>
</html>

Comments

  • I followed https://docs.marklogic.com/guide/cpf/default to set up a pipeline that makes binary documents searchable. I can see that .xml and .xhtml documents are generated from the ingested file. When I search with a Java Client API search query, the results come from the generated XML file rather than from the ingested file. How can I get the URI of the ingested document when searching with the Java Client API? I want to display the actual document's content, so the document URI would be useful.
  • When I put xdmp:document-filter() in a spawn-function() I get out-of-memory errors. 2017-01-10 11:53:52.038 Notice: TaskServer: SVC-PROCESSRUN: xdmp:document-filter(fn:doc("/ediscovery/mydocument.doc")/binary()) -- Process run error: fork: Cannot allocate memory Any suggestions?
  • Can we run a content search on binary documents? I have converted a PDF file to binary and am trying to search its content. Can anyone please suggest an approach?
    • A PDF is already a binary document. After running xdmp:document-filter() on it (or other binary formats), you can insert the resulting XHTML as a new document, or as properties on the original document. Once in the database, that content will be visible to search. You can make this process happen during ingestion by applying a transform (if using the REST API) or by using CPF (with the "Document Filtering (Properties)" or "Document Filtering (XHTML)" pipeline).
      • When applying document-filter to PDFs, is there a way to show page numbers as separate nodes in the output?
      • Hi David, thanks for your reply. Can you please help me with the steps required to enable CPF, and explain how well it works during ingestion? Is there a standard procedure for setting up CPF?
        • Sure: http://docs.marklogic.com/guide/cpf/quickStart
          • Hi David, Is there a way to add metadata to properties while importing the documents into the ML database? If yes, please suggest the procedure to do it.
            • After reading through the CPF Guide, take a look at the built-in "Document Filtering (Properties)" pipeline.
              • Hi David, in the screenshot below, I would also like to add a new name, title, and description to the image document filter. Is it possible to do so?
                • Yes. See xdmp:document-add-properties: http://docs.marklogic.com/xdmp:document-add-properties
            • Also, could you please let me know how we add metadata to images?
              • xdmp:document-filter will extract metadata from images and a lot of other binary formats. You'll store the binary as a document and the metadata in a properties fragment: http://docs.marklogic.com/guide/app-dev/properties#id_19516
  • If you're looking to convert PDF files, take a look at xdmp:pdf-convert, which has more capabilities. For Office files, there's xdmp:word-convert, xdmp:powerpoint-convert, and xdmp:excel-convert for DOC, PPT, and XLS (but these functions don't support the Office 2007 and later formats with the DOCX, PPTX, and XLSX extensions).