Search Developer's Guide — Chapter 20

Extracting Metadata and Text From Binary Documents

This chapter describes how to extract metadata and/or text from binary documents. It covers an overview of metadata and text extraction, usage examples, and the supported binary formats.

Metadata and Text Extraction Overview

Binary documents often have associated metadata. For example, a JPEG image from a camera may include the camera's type and model number, a timestamp of when the picture was taken, and so on.

MarkLogic Server can access binary document metadata and then store it as XML in a properties document. You can then search and retrieve the metadata using MarkLogic Server's rich XML search capabilities. In addition, for text-based binary documents, such as those in Microsoft Word format, MarkLogic can extract and index their text content.

MarkLogic Server offers the XQuery built-in, xdmp:document-filter, and the JavaScript method, xdmp.documentFilter, to extract metadata and text from binary documents. These functions return the metadata and text as XHTML. The results may be used as document properties. The extracted text contains little formatting or structure, so it is best used for search, classification, or other text processing.

Usage Examples

The following sections show how xdmp:document-filter works with various file types. The Microsoft Word section also provides code to extract only the metadata elements from combined metadata and text results.

Microsoft Word

The following query and results are for a Microsoft Word document containing only the text 'This is a test':

xquery version "1.0-ml";
  xdmp:document-filter(doc("/documents/test.docx"))

Returns:

  <?xml version="1.0" encoding="UTF-8"?>
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <meta name="content-type" content="application/msword"/>
      <meta name="filter-capabilities" 
            content="text subfiles HD-HTML"/>
      <meta name="AppName" content="Microsoft Office Word"/>
      <meta name="Author" content="Clark Kent"/>
      <meta name="Company" content="MarkLogic"/>
      <meta name="Creation_Date" content="2011-10-11T02:40:00Z"/>
      <meta name="Description" content="This is my comment."/>
      <meta name="Last_Saved_Date" content="2011-10-11T02:41:00Z"/>
      <meta name="Line_Count" content="1"/>
      <meta name="Paragraphs_Count" content="1"/>
      <meta name="Revision" content="1"/>
      <meta name="Subject" content="Creating binary doc props"/>
      <meta name="Template" content="Normal"/>
      <meta name="Typist" content="Clark Kent"/>
      <meta name="Word_Count" content="4"/>
      <meta name="isys" content="SubType: Word 2007"/>
      <meta name="size" content="12691"/>
    </head>
    <body>
      <p>
      </p>
      <p>
         This is a test.</p>
      <p>
      </p>
    </body>
  </html>

In the document, the word 'test' is both italicized and bolded. xdmp:document-filter does not return such text formatting.

Expanding on the previous example, the following code uses xdmp:document-filter to extract only the metadata from that same Microsoft Word document and stores each metadata item as a document property:

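xquery version "1.0-ml";
(: Store each <meta> element returned by the filter as a document property. :)
let $url := "/documents/test.docx"
return xdmp:document-set-properties(
  $url,
  (: The *: wildcard matches meta elements in the XHTML namespace. :)
  for $meta in xdmp:document-filter(fn:doc($url))//*:meta
  (: The name attribute becomes the element name, the content attribute its value. :)
  return element {$meta/@name} {fn:string($meta/@content)}
)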

The properties document now looks as follows:

xdmp:document-properties('/documents/test.docx')

returns:

<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <content-type>application/msword</content-type>
  <filter-capabilities>text subfiles HD-HTML</filter-capabilities>
  <AppName>Microsoft Office Word</AppName>
  <Author>Clark Kent</Author>
  <Company>MarkLogic</Company>
  <Creation_Date>2011-10-11T02:40:00Z</Creation_Date>
  <Description>This is my comment.</Description>
  <Last_Saved_Date>2011-10-11T02:41:00Z</Last_Saved_Date>
  <Line_Count>1</Line_Count>
  <Paragraphs_Count>1</Paragraphs_Count>
  <Revision>1</Revision>
  <Subject>Creating binary doc props</Subject>
  <Template>Normal</Template>
  <Typist>Clark Kent</Typist>
  <Word_Count>4</Word_Count>
  <isys>SubType: Word 2007</isys>
  <size>12691</size>
  <prop:last-modified>2011-10-12T09:47:10-07:00</prop:last-modified>
</prop:properties>
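
Once the metadata is stored as properties, it can be searched like any other XML. The following is a minimal sketch, not a definitive recipe. It assumes the properties were stored as in the example above (so each document has an Author element, in no namespace, in its properties document) and that cts:properties-query is available in your release; newer releases also offer cts:properties-fragment-query for the same purpose.

```xquery
xquery version "1.0-ml";
(: Sketch: find documents whose stored Author property is "Clark Kent".
   Assumes properties were written by the xdmp:document-set-properties
   example above. :)
for $doc in cts:search(
  fn:doc(),
  cts:properties-query(
    cts:element-value-query(xs:QName("Author"), "Clark Kent")))
(: Return the URI of each matching document rather than its content. :)
return xdmp:node-uri($doc)
```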

File Archives

If you need to extract files from zip archives for individual processing, use xdmp:zip-manifest and xdmp:zip-get. Use xdmp:document-filter if you just want all the text from the archive: it does not preserve the structure of the embedded files, but it does include all of their text. This is useful for locating the original binary in search results. If you search for 'Elvis' and use xdmp:document-filter on the various files, the results include every binary containing 'Elvis', whether it is a zip archive, Word document, or photo.

In this example, xdmp:document-filter runs on the file archive test.zip, which consists of two Word files and a JPEG file:

xquery version "1.0-ml";
  xdmp:document-filter(doc("/documents/test.zip"))

returns:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="content-type" content="application/zip"/>
    <meta name="filter-capabilities" content="subfiles"/>
    <meta name="AppName" content="Microsoft Office Word"/>
    <meta name="Author" content="Lois Lane"/>
    <meta name="Company" content="MarkLogic"/>
    <meta name="Creation_Date" content="2011-10-14T21:11:00Z"/>
    <meta name="Last_Saved_Date" content="2011-10-14T21:11:00Z"/>
    <meta name="Line_Count" content="1"/>
    <meta name="Paragraphs_Count" content="1"/>
    <meta name="Revision" content="2"/>
    <meta name="Template" content="Normal"/>
    <meta name="Typist" content="Lois Lane"/>
    <meta name="Word_Count" content="3"/>
    <meta name="isys" content="SubType: Word 2007"/>
    <meta name="Focal_Length" content="4"/>
    <meta name="Make" content="LG Electronics"/>
    <meta name="Model" content="VM670"/>
    <meta name="Original_Date_Time" content="2011:10:19 14:59:24"/>
    <meta name="Original_Date_Time.datetime"
          content="2011-10-19T14:59:24Z"/>
    <meta name="ResolutionUnit" content="2"/>
    <meta name="XResolution" content="72.000000"/>
    <meta name="YResolution" content="72.000000"/>
    <meta name="AppName" content="Microsoft Office Word"/>
    <meta name="Author" content="Clark Kent"/>
    <meta name="Company" content="MarkLogic"/>
    <meta name="Creation_Date" content="2011-10-11T02:40:00Z"/>
    <meta name="Last_Saved_Date" content="2011-10-11T02:41:00Z"/>
    <meta name="Line_Count" content="1"/>
    <meta name="Paragraphs_Count" content="1"/>
    <meta name="Revision" content="1"/>
    <meta name="Template" content="Normal"/>
    <meta name="Typist" content="Clark Kent"/>
    <meta name="Word_Count" content="2"/>
    <meta name="isys" content="SubType: Word 2007"/>
    <meta name="size" content="47730"/>
  </head>
  <body>
    <p>
    </p>
    <p>
      This is a another test.</p>
    <p>
    </p>
    <p>
      This is a test.</p>
    <p>
    </p>
  </body>
</html>

While each sentence in this example's returned HTML body text is from a different file, there is no way to distinguish which text comes from which file. Similarly, the returned subfile metadata is not guaranteed to be in file order (for example, two adjacent meta elements might come from different documents in the archive), so it cannot be reliably associated with an individual subfile either.

Also, individual subfiles in the archive are not necessarily distinguishable at all. In the above example, you cannot tell from the output how many files, or what file types, are in the archive. When using xdmp:document-filter on an archive, you should think of the archive as a single file, rather than a compilation of subfiles. You will get back all the metadata and text contained in the single archive file, but will have no way of associating that returned information with the individual subfiles it came from.
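
When the association between output and subfile matters, the alternative mentioned at the start of this section (xdmp:zip-manifest plus xdmp:zip-get) can be combined with xdmp:document-filter. The following is a sketch under stated assumptions, not a definitive implementation: the subfile wrapper element is a hypothetical name chosen for illustration, entries whose names end in "/" are assumed to be directories, and every remaining entry is assumed to be in a format the filter can handle.

```xquery
xquery version "1.0-ml";
declare namespace zip = "xdmp:zip";

let $archive := fn:doc("/documents/test.zip")
(: The manifest lists one zip:part element per archive entry. :)
for $part in xdmp:zip-manifest($archive)//zip:part
let $name := fn:string($part)
(: Assumption: names ending in "/" are directory entries, so skip them. :)
where fn:not(fn:ends-with($name, "/"))
return
  (: <subfile> is a hypothetical wrapper that keeps each result
     tied to the entry it came from. :)
  <subfile name="{$name}">{
    xdmp:document-filter(
      xdmp:zip-get($archive, $name,
        <options xmlns="xdmp:zip-get">
          <format>binary</format>
        </options>))
  }</subfile>
```

Filtering each entry separately costs one filter call per subfile, but the metadata and text for each entry stay together.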

PowerPoint

The following query and results are for a two-slide PowerPoint document, where each slide has a title and separate content:

xquery version "1.0-ml";
  xdmp:document-filter(doc("/documents/test.pptx"))

returns:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="content-type"
          content="application/vnd.ms-powerpoint"/>
    <meta name="filter-capabilities" content="text subfiles HD-HTML"/>
    <title>This is a test </title>
    <meta name="AppName" content="Microsoft Office PowerPoint"/>
    <meta name="Author" content="Clark Kent"/>
    <meta name="Company" content="MarkLogic"/>
    <meta name="Creation_Date" content="2011-10-17T19:58:34Z"/>
    <meta name="Last_Saved_Date" content="2011-10-17T20:00:13Z"/>
    <meta name="Paragraphs_Count" content="4"/>
    <meta name="Presentation_Format" content="On-screen Show (4:3)"/>
    <meta name="Revision" content="1"/>
    <meta name="Slide_Count" content="2"/>
    <meta name="Typist" content="Clark Kent"/>
    <meta name="Word_Count" content="12"/>
    <meta name="isys" content="SubType: PowerPoint 2007"/>
    <meta name="size" content="36909"/>
  </head>
  <body>
    <p>
    </p>
    <p>
      This is a test </p>
    <p>
      Of PowerPoint</p>
    <p>
    </p>
    <p>
      Test #3
    </p>
    <p>
      Second Slide.</p>
    <p>
    </p>
  </body>
</html>

As with the Word example, text formatting is not returned. The output also gives no indication of what role the text played on a slide (title, body, and so on), and there is no way to tell which text belongs to which slide.
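
Because the body is just a flat run of paragraphs, a common next step is to reduce it to plain strings for search or classification. The following is a minimal sketch of that step; it assumes the filtered output has the shape shown above, with the text in p elements directly under body in the XHTML namespace.

```xquery
xquery version "1.0-ml";
declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Filter the presentation used in the example above. :)
let $filtered := xdmp:document-filter(fn:doc("/documents/test.pptx"))
(: Keep each non-empty paragraph, whitespace-normalized, in document order. :)
for $p in $filtered//xhtml:body/xhtml:p
let $text := fn:normalize-space($p)
where $text ne ""
return $text
```

For output shaped like the sample above, this yields one string per non-empty paragraph, with the empty paragraphs dropped. It does not recover slide boundaries, since the filter output does not contain them.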

Supported Binary Formats

The following sections list the binary file formats and file extensions from which xdmp:document-filter can extract metadata and, depending on the format, text. Because of the large number of formats, they are grouped into general application areas, such as Databases or Multimedia, and each area lists its applicable formats and extensions.

Some formats can be identified by xdmp:document-filter, but have no text or metadata to extract, such as executables. For these, the returned <meta name="content-type" content=.../> identifies the file's format.
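
A short sketch of reading the identified format from the filter output follows. The document URI is hypothetical; substitute any binary document in your database. The sketch assumes the content-type value appears in a meta element in the XHTML namespace, as in the examples earlier in this chapter.

```xquery
xquery version "1.0-ml";
declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Hypothetical URI, used only for illustration. :)
let $uri := "/documents/installer.exe"
let $filtered := xdmp:document-filter(fn:doc($uri))
(: Take the first content-type meta element; the result is an empty
   string if the filter returned none. :)
return fn:string(
  ($filtered//xhtml:meta[@name eq "content-type"]/@content)[1])
```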

Archives

Formats: 7-Zip, ACE, ARJ, Bzip2, ISO Disk Image, Java Archive, LZH, Microsoft Cabinet, Microsoft Office Binder, RedHat Package Manager, Roshal Archive, Self-extracting .exe, StuffIt, StuffIt Self Extracting Archive, StuffIt X, GNU Zip, UNIX cpio, UNIX Tar, Zip, PKZip, WinZip

Extensions: .7Z, .ACE, .ARJ, .BZ2, .CAB, .CPIO, .EXE, .GZ, .ISO, .JAR, .LZH, .OBD, .RAR, .RPM, .SIT, .SEA, .SITX, .TAR, .TBZ2, .ZIP

Databases

Formats: dBase, dBase III, Microsoft Access, Paradox Database

Extensions: .DB, .DBF, .DB3, .MDB

Email and Messaging

Formats: Encoded mail messages of any of the forms MHT, Multipart Alternative, Multipart Digest, Multipart Mixed, Multipart Newsgroup, Multipart Signed, and TNEF. Also, the individual formats Eudora, Microsoft Outlook, Microsoft Outlook Express, Microsoft Outlook Forms Template, Sendmail 'mbox', Thunderbird

Extensions: .EML, .MBOX, .MBX, .MHT, .MSG, .OFT, .PST

Multimedia

Formats: 3GP, Adobe Flash, Adobe Flash Video, Audio Video Interleave (AVI), DVD Information File, DVD Video Object, Microsoft Windows Movie Maker, Musical Instrument Digital Interface (MIDI), MPEG Video, MPEG-1 Audio Layer 3, MPEG-4 Video, MPEG-2 Audio Layer 3, OGG Flac Audio, OGG Vorbis Audio, QuickTime, Real Media, Waveform Audio File Format (WAVE), Windows Media Audio, Windows Media Video

Extensions: .3GP, .AIFF, .AVI, .BUP, .FLAC, .FLV, .IFO, .MID, .MIDI, .MOV, .MP3, .MP4, .MPG, .MSWMM, .OGG, .RM, .SMF, .SWF, .VOB, .WAV, .WMA, .WMV

Other

Formats: Apple Executable, BIN HEX Encoded, BitTorrent Metafile, Linux Executable and Linkable Format, Log File, Microsoft Project, Microsoft Windows DLL, Microsoft Windows Executable, Microsoft Windows Installer, Microsoft Windows Shortcut, Open Access II (OAII), VCard, Uniplex

Extensions: .BIN, .COM, .DLL, .ELF, .EXE, .HBX, .HEX, .HQXX, .LNK, .LOG, .MPP, .MPX, .MSI, .SYS, .TORRENT, .VCF

Presentation

Formats: IBM Lotus Symphony Presentation, LibreOffice Presentation, Microsoft PowerPoint for Windows or Macintosh, OpenOffice Impress, StarOffice Impress

Extensions: .ODP, .ODS, .PPT, .PPTX, .SDI, .SDP, .SXI

Raster Image

Formats: Encapsulated PostScript, Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG), Microsoft Document Imaging, Microsoft Windows Bitmap, PCX, Portable Network Graphic (PNG), Progressive JPEG, Tagged Image File Format (TIFF)

Extensions: .BMP, .EPS, .GFA, .GIF, .GIFF, .JIF, .JPEG, .JPG, .JPE, .MDI, .PCX, .PNG, .TIF, .TIFF

Spreadsheet

Formats: Comma Separated Values, Framework Spreadsheet, IBM Lotus Symphony, LibreOffice Spreadsheet, Lotus 1-2-3, Microsoft Excel for Windows or Mac, Microsoft Works SS for DOS or Windows, OpenOffice Calc, StarOffice Calc

Extensions: .CSV, .FW3, .ODS, .SX, .SXC, .SXS, .XLS, .XLSB, .XLSX, .WK., .WK3, .WK4, .WKS, .WPS

Text and Markup

Formats: ASCII Text (7 and 8 bit), ANSI Text (7 and 8 bit), HTML (text only, codes revealed, metadata only), IBM DCA, Microsoft HTML Help, Microsoft OneNote, Rich Text Format, SGML Text, Source, Transcript, Unicode UTF8 and UTF16 and UCS2, XML, Windows Enhanced Meta File, Windows Meta File

Extensions: .CHM, .DCA, .EMF, .HTM, .HTML, .ONE, .RFT, .RTF, .SGML, .TXT, .XML, .WMF

Vector Image

Formats: Adobe Illustrator, Adobe InDesign, Adobe Photoshop, AutoCAD Drawing, AutoCAD Drawing Exchange Format, Corel Draw Image, Intergraph-Microstation CAD, MathCAD, Microsoft XPS, Microsoft Visio

Extensions: .AI, .CDR, .DGN, .DWG, .DXF, .INDD, .MCD, .OXPS, .PSD, .VSD, .XMCD, .XPS

Word Processing and General Office

Formats: Adobe PDF, Adobe PostScript, Ami Pro for Windows, Apple iWork, Framework WP, Hangul, IBM DCA/FFT, IBM DisplayWrite, IBM Lotus Symphony Document, JustSystems Ichitaro, LibreOffice Document, Lotus Manuscript, Lotus Notes, Mass 11, Microsoft Publisher, Microsoft Word for DOS/Windows/Macintosh, QuarkXpress, MultiMate, MultiMate Advantage, OpenOffice Writer, Professional Write for DOS, Professional Write Plus for Windows, Q&A Write, QuickBooks Backup, QuickBooks for Windows, StarOffice Writer, TrueType Font, VCalendar Electronic Calendar, Wang IWP, Wang WP Plus, Windows Write, WinWord, WordPerfect for DOS/Macintosh/Windows, Wordstar for DOS/Windows, Wordstar 2000 for DOS, XYwrite

Extensions: .AMI, .DCA, .DOC, .DOCX, .DOX, .DW4, .FFT, .FW3, .ICS, .IWP, .JTD, .JBW, .JTT, .KEY, .M11, .MAN, .MANU, .MNU, .NSF, .NUMBERS, .ODT, .PAGES, .PDF, .PS, .PUT, .QCx, .QXx, .PW, .PW1, .PW2, .QA, .QA3, .QBB, .QBW, .RFT, .SAM, .SXW, .SDW, .TTF, .VCS, .WPD, .WRI, .WS, .WS2, .WSD, .XY
