
MarkLogic Server 11.0 Product Documentation
Loading Content Into MarkLogic Server
— Chapter 4

Loading Content Using XQuery

This chapter describes the XQuery interface for loading content and includes the following sections:

Built-In Document Loading Functions

The xdmp:document-load, xdmp:document-insert, and xdmp:document-get functions can all be used as part of loading documents into a database. The xdmp:document-load function allows you to load documents from the filesystem into the database. The xdmp:document-insert function allows you to insert an existing node into a document (either a new or an existing document). The xdmp:document-get function loads a document from disk into memory. If you are loading a new document, the combination of xdmp:document-get and xdmp:document-insert is equivalent to xdmp:document-load of a new document.

You can load external binary documents only by calling xdmp:document-insert with a constructed external-binary node. For details, see Creating External Binary References Using XQuery.

The version 2.x xdmp:load and xdmp:get functions are deprecated in the current version of MarkLogic Server; in their place, use the xdmp:document-load and xdmp:document-get functions.

The basic syntax of xdmp:document-load is as follows:

xdmp:document-load(
   $location as xs:string,
   [$options as node()]
) as empty-sequence()

The basic syntax of xdmp:document-insert is as follows:

xdmp:document-insert(
   $uri as xs:string,
   $root as node(),
   [$permissions as element(sec:permission)*],
   [$collections as xs:string*],
   [$quality as xs:integer],
   [$forest-ids as xs:unsignedLong*]
) as empty-sequence()

The basic syntax of xdmp:document-get is as follows:

xdmp:document-get(
   $location as xs:string,
   [$options as node()]
) as node()*

See the XQuery and XSLT Reference Guide for a more detailed syntax description.
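
As a minimal sketch of the equivalence described above, the following query reads a file into memory with xdmp:document-get and then stores it with xdmp:document-insert, which has the same net effect as calling xdmp:document-load on a new document. The filesystem path and the URI are hypothetical placeholders, not values from this guide.

```xquery
xquery version "1.0-ml";

(: Hypothetical path and URI, chosen only for illustration. :)
let $location := "/tmp/example.xml"
let $uri := "/docs/example.xml"

(: Step 1: read the file from the filesystem into memory. :)
let $doc := xdmp:document-get($location)

(: Step 2: store the in-memory node in the database under $uri. :)
return xdmp:document-insert($uri, $doc)
```

The two-step form is useful mainly when you want to inspect or transform the node between the read and the insert; otherwise a single xdmp:document-load call is simpler.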

Specifying a Forest in Which to Load a Document

In most situations, MarkLogic Server does a good job of determining the forest in which to put a document, and in general you should not need to override the defaults. When loading a document, however, you can use the <forests> node in an options node for xdmp:document-load, or the $forest-ids argument to xdmp:document-insert (the sixth argument), to specify one or more forests into which the document can be loaded. If you specify multiple forest IDs, the document is loaded into exactly one of the specified forests, and the system decides which one. Once the document is loaded into a forest, it stays in that forest unless you delete the document, reload it specifying a different forest, or clear the forest.

To load a document into a forest by explicitly specifying a forest key, the forest must exist and be attached to the database into which you are loading. Attempting to load a document into a forest that does not belong to the context database throws an exception. Additionally, the locking parameter must be set to strict in the database configuration; otherwise, an XDMP-PLACEKEYSLOCKING exception is thrown.
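
The attachment requirement can be checked in the query itself before the insert. The following is a sketch only, assuming a forest named myForest and a hypothetical error name FOREST-NOT-ATTACHED; it compares the forest ID against the forests of the context database and raises a descriptive error instead of attempting the insert.

```xquery
xquery version "1.0-ml";

(: "myForest" and the URI are placeholders for illustration. :)
let $forest := xdmp:forest("myForest")

(: Forest IDs attached to the database this query runs against. :)
let $attached := xdmp:database-forests(xdmp:database())

return
  if ($forest = $attached)
  then
    xdmp:document-insert("/myDocs/myDocument.xml",
                         <doc>example</doc>,
                         xdmp:default-permissions(),
                         xdmp:default-collections(),
                         0,
                         $forest)
  else
    (: Hypothetical application-level error name. :)
    fn:error(xs:QName("FOREST-NOT-ATTACHED"),
             "myForest is not attached to the context database")
```

This check does not cover the locking setting; the database must still be configured with strict locking for the insert to succeed.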

This section describes some aspects of forest-specific loading and includes the following parts:

Consider If You Really Want to Specify a Forest

For most applications, you should not specify the forest in which you want to load a document. MarkLogic Server has efficient ways of determining the forest into which to load a document, and those ways are almost always better than explicitly specifying the forest. By default, MarkLogic spreads documents across forests in a way that is optimized for both query and loading efficiency. If you are using Tiered Storage (for details, see Tiered Storage), it has its own way of partitioning documents that you should follow.

One pitfall of specifying a forest is that the URI you are loading may already exist in another forest within the same database. A duplicate URI is a form of content corruption and causes searches that select that URI to fail with an XDMP-DBDUPURI error. If you run into this error, the MarkLogic Knowledge Base has an article that contains a solution as well as some strategies for preventing duplicate URIs.
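
One way to reduce the risk described above is to look up where a URI currently lives before choosing a forest for it. The following sketch uses xdmp:document-forest to report the forest that holds a document, if any. The URI is a placeholder, and this check is an illustration rather than a complete safeguard against duplicate URIs.

```xquery
xquery version "1.0-ml";

(: Placeholder URI for illustration. :)
let $uri := "/myDocs/myDocument.xml"

(: Empty sequence if no document with this URI exists. :)
let $current := xdmp:document-forest($uri)

return
  if (fn:empty($current))
  then fn:concat($uri, " is not in the database yet")
  else fn:concat($uri, " is stored in forest ",
                 xdmp:forest-name($current))
```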

If you really want to specify the forest to which you load a document, the following describes some details about forest-specific loading.

Some Potential Advantages of Specifying a Forest

Because backup operations are performed at either the database or the forest level, loading a set of documents into specific forests allows you to effectively perform backup operations on that set of documents (by backing up the database or forest, for example).

Specifying a forest also allows you to have more control over the filesystems in which the documents reside. Each forest configuration includes a directory where the files are stored. By specifying the forest in which a document resides, you can control the directories (and in turn, the filesystems) in which the documents are stored. For example, you might want to place large, frequently accessed documents in a forest which resides on a RAID filesystem with complete failover and redundancy, whereas you might want to place documents which are small and rarely accessed in a forest which resides in a slower (and less expensive) filesystem.

Once a document is loaded into a forest, you cannot move it to another forest. If you want to change the forest in which a document resides, you must reload the document and specify another forest.
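
The reload described above can be sketched as a re-insert of the stored document with a different forest ID. This example assumes a target forest named archive (a hypothetical name) that is attached to the database, and strict locking as required for forest-specific loads. It carries over the existing permissions, collections, and quality explicitly so that they are not replaced by defaults.

```xquery
xquery version "1.0-ml";

(: Placeholder URI and forest name for illustration. :)
let $uri := "/myDocs/myDocument.xml"
let $target := xdmp:forest("archive")

return
  (: Re-insert the stored document, keeping its current metadata. :)
  xdmp:document-insert($uri,
                       fn:doc($uri),
                       xdmp:document-get-permissions($uri),
                       xdmp:document-get-collections($uri),
                       xdmp:document-get-quality($uri),
                       $target)
```

Test a query like this on non-production data first, and confirm the result with xdmp:document-forest before relying on it.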

Example: Examining a Document to Decide Which Forest to Specify

You can use the xdmp:document-get function to load a document into memory. Loading a document into memory lets you perform some processing or logic on the document before you store it in the database.

For example, if you want to make a decision about which forest to load a document into based on the document contents, you can put some simple logic in your load script as follows:

let $memoryDoc := xdmp:document-get("c:\myFiles\newDocument.xml")
(: Compare as integers; a string comparison would rank "9" above "1000000" :)
let $forest :=
     if ( xs:integer($memoryDoc//ID) gt 1000000 )
     then xdmp:forest("LargeID")
     else xdmp:forest("SmallID")
return
     xdmp:document-insert("/myCompany/newDocument.xml",
                          $memoryDoc,
                          xdmp:default-permissions(),
                          xdmp:default-collections(),
                          0,
                          $forest)

This code loads the document newDocument.xml into memory, finds the ID element in the in-memory document, and then inserts the node into the forest named LargeID if the ID is greater than 1,000,000, or into the forest named SmallID otherwise.

More Examples

The following command loads the document into the forest named myForest:

xdmp:document-load("c:\myFile.xml",
       <options xmlns="xdmp:document-load">
          <uri>/myDocs/myDocument.xml</uri>
          <permissions>{xdmp:default-permissions()}</permissions> 
          <collections>{xdmp:default-collections()}</collections> 
          <repair>full</repair>
          <forests>
            <forest>{xdmp:forest("myForest")}</forest>
          </forests>
        </options> )

The following command loads the document into either the forest named redwood or the forest named aspen:

xdmp:document-load("c:\myFile.xml",
       <options xmlns="xdmp:document-load">
          <uri>/myDocs/myDocument.xml</uri>
          <permissions>{xdmp:default-permissions()}</permissions> 
          <collections>{xdmp:default-collections()}</collections> 
          <repair>full</repair>
          <forests>
            <forest>{xdmp:forest("redwood")}</forest>
            <forest>{xdmp:forest("aspen")}</forest>
          </forests>
        </options> )

Creating External Binary References Using XQuery

An external binary node is a special reference to a binary file managed and stored in the file system separately from MarkLogic Server. You can create an external binary node in MarkLogic and insert the node in the database, creating an external binary reference document. The external binary reference document acts like a normal binary document, except that MarkLogic never actually stores the binary data internally, and instead transparently accesses the external file every time the document is accessed. Unlike normal binary documents, you do not use xdmp:document-load to insert an external binary reference document in the database. To insert an external binary reference document into the database, you first create a binary node using the xdmp:external-binary function and then insert the node into the database using xdmp:document-insert.

For example, the following code creates a document representing the external binary file /external/path/sample.jpg, beginning at offset 1 in the file, with a length of 1,024,000 bytes (approximately 1 MB):

xdmp:document-insert("/docs/xbin/sample.jpg",
    xdmp:external-binary(
        "/external/path/sample.jpg", 1,1024000))

When you provide a length to xdmp:external-binary, MarkLogic Server does not verify the existence or size of the external file. If you omit a length when calling xdmp:external-binary, the underlying external file must exist, and MarkLogic Server calculates the length in a manner equivalent to calling xdmp:filesystem-file-length.
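
As a sketch of the second case above, the following query omits the length argument, so the external file must already exist when the query runs and MarkLogic Server determines the length itself. The document URI is a hypothetical placeholder; the external path matches the earlier example.

```xquery
xquery version "1.0-ml";

(: Same external file as in the earlier example. :)
let $path := "/external/path/sample.jpg"

return
  (: No length given: the file at $path must exist, and the length
     is calculated from the file rather than supplied by the caller. :)
  xdmp:document-insert("/docs/xbin/sample-full.jpg",
                       xdmp:external-binary($path, 1))
```

Omitting the length trades a filesystem check at insert time for the assurance that the reference matches a real file.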
