Loading Content Into MarkLogic Server — Chapter 1

Designing a Content Loading Strategy

MarkLogic Server provides many ways to load content into a database, including built-in XQuery functions, the REST Client API, and the MarkLogic Content Pump (mlcp) command-line tool. Choosing the appropriate method for a specific use case depends on many factors, including the characteristics of your content, the source of the content, the frequency of loading, and whether the content needs to be repaired or modified during loading. In addition, environmental and operational factors such as workflow integration, development resources, performance considerations, and developer expertise often need to be considered when choosing the best tools and mechanisms.

The MarkLogic mechanisms for loading content provide varying trade-offs along a number of dimensions such as the following:

  • Usability and flexibility of the interface itself
  • Performance, scalability, I/O capacity
  • Loading frequency
  • Automation or scripting requirements
  • Workflow and integration requirements

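This chapter lists the various tools to load content and contains the following sections:

  • Available Content Loading Interfaces
  • Loading Activities
  • What to Consider Before Loading Content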

Available Content Loading Interfaces

There are several ways to load content into MarkLogic Server. The following table summarizes content loading interfaces and their benefits.

| Interface/Tool | Description | Benefits |
|---|---|---|
| MarkLogic Content Pump (mlcp) | A command-line tool for loading content into a MarkLogic database, extracting content from a MarkLogic database, or copying content between MarkLogic databases. | Ease of workflow integration; can leverage Hadoop processing; bulk loading of billions of local files; split and load aggregate XML or delimited text files |
| MarkLogic Connector for Hadoop | A set of Java classes that enables loading content from HDFS into MarkLogic Server. | Distributed processing of large amounts of data |
| Java Client API | A Java API for creating applications on top of MarkLogic Server. The API includes document manipulation and search operations. | Leverage existing Java programming skills |
| Node.js Client API | A set of Node.js interfaces for creating applications on top of MarkLogic Server. The API includes document manipulation and search operations. | Leverage existing Node.js programming skills |
| REST Client API | A set of HTTP REST services, hosted by MarkLogic Server, that enable developers to build applications on top of MarkLogic Server. The API includes document manipulation and search operations. | Leverage existing REST programming skills |
| XCC | XML Contentbase Connector (XCC) is an interface for communicating with MarkLogic Server from a Java middleware application layer. | Create multi-tier applications with MarkLogic Server as the underlying content repository |
| XQuery API | An extensive set of XQuery functions that provides maximum control. | Flexibility and expanded capabilities |
| Server-Side JavaScript API | An extensive set of JavaScript functions that execute on MarkLogic Server and provide maximum control. | Flexibility and expanded capabilities |

Loading Activities

Each loading interface supports a range of activities, all of which result in ingesting data into the database. The following list gives some of the things you might do through the interfaces. Which interface you use depends on which one you are most comfortable with and on the trade-offs between them (for example, ease of use versus extensibility). Each tool can usually accommodate each of these activities; where one tool has a specific feature that makes an activity easy, that tool is named in parentheses.

  • Load from a directory (mlcp)
  • Load from compressed files (mlcp)
  • Split single aggregate XML file into multiple documents (mlcp)
  • Load large numbers of small files
  • Load delimited text files (mlcp)
  • Enrich the documents
  • Extract information from the documents during ingestion (metadata, new elements)
  • Extract some information and load only extracted information
  • Load large binary files
  • Create neutral format archive (mlcp)
  • Copy from one MarkLogic database to another MarkLogic database (mlcp)
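
Several of the activities above (loading from compressed files, splitting an aggregate XML file) map to mlcp import options. The following is a minimal sketch, not a definitive recipe: it only assembles an argument list for an `mlcp.sh import` command and does not run it. The helper name `mlcp_import_args`, the host, port, credentials, input path, and the `order` record element are all hypothetical placeholders; check the option names against the mlcp documentation for your version before relying on them.

```python
import shlex


def mlcp_import_args(host, port, username, password, input_path,
                     compressed=False, aggregate_record_element=None):
    """Build an argument list for an `mlcp.sh import` invocation.

    Hypothetical helper for illustration; it does not execute mlcp.
    """
    args = [
        "mlcp.sh", "import",
        "-host", host,
        "-port", str(port),
        "-username", username,
        "-password", password,
        "-input_file_path", input_path,
    ]
    if compressed:
        # Tell mlcp that the input consists of compressed files.
        args += ["-input_compressed", "true"]
    if aggregate_record_element:
        # Split an aggregate XML file into one document per record element.
        args += ["-input_file_type", "aggregates",
                 "-aggregate_record_element", aggregate_record_element]
    return args


# All values below are assumed placeholders.
args = mlcp_import_args("localhost", 8000, "loader", "secret",
                        "/data/orders.zip", compressed=True,
                        aggregate_record_element="order")
print(shlex.join(args))
```

Keeping the arguments in a list (rather than one string) makes it straightforward to pass them to a process launcher later without shell-quoting problems.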

What to Consider Before Loading Content

Your content loading strategy depends on the complexity of your source content, the nature of the output to be inserted into the database, and many other factors. This section lists some of the areas to think about, with pointers to more detailed discussions, and contains the following parts:

  • Setting Document Permissions
  • Schemas
  • Fragments
  • Indexing

Setting Document Permissions

When you load documents into a database, be sure you either explicitly set permissions in the document loading API or have configured default permissions on the user (or on roles for that user) who is loading the documents. Default permissions are applied to a document when it is loaded if you do not explicitly set permissions.

Permissions on a document control access to capabilities (read, insert, update, and execute) on the document. Each permission consists of a capability and a corresponding role. To have a specific capability for a document, a user must have the role paired with that capability on the document permission. Default permissions are specified on roles and on users in the Admin Interface.
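
The capability-and-role pairing described above can be illustrated with a small conceptual model. This is only a sketch of the rule as stated in this section, not MarkLogic's implementation: the `Permission` type, the `has_capability` function, and the role names are hypothetical, and the model deliberately ignores role inheritance and the special treatment of the admin role.

```python
from typing import NamedTuple


class Permission(NamedTuple):
    """One document permission: a role paired with a capability."""
    role: str
    capability: str  # "read", "insert", "update", or "execute"


def has_capability(user_roles, doc_permissions, capability):
    """Return True if any of the user's roles is paired with `capability`."""
    return any(p.capability == capability and p.role in user_roles
               for p in doc_permissions)


# Hypothetical roles and permissions on a single document.
doc_permissions = [
    Permission("editor", "read"),
    Permission("editor", "update"),
    Permission("viewer", "read"),
]

print(has_capability({"viewer"}, doc_permissions, "read"))    # True
print(has_capability({"viewer"}, doc_permissions, "update"))  # False
```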

If you load a document without the needed permissions, users might not be able to read, update, or execute the document (this includes the user who loaded it). For an overview of security, see the Security Guide. For details on creating privileges and setting permissions, see the Security Administration chapter of the Administrator's Guide.

When you load a document, be sure that a named role has update permissions. For any document created by a user who does not have the admin role, the document must be created with at least one update permission or MarkLogic throws an XDMP-MUSTHAVEUPDATE exception during document creation. If there is no role on a document's permissions with update capability, or if the document has no permissions, then only users with the admin role can update or delete the document.
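
As one way to set permissions explicitly at load time, the REST Client API accepts `perm:role=capability` parameters on `PUT /v1/documents`. The sketch below only builds such a URL and sends nothing; authentication and the request body are omitted. The function name `document_put_url`, the host, port, document URI, and role names are assumptions for illustration. The client-side check for an update capability is also an assumption added here: it is stricter than the server, which can pick up an update permission from the loading user's default permissions.

```python
from urllib.parse import urlencode


def document_put_url(host, port, uri, permissions):
    """Build a PUT /v1/documents URL with explicit permissions.

    `permissions` is a list of (role, capability) pairs. Hypothetical
    helper: it refuses to build a URL with no update capability.
    """
    if not any(capability == "update" for _, capability in permissions):
        raise ValueError("at least one role needs the update capability")
    params = [("uri", uri)]
    # One perm:<role>=<capability> parameter per permission pair.
    params += [(f"perm:{role}", capability)
               for role, capability in permissions]
    # Keep ':' and '/' readable in the query string.
    query = urlencode(params, safe=":/")
    return f"http://{host}:{port}/v1/documents?{query}"


# Assumed placeholder values.
url = document_put_url("localhost", 8000, "/orders/1001.xml",
                       [("editor", "read"), ("editor", "update")])
print(url)
```

Failing early on the client gives a clearer message than waiting for the server to reject the insert, at the cost of rejecting loads that would have succeeded through default permissions.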

Schemas

Schemas are automatically invoked by the server when loading documents (for conducting content repair) and when evaluating queries (for proper data typing). If you plan to use schemas in your content loading strategy, review the information in the Loading Schemas chapter in the Application Developer's Guide.

Fragments

When loading data into a database, you have the option of specifying how XML documents are partitioned for storage into smaller blocks of information called fragments. For large XML documents, size can be an issue, and using fragments may help manage performance of your system. For a discussion of fragments, see Fragments in the Administrator's Guide.

Indexing

Before loading documents into a database, you have the option of specifying a number of parameters that impact how the text components of those documents are indexed. These settings can affect query performance and disk usage. For details, see Text Indexing in the Administrator's Guide.
