MarkLogic Connector for Hadoop Version 2.4

This bundle provides an API for a MarkLogic Server content connector for Apache Hadoop MapReduce.

Packages

com.marklogic.dom
    This package provides W3C DOM-based APIs to the internal on-disk
    representation of documents and their contents in the expanded tree
    cache of a MarkLogic database forest.
com.marklogic.io
com.marklogic.mapreduce
    MarkLogic Connector for Hadoop core interfaces.
com.marklogic.mapreduce.examples
    Examples of using MarkLogic Server in MapReduce jobs.
com.marklogic.mapreduce.functions
    Interfaces for using a MarkLogic Server lexicon as an input source.
com.marklogic.mapreduce.test
com.marklogic.mapreduce.utilities
com.marklogic.tree

The sections below introduce the connector API and its configuration.

For detailed information, see the MarkLogic Connector for Hadoop Developer's Guide.

Introduction

The MarkLogic Connector for Hadoop API allows you to use MarkLogic Server as an input source for Hadoop MapReduce jobs, as an output destination, or as both.

MarkLogic-specific key and value types for your MapReduce key-value pairs are provided in the com.marklogic.mapreduce package.

You may also use Apache Hadoop MapReduce types, such as Text, in certain circumstances. See ValueInputFormat and KeyValueInputFormat.

You may generate input data using MarkLogic Server lexicon functions by subclassing one of the lexicon function wrapper classes in com.marklogic.mapreduce.functions. Use lexicon functions with ValueInputFormat and KeyValueInputFormat.
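
For example, here is a minimal sketch of configuring a job to receive each input value as a plain Hadoop Text object. The property name shown is assumed from MarkLogicConstants; check that class for the authoritative list.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    import com.marklogic.mapreduce.ValueInputFormat;

    public class TextValueSetup {
        // Surface each input value as a Hadoop Text object instead of a
        // MarkLogic-specific type; the property name is assumed from
        // MarkLogicConstants.
        public static void useTextValues(Job job) {
            job.getConfiguration().set(
                "mapreduce.marklogic.input.valueclass", Text.class.getName());
            job.setInputFormatClass(ValueInputFormat.class);
        }
    }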

MarkLogic-specific MapReduce input and output format classes are likewise provided in the com.marklogic.mapreduce package. Input and output formats need not be the same type.
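
As an illustration, the driver sketch below pairs NodeInputFormat (NodePath keys, MarkLogicNode values) with standard HDFS output. The NodeCount class and its mapper are hypothetical; the MarkLogic connection properties are supplied through the configuration, as described under Configuration below.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import com.marklogic.mapreduce.MarkLogicNode;
    import com.marklogic.mapreduce.NodeInputFormat;
    import com.marklogic.mapreduce.NodePath;

    public class NodeCount {
        // Hypothetical mapper: emits one count per input node.
        public static class NodeMapper
                extends Mapper<NodePath, MarkLogicNode, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(NodePath key, MarkLogicNode value,
                    Context context) throws IOException, InterruptedException {
                context.write(new Text(key.toString()), ONE);
            }
        }

        public static void main(String[] args) throws Exception {
            // MarkLogic connection and query properties go in this
            // configuration (see Configuration below).
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "node count");
            job.setJarByClass(NodeCount.class);
            job.setInputFormatClass(NodeInputFormat.class); // MarkLogic input
            job.setMapperClass(NodeMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }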

Configuration

Configure the connector using the standard Hadoop configuration mechanism. That is, use a Hadoop configuration file to define property values, or set properties programmatically on your Job's Configuration object.

The configuration properties available for the connector are described in MarkLogicConstants.

When using MarkLogic Server as an input source for MapReduce tasks, you may use either basic or advanced input mode. The default is basic mode. The mode is controlled through the mapreduce.marklogic.input.mode property. The following sections describe the input modes briefly. For details, see the MarkLogic Connector for Hadoop Developer's Guide.
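
For illustration, a minimal programmatic configuration sketch might look like the following. The property names follow the mapreduce.marklogic.* pattern documented in MarkLogicConstants; the host, port, and credentials shown are placeholders.

    import org.apache.hadoop.conf.Configuration;

    public class ConnectorConfig {
        // Build an input-side configuration; all values are placeholders.
        public static Configuration inputConf() {
            Configuration conf = new Configuration();
            conf.set("mapreduce.marklogic.input.host", "localhost");
            conf.set("mapreduce.marklogic.input.port", "8000");
            conf.set("mapreduce.marklogic.input.username", "hadoop-user");
            conf.set("mapreduce.marklogic.input.password", "secret");
            // Basic mode is the default; set "advanced" to take control
            // of input split generation and the input query.
            conf.set("mapreduce.marklogic.input.mode", "basic");
            return conf;
        }
    }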

Configuring the Input Query with a Path Expression

In basic mode, you may supply components of an XQuery path expression which the connector uses to generate input data. You may not use this option along with a lexicon function class.

To allow MarkLogic Server to optimize the input query, the path expression is constructed from two components: a document node selector and a subdocument expression.

The input split is not configurable in basic mode. The splits are based on a rough count of the number of fragments in each forest. Use advanced input mode for more control over input split generation.

Conceptually, the input data for each task is constructed from a path expression similar to:


$document-selector/$subdocument-expression

Both components of the input path expression are optional. If no document selector is given, fn:collection() is used. If no subdocument expression is given, the document nodes returned by the document selector are used as the input values.

Examples:


document selector: none
subdocument expression: none
  => All document nodes in fn:collection()

document selector: fn:collection("wiki-topics")
subdocument expression: none
  => All document nodes in the "wiki-topics" collection

document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]
  => All wp:a elements with href attributes in the "wiki-topics"
     collection

document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]/@title
  => The title attributes of those same wp:a elements in the
     "wiki-topics" collection
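
A sketch of setting these properties for the third example above follows. The property names are assumed from MarkLogicConstants, and the wp namespace URI shown is a placeholder that must match your content.

    import org.apache.hadoop.conf.Configuration;

    public class PathExpressionConfig {
        // Set the two components of the basic-mode input path expression.
        public static void configure(Configuration conf) {
            conf.set("mapreduce.marklogic.input.documentselector",
                "fn:collection(\"wiki-topics\")");
            conf.set("mapreduce.marklogic.input.subdocumentexpr",
                "//wp:a[@href]");
            // Bind the wp prefix used above; the URI is a placeholder.
            conf.set("mapreduce.marklogic.input.namespace",
                "wp, http://www.mediawiki.org/xml/export-0.4/");
        }
    }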

Configuring the Input Query with a Lexicon Function

In basic mode, you may gather input data using a MarkLogic Server lexicon function. This option may not be combined with the XPath-based configuration properties described above; if both are configured for a job, the lexicon function takes precedence.

To use a lexicon function for input, implement a subclass of one of the lexicon function wrapper classes in com.marklogic.mapreduce.functions. For example, to use cts:element-values, implement a subclass of ElementValues. Override the methods corresponding to the function parameter values you want to include in the call.

For details, see "Using a Lexicon to Generate Key-Value Pairs" in the MarkLogic Connector for Hadoop Developer's Guide.
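
The following sketch illustrates the subclassing pattern for cts:element-values. The override and registration names shown (getElementNames and the mapreduce.marklogic.input.lexiconfunctionclass property) are assumptions; consult the ElementValues Javadoc and MarkLogicConstants for the exact names.

    import org.apache.hadoop.conf.Configuration;

    import com.marklogic.mapreduce.functions.ElementValues;

    public class TitleValues extends ElementValues {
        // Assumed override: supply the element QNames, as XQuery
        // constructors, whose values cts:element-values should return.
        @Override
        public String[] getElementNames() {
            return new String[] { "xs:QName(\"title\")" };
        }

        // Register the wrapper with the job; the property name is
        // assumed from MarkLogicConstants.
        public static void register(Configuration conf) {
            conf.setClass("mapreduce.marklogic.input.lexiconfunctionclass",
                TitleValues.class, ElementValues.class);
        }
    }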

Configuring the Input Query in Advanced Mode

In advanced input mode, you must supply an input split query and an input query.

The split query is used to generate metadata for Hadoop's input splits. This query must return a sequence of triples, each of which includes a forest id, a record (fragment) count, and a list of host names. The count may be an estimate.

The input query is used to fetch the input data for each map task. This query must return data that matches the configured InputFormat subclass.
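
The sketch below shows one way to supply both queries programmatically. The split query is adapted from the pattern described in the Developer's Guide and should be treated as an assumption rather than a canonical query; the input query here simply returns documents.

    import org.apache.hadoop.conf.Configuration;

    public class AdvancedModeConfig {
        // Configure advanced input mode with a split query and an
        // input query.
        public static void configure(Configuration conf) {
            conf.set("mapreduce.marklogic.input.mode", "advanced");
            // Return (forest-id, fragment-count, host-name) per forest;
            // the count may be an estimate. Illustrative only.
            conf.set("mapreduce.marklogic.input.splitquery",
                "xquery version '1.0-ml';\n" +
                "import module namespace admin = 'http://marklogic.com/xdmp/admin'\n" +
                "    at '/MarkLogic/admin.xqy';\n" +
                "let $conf := admin:get-configuration()\n" +
                "for $forest in xdmp:database-forests(xdmp:database())\n" +
                "let $host := admin:host-get-name($conf,\n" +
                "    admin:forest-get-host($conf, $forest))\n" +
                "let $count := xdmp:estimate(\n" +
                "    cts:search(fn:doc(), cts:and-query(()), (), 0.0, $forest))\n" +
                "return ($forest, $count, $host)");
            // Fetch the input data; the result type must match the
            // configured InputFormat subclass (document nodes here).
            conf.set("mapreduce.marklogic.input.query",
                "xquery version '1.0-ml'; fn:collection()");
        }
    }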


Copyright © 2020 MarkLogic Corporation. All Rights Reserved.
Complete online documentation for MarkLogic Server, XQuery and related components may be found at developer.marklogic.com