Package | Description |
---|---|
com.marklogic.dom | This package provides W3C DOM based APIs to the internal on-disk representation of documents and their contents in the expanded tree cache of a MarkLogic database forest. |
com.marklogic.io | |
com.marklogic.mapreduce | MarkLogic Connector for Hadoop core interfaces. |
com.marklogic.mapreduce.examples | Examples of using MarkLogic Server in MapReduce jobs. |
com.marklogic.mapreduce.functions | Interfaces for using a MarkLogic Server lexicon as an input source. |
com.marklogic.mapreduce.test | |
com.marklogic.mapreduce.utilities | |
com.marklogic.tree | |
This bundle provides an API for a MarkLogic Server content connector for Apache Hadoop MapReduce. This overview covers the connector API, the input and output formats, connector configuration, and the basic and advanced input modes.
For detailed information, see the MarkLogic Connector for Hadoop Developer's Guide.
The MarkLogic Connector for Hadoop API allows you to use MarkLogic Server as an input source for Hadoop MapReduce jobs, as an output destination, or as both.
The following classes are provided for defining MarkLogic-specific key and value types for your MapReduce key-value pairs:
NodePath for keys
DocumentURI for keys
MarkLogicNode for values
You may also use Apache Hadoop MapReduce types such as Text in certain circumstances. See ValueInputFormat and KeyValueInputFormat.
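For example, a mapper might consume MarkLogic documents keyed by URI and emit standard Hadoop types. The sketch below is illustrative only; the class name, output types, and map logic are assumptions, while DocumentURI and MarkLogicNode are the connector types named above.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.marklogic.mapreduce.DocumentURI;
import com.marklogic.mapreduce.MarkLogicNode;

// Hypothetical mapper: input keys are DocumentURI, input values are MarkLogicNode;
// output keys and values use standard Hadoop writable types.
public class UriNodeMapper
        extends Mapper<DocumentURI, MarkLogicNode, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(DocumentURI uri, MarkLogicNode node, Context context)
            throws IOException, InterruptedException {
        // Emit one count per input document; real logic would inspect the node.
        context.write(new Text(uri.toString()), ONE);
    }
}
```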
You may generate input data using MarkLogic Server lexicon functions by subclassing one of the lexicon function wrapper classes in com.marklogic.mapreduce.functions. Use lexicon functions with ValueInputFormat and KeyValueInputFormat.
The following classes are provided for defining MarkLogic-specific MapReduce input and output formats. Input and output formats need not be the same type.
DocumentInputFormat
NodeInputFormat
ValueInputFormat
KeyValueInputFormat
ContentOutputFormat
NodeOutputFormat
PropertyOutputFormat
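As a rough sketch of how the formats are wired into a job (the driver class and job name are placeholders, and DocumentInputFormat with ContentOutputFormat is just one possible pairing):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import com.marklogic.mapreduce.ContentOutputFormat;
import com.marklogic.mapreduce.DocumentInputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "marklogic-example");
        job.setJarByClass(ExampleDriver.class);

        // Read documents from MarkLogic Server and write content back to it.
        // The input and output formats need not be the same type.
        job.setInputFormatClass(DocumentInputFormat.class);
        job.setOutputFormatClass(ContentOutputFormat.class);

        // Mapper, reducer, and key/value classes omitted; see
        // com.marklogic.mapreduce.examples for complete jobs.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```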
Configure the connector using the standard Hadoop configuration mechanism. That is, use a Hadoop configuration file to define property values, or set properties programmatically on your job's Configuration object.
The configuration properties available for the connector are described in MarkLogicConstants.
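For example, properties might be set programmatically as in the sketch below. Only mapreduce.marklogic.input.mode is named in this overview; the connection property names and values shown are assumptions and should be checked against MarkLogicConstants.

```java
import org.apache.hadoop.conf.Configuration;

public class ConnectorConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();

        // Input mode: "basic" (the default) or "advanced".
        conf.set("mapreduce.marklogic.input.mode", "basic");

        // Connection settings -- assumed property names and values,
        // shown only to illustrate the mechanism.
        conf.set("mapreduce.marklogic.input.host", "localhost");
        conf.set("mapreduce.marklogic.input.port", "8006");
        conf.set("mapreduce.marklogic.input.username", "example-user");
        conf.set("mapreduce.marklogic.input.password", "example-password");

        return conf;
    }
}
```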
When using MarkLogic Server as an input source for MapReduce tasks, you may use either basic or advanced input mode. The default is basic mode. The mode is controlled through the mapreduce.marklogic.input.mode property. The following sections describe the input modes briefly. For details, see the MarkLogic Connector for Hadoop Developer's Guide.
In basic mode, you may supply components of an XQuery path expression which the connector uses to generate input data. You may not use this option along with a lexicon function class.
To allow MarkLogic Server to optimize the input query, the path expression is constructed from two components: a document node selector and a sub-document expression.
The input split is not configurable in basic mode. The splits are based on a rough count of the number of fragments in each forest. Use advanced input mode for more control over input split generation.
Conceptually, the input data for each task is constructed from a path expression similar to:
$document-selector/$subdocument-expression
Both components of the input path expression are optional. If no document selector is given, fn:collection() is used. If no subdocument expression is given, the document nodes returned by the document selector are used as the input values.
Examples:

document selector: none
subdocument expression: none
=> All document nodes in fn:collection()

document selector: fn:collection("wiki-topics")
subdocument expression: none
=> All document nodes in the "wiki-topics" collection

document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]
=> All elements in the "wiki-topics" collection containing hrefs

document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]/@title
=> The title attributes of all href-containing elements in the "wiki-topics" collection
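The last example above might be expressed in job configuration roughly as follows. This is a sketch; the property names for the document selector and subdocument expression are assumptions to be confirmed against MarkLogicConstants, and the wp namespace prefix must be bound for the connector separately.

```java
import org.apache.hadoop.conf.Configuration;

public class BasicModeInput {
    public static void setPathComponents(Configuration conf) {
        // Document node selector (assumed property name).
        conf.set("mapreduce.marklogic.input.documentselector",
                 "fn:collection(\"wiki-topics\")");

        // Sub-document expression (assumed property name). The wp prefix used
        // here must also be declared to the connector; see MarkLogicConstants.
        conf.set("mapreduce.marklogic.input.subdocumentexpr",
                 "//wp:a[@href]/@title");
    }
}
```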
In basic mode, you may gather input data using a MarkLogic Server lexicon function. This option may not be used with the XPath-based configuration properties described above. If both are configured for a job, the lexicon function takes precedence.
To use a lexicon function for input, implement a subclass of one of the lexicon function wrapper classes in com.marklogic.mapreduce.functions. For example, to use cts:element-values, implement a subclass of ElementValues. Override the methods corresponding to the function parameter values you want to include in the call.
For details, see "Using a Lexicon to Generate Key-Value Pairs" in the MarkLogic Connector for Hadoop Developer's Guide.
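A minimal sketch of such a subclass is shown below. It assumes the ElementValues wrapper exposes an override point for the element QNames passed to cts:element-values; the method name and return format shown here are assumptions to be verified against the com.marklogic.mapreduce.functions documentation.

```java
import com.marklogic.mapreduce.functions.ElementValues;

// Hypothetical wrapper for cts:element-values over wp:a elements.
// The override shown (getElementNames) is assumed; the wp namespace
// prefix would also need to be declared for the connector.
public class HrefElementValues extends ElementValues {
    @Override
    public String[] getElementNames() {
        // Element QNames, expressed as XQuery QName constructors.
        return new String[] { "xs:QName(\"wp:a\")" };
    }
}
```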
In advanced input mode, you must supply an input split query and an input query.
The split query is used to generate meta-data for Hadoop's input splits. This query must return a sequence of triples, each of which includes a forest id, record (fragment) count, and list of host names. The count may be an estimate.
The input query is used to fetch the input data for each map task. This query must return data that matches the configured InputFormat subclass.
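In configuration terms this might look like the following sketch. The property names for the split query and input query are assumptions to be checked against MarkLogicConstants, and the query strings are placeholders rather than working queries.

```java
import org.apache.hadoop.conf.Configuration;

public class AdvancedModeInput {
    public static void setQueries(Configuration conf) {
        conf.set("mapreduce.marklogic.input.mode", "advanced");

        // Split query (assumed property name): must return a sequence of
        // triples of forest id, record (fragment) count, and host names.
        conf.set("mapreduce.marklogic.input.splitquery",
                 "xquery version '1.0-ml'; (: placeholder split query :) ()");

        // Input query (assumed property name): must return data matching the
        // configured InputFormat subclass.
        conf.set("mapreduce.marklogic.input.query",
                 "xquery version '1.0-ml'; (: placeholder input query :) ()");
    }
}
```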