public class ForestInputFormat<VALUE> extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE> implements MarkLogicConstants

Type Parameters:
VALUE - Only ForestDocument is currently supported, but types such as Text or BytesWritable are possible candidates to be added.
FileInputFormat subclass for reading documents from a forest using Direct Access. Direct Access is intended primarily for extracting documents in offline or read-only forests, such as forests containing archived data that are part of a Tiered Storage data management strategy.

This format produces key-value pairs where the key is a DocumentURI and the value is a ForestDocument. The type of ForestDocument depends on the underlying document content type: DOMDocument for XML or text, or BinaryDocument for binaries. Binary documents can be further specialized to RegularBinaryDocument or LargeBinaryDocument, depending on size and the database configuration.
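The description above can be sketched as a minimal job driver. This is an illustrative example, not part of the class documentation: it assumes the MarkLogic Connector for Hadoop jars are on the classpath, and that the input path (a forest data directory) and output path are supplied as arguments. The mapper simply emits each document URI.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.marklogic.mapreduce.DocumentURIWithSourceInfo;
import com.marklogic.mapreduce.ForestDocument;
import com.marklogic.mapreduce.ForestInputFormat;

public class ForestReaderDriver {

    // Map-only job: for each document read via Direct Access, emit its URI.
    // The ForestDocument value carries the document content and could be
    // inspected or written out instead.
    public static class UriMapper
            extends Mapper<DocumentURIWithSourceInfo, ForestDocument, Text, NullWritable> {
        @Override
        protected void map(DocumentURIWithSourceInfo key, ForestDocument value,
                Context context) throws IOException, InterruptedException {
            context.write(new Text(key.getUri()), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "forest direct access reader");
        job.setJarByClass(ForestReaderDriver.class);

        // Read directly from forest data directories, not via a MarkLogic host.
        job.setInputFormatClass(ForestInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setMapperClass(UriMapper.class);
        job.setNumReduceTasks(0); // map-only
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because this driver requires a Hadoop runtime and forest data on the file system, it is a configuration sketch rather than a standalone runnable program.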
Field Summary

Modifier and Type | Field and Description
---|---
static org.apache.commons.logging.Log | LOG
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
Fields inherited from interface MarkLogicConstants:
ADVANCED_MODE, ASSIGNMENT_POLICY, BASIC_MODE, BATCH_SIZE, BIND_SPLIT_RANGE, COLLECTION_FILTER, CONTENT_TYPE, COPY_COLLECTIONS, COPY_METADATA, COPY_QUALITY, DEFAULT_BATCH_SIZE, DEFAULT_CONTENT_TYPE, DEFAULT_LOCAL_MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE, DEFAULT_OUTPUT_CONTENT_ENCODING, DEFAULT_OUTPUT_XML_REPAIR_LEVEL, DEFAULT_PROPERTY_OPERATION_TYPE, DEFAULT_TXN_SIZE, DIRECTORY_FILTER, DOCUMENT_SELECTOR, EXECUTION_MODE, EXTRACT_URI, INDENTED, INPUT_DATABASE_NAME, INPUT_HOST, INPUT_KEY_CLASS, INPUT_LEXICON_FUNCTION_CLASS, INPUT_MODE, INPUT_PASSWORD, INPUT_PORT, INPUT_QUERY, INPUT_QUERY_LANGUAGE, INPUT_QUERY_TIMESTAMP, INPUT_RESTRICT_HOSTS, INPUT_SSL_OPTIONS_CLASS, INPUT_SSL_PROTOCOL, INPUT_USE_SSL, INPUT_USERNAME, INPUT_VALUE_CLASS, MAX_SPLIT_SIZE, MIN_NODEUPDATE_VERSION, MODE_DISTRIBUTED, MODE_LOCAL, MR_NAMESPACE, NODE_OPERATION_TYPE, OUTPUT_CLEAN_DIR, OUTPUT_COLLECTION, OUTPUT_CONTENT_ENCODING, OUTPUT_CONTENT_LANGUAGE, OUTPUT_CONTENT_NAMESPACE, OUTPUT_DATABASE_NAME, OUTPUT_DIRECTORY, OUTPUT_FAST_LOAD, OUTPUT_FOREST_HOST, OUTPUT_GRAPH, OUTPUT_HOST, OUTPUT_KEY_TYPE, OUTPUT_KEY_VARNAME, OUTPUT_NAMESPACE, OUTPUT_OVERRIDE_GRAPH, OUTPUT_PARTITION, OUTPUT_PASSWORD, OUTPUT_PERMISSION, OUTPUT_PORT, OUTPUT_PROPERTY_ALWAYS_CREATE, OUTPUT_QUALITY, OUTPUT_QUERY, OUTPUT_QUERY_LANGUAGE, OUTPUT_RESTRICT_HOSTS, OUTPUT_SSL_OPTIONS_CLASS, OUTPUT_SSL_PROTOCOL, OUTPUT_STREAMING, OUTPUT_URI_PREFIX, OUTPUT_URI_REPLACE, OUTPUT_URI_SUFFIX, OUTPUT_USE_SSL, OUTPUT_USERNAME, OUTPUT_VALUE_TYPE, OUTPUT_VALUE_VARNAME, OUTPUT_XML_REPAIR_LEVEL, PATH_NAMESPACE, PROPERTY_OPERATION_TYPE, QUERY_FILTER, RECORD_TO_FRAGMENT_RATIO, REDACTION_RULE_COLLECTION, SPLIT_END_VARNAME, SPLIT_QUERY, SPLIT_START_VARNAME, SUBDOCUMENT_EXPRESSION, TEMPORAL_COLLECTION, TXN_SIZE, TYPE_FILTER
Constructor Summary

Constructor and Description
---|
ForestInputFormat() |
Method Summary

Modifier and Type | Method and Description
---|---
org.apache.hadoop.mapreduce.RecordReader<DocumentURIWithSourceInfo,VALUE> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext job)
protected List<org.apache.hadoop.fs.FileStatus> | listStatus(org.apache.hadoop.mapreduce.JobContext job)
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Method Detail

public org.apache.hadoop.mapreduce.RecordReader<DocumentURIWithSourceInfo,VALUE> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException, InterruptedException

Specified by: createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException, InterruptedException
protected List<org.apache.hadoop.fs.FileStatus> listStatus(org.apache.hadoop.mapreduce.JobContext job) throws IOException

Overrides: listStatus in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job) throws IOException

Overrides: getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException
Copyright © 2020 MarkLogic Corporation. All Rights Reserved.
Complete online documentation for MarkLogic Server, XQuery and related components may be found at developer.marklogic.com.