public class ForestInputFormat<VALUE> extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE> implements MarkLogicConstants

Type Parameters:
VALUE - Only ForestDocument is currently supported, but types such as Text or BytesWritable are possible candidates to be added.
FileInputFormat subclass for reading documents from a forest using Direct Access. Direct Access is intended primarily for extracting documents in offline or read-only forests, such as forests containing archived data that are part of a Tiered Storage data management strategy.

This format produces key-value pairs where the key is a DocumentURI and the value is a ForestDocument. The type of ForestDocument depends on the underlying document content type: DOMDocument for XML or text, or BinaryDocument for binaries. Binary documents can be further specialized to RegularBinaryDocument or LargeBinaryDocument, depending on size and the database configuration.
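The description above can be sketched as a minimal job driver. This is an illustrative example, not part of the class documentation: it assumes the MarkLogic Connector for Hadoop jars are on the classpath, and that the input path (a forest data directory) and output path are supplied as arguments. The mapper simply emits each document URI.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.marklogic.mapreduce.DocumentURIWithSourceInfo;
import com.marklogic.mapreduce.ForestDocument;
import com.marklogic.mapreduce.ForestInputFormat;

public class ForestReaderDriver {

    // Map-only job: for each document read via Direct Access, emit its URI.
    // The ForestDocument value carries the document content and could be
    // inspected or written out instead.
    public static class UriMapper
            extends Mapper<DocumentURIWithSourceInfo, ForestDocument, Text, NullWritable> {
        @Override
        protected void map(DocumentURIWithSourceInfo key, ForestDocument value,
                Context context) throws IOException, InterruptedException {
            context.write(new Text(key.getUri()), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "forest direct access reader");
        job.setJarByClass(ForestReaderDriver.class);

        // Read directly from forest data directories, not via a MarkLogic host.
        job.setInputFormatClass(ForestInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setMapperClass(UriMapper.class);
        job.setNumReduceTasks(0); // map-only
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because this driver requires a Hadoop runtime and forest data on the file system, it is a configuration sketch rather than a standalone runnable program.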
Field Summary

Modifier and Type | Field and Description
---|---
static org.apache.commons.logging.Log | LOG
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
Fields inherited from interface MarkLogicConstants:
ADVANCED_MODE, ASSIGNMENT_POLICY, BASIC_MODE, BATCH_SIZE, BIND_SPLIT_RANGE, COLLECTION_FILTER, CONTENT_TYPE, COPY_COLLECTIONS, COPY_METADATA, COPY_QUALITY, DEFAULT_BATCH_SIZE, DEFAULT_CONTENT_TYPE, DEFAULT_LOCAL_MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE, DEFAULT_OUTPUT_CONTENT_ENCODING, DEFAULT_OUTPUT_XML_REPAIR_LEVEL, DEFAULT_PROPERTY_OPERATION_TYPE, DEFAULT_TXN_SIZE, DIRECTORY_FILTER, DOCUMENT_SELECTOR, EXECUTION_MODE, EXTRACT_URI, INDENTED, INPUT_DATABASE_NAME, INPUT_HOST, INPUT_KEY_CLASS, INPUT_LEXICON_FUNCTION_CLASS, INPUT_MODE, INPUT_PASSWORD, INPUT_PORT, INPUT_QUERY, INPUT_QUERY_LANGUAGE, INPUT_QUERY_TIMESTAMP, INPUT_RESTRICT_HOSTS, INPUT_SSL_OPTIONS_CLASS, INPUT_SSL_PROTOCOL, INPUT_USE_SSL, INPUT_USERNAME, INPUT_VALUE_CLASS, MAX_SPLIT_SIZE, MIN_NODEUPDATE_VERSION, MODE_DISTRIBUTED, MODE_LOCAL, MR_NAMESPACE, NODE_OPERATION_TYPE, OUTPUT_CLEAN_DIR, OUTPUT_COLLECTION, OUTPUT_CONTENT_ENCODING, OUTPUT_CONTENT_LANGUAGE, OUTPUT_CONTENT_NAMESPACE, OUTPUT_DATABASE_NAME, OUTPUT_DIRECTORY, OUTPUT_FAST_LOAD, OUTPUT_FOREST_HOST, OUTPUT_GRAPH, OUTPUT_HOST, OUTPUT_KEY_TYPE, OUTPUT_KEY_VARNAME, OUTPUT_NAMESPACE, OUTPUT_OVERRIDE_GRAPH, OUTPUT_PARTITION, OUTPUT_PASSWORD, OUTPUT_PERMISSION, OUTPUT_PORT, OUTPUT_PROPERTY_ALWAYS_CREATE, OUTPUT_QUALITY, OUTPUT_QUERY, OUTPUT_QUERY_LANGUAGE, OUTPUT_RESTRICT_HOSTS, OUTPUT_SSL_OPTIONS_CLASS, OUTPUT_SSL_PROTOCOL, OUTPUT_STREAMING, OUTPUT_URI_PREFIX, OUTPUT_URI_REPLACE, OUTPUT_URI_SUFFIX, OUTPUT_USE_SSL, OUTPUT_USERNAME, OUTPUT_VALUE_TYPE, OUTPUT_VALUE_VARNAME, OUTPUT_XML_REPAIR_LEVEL, PATH_NAMESPACE, PROPERTY_OPERATION_TYPE, QUERY_FILTER, RECORD_TO_FRAGMENT_RATIO, REDACTION_RULE_COLLECTION, SPLIT_END_VARNAME, SPLIT_QUERY, SPLIT_START_VARNAME, SUBDOCUMENT_EXPRESSION, TEMPORAL_COLLECTION, TXN_SIZE, TYPE_FILTER
Constructor Summary

Constructor and Description
---|
ForestInputFormat() |
Method Summary

Modifier and Type | Method and Description
---|---
org.apache.hadoop.mapreduce.RecordReader<DocumentURIWithSourceInfo,VALUE> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext job)
protected List<org.apache.hadoop.fs.FileStatus> | listStatus(org.apache.hadoop.mapreduce.JobContext job)
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Method Detail

public org.apache.hadoop.mapreduce.RecordReader<DocumentURIWithSourceInfo,VALUE> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException, InterruptedException

Specified by: createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException, InterruptedException
protected List<org.apache.hadoop.fs.FileStatus> listStatus(org.apache.hadoop.mapreduce.JobContext job) throws IOException

Overrides: listStatus in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job) throws IOException

Overrides: getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException
Copyright © 2020 MarkLogic Corporation. All Rights Reserved.
Complete online documentation for MarkLogic Server, XQuery and related components may be found at developer.marklogic.com.