This chapter covers the following topics related to using the sample applications:
The following topics apply to preparing to run all the sample applications:
The LinkCount samples, such as LinkCountInDoc and LinkCountValue, require additional preparation. See Additional Sample Data Setup.
For details about the individual samples, see Sample Applications.
Install and configure MarkLogic Server, Hadoop MapReduce, and the MarkLogic Connector for Hadoop. For instructions, see Getting Started with the MarkLogic Connector for Hadoop.
The samples require at least one MarkLogic Server database and XDBC App Server. The examples in this chapter assume you're using the XDBC App Server on port 8000.
The LinkCount family of samples requires a specific database configuration and data set; see Additional Sample Data Setup. The other samples can be run against any XDBC App Server and database.
Getting Started with the MarkLogic Connector for Hadoop describes setting up a single-host configuration, where MarkLogic Server, Hadoop MapReduce, and the MarkLogic Connector for Hadoop are installed on the same host, and Hadoop MapReduce is configured for standalone operation. A multi-host configuration, with Hadoop MapReduce configured for pseudo-distributed or fully-distributed operation, more accurately represents a production deployment.
If you choose to use a multi-host, distributed configuration, be aware of the following:

- The LinkCount samples expect the same database for both input and output.
- Some of the samples use HDFS for input or output. If Hadoop is configured for pseudo-distributed or fully-distributed operation, HDFS must be initialized before running the samples.
To check whether HDFS is initialized, run the following command. It should complete without error. For example:
$ hdfs dfs -ls /
drwxr-xr-x   - marklogic\me mygroup          0 2011-07-19 10:48 /tmp
drwxr-xr-x   - marklogic\me mygroup          0 2011-07-19 10:51 /user
If the command fails, HDFS might not be initialized. See Initializing HDFS.
Before you begin, you should have the hadoop and java commands on your path. You should also set the environment variables covered in Configuring Your Environment to Use the Connector.
The sample applications include MapReduce configuration files containing MarkLogic Connector for Hadoop settings. To run the examples, you will have to modify these files. Therefore, you should copy the configuration files to a local directory of your choosing.
For example, to copy the configuration files to /space/examples/conf, use the following command:
cp $CONNECTOR_HOME/conf/*.xml /space/examples/conf
Place the directory containing your copy of the configuration files on HADOOP_CLASSPATH so that each sample job can find its configuration file. For example:
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/space/examples/conf
For each sample you plan to run, modify the MarkLogic Connector for Hadoop sample configuration file in your Hadoop configuration directory to match your MarkLogic Server configuration.
The configuration file associated with each sample is listed below.
Sample | Configuration File |
---|---|
HelloWorld | marklogic-hello-world.xml |
LinkCountInDoc | marklogic-nodein-nodeout.xml |
LinkCountInProperty | marklogic-textin-propout.xml |
LinkCountValue | marklogic-textin-textout.xml |
LinkCountCooccurrences | marklogic-lexicon.xml |
LinkCount | marklogic-advanced.xml |
RevisionGrouper | marklogic-nodein-qryout.xml |
BinaryReader | marklogic-subbinary.xml |
ContentReader | marklogic-docin-textout.xml |
ContentLoader | marklogic-textin-docout.xml |
ZipContentLoader | marklogic-textin-docout.xml |
The configuration properties requiring modification vary from sample to sample. For example, a sample that uses MarkLogic Server for input and HDFS for output does not include mapreduce.marklogic.output.* properties.
If the sample uses MarkLogic Server for input, modify at least the input connection properties. For details, see Identifying the Input MarkLogic Server Instance.

If the sample uses MarkLogic Server for output, modify at least the output connection properties. For details, see Identifying the Output MarkLogic Server Instance.
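For example, the connection-related input properties in a sample configuration file typically look like the following. The host, port, and credential values shown here are placeholders for your own settings; samples that write to MarkLogic Server have corresponding mapreduce.marklogic.output.host, port, username, and password properties:

<property>
  <name>mapreduce.marklogic.input.host</name>
  <value>localhost</value>
</property>
<property>
  <name>mapreduce.marklogic.input.port</name>
  <value>8000</value>
</property>
<property>
  <name>mapreduce.marklogic.input.username</name>
  <value>user</value>
</property>
<property>
  <name>mapreduce.marklogic.input.password</name>
  <value>password</value>
</property>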
Some samples might require additional customization. For details on a specific sample, see Sample Applications.
The following samples require a special database configuration and input data set. If you do not plan to run these samples, you can skip this section.
This section walks you through creating the MarkLogic Server environment required by these samples.
Use the following information to create a database named hadoop-samples with 2 forests and 2 attribute range indexes. You can use a different database name. Use the defaults for any configuration parameters not mentioned in this example.
For detailed instructions, see Creating and Configuring Forests and Databases and Defining Attribute Range Indexes in the Administrator's Guide.
You can skip this step if you use the pre-configured XDBC App Server on port 8000.
Use the following information to create an XDBC App Server and attach it to the hadoop-samples database created in the previous section. You can use a different name and port.
For detailed instructions, see Creating and Configuring App Servers in the Administrator's Guide.
Configuration Parameter | Setting |
---|---|
xdbc server name | hadoop-samples-xdbc |
root | (any) |
port | 9002 |
database | hadoop-samples |
Load the data from $CONNECTOR_HOME/sample-data into the hadoop-samples database with a URI prefix of enwiki/. The instructions in this section use MarkLogic Content Pump (mlcp) to load the data, but you can choose a different method.
1. Ensure the mlcp.sh command is on your path. For example:

   export PATH=${PATH}:MLCP_INSTALL_DIR/bin

2. Run a command similar to the following, adjusting the -username, -password, -host, and -port options to match your environment:

   mlcp.sh import -host localhost -port 8000 -database hadoop-samples \
       -username user -password password -mode local \
       -input_file_path $CONNECTOR_HOME/sample-data/ -document_type xml \
       -output_uri_replace "$CONNECTOR_HOME/sample-data,'enwiki'"

3. Explore the hadoop-samples database (for example, in Query Console) and observe that it contains 93 documents, all with an enwiki/ URI prefix.

Some of the samples use HDFS for input or output. This section briefly summarizes how to copy data into and retrieve data from HDFS when using Hadoop in a pseudo-distributed or fully-distributed configuration.
If you use Hadoop MapReduce standalone, you can skip this section. Standalone Hadoop is the configuration created in Getting Started with the MarkLogic Connector for Hadoop. In a standalone configuration, HDFS uses the local file system directly. You do not need to initialize HDFS, and you can use normal Unix commands to work with the input and output files. You may still use HDFS commands to examine the file system.
This section covers the following topics related to pseudo-distributed and fully-distributed HDFS operation:
Use the following command to see all available HDFS commands, or consult the documentation for your Hadoop distribution.
$ hdfs dfs -help
If you use Hadoop MapReduce in pseudo-distributed or fully-distributed mode, HDFS must be formatted before you can run the samples. If your HDFS installation is not already initialized, consult the documentation for your Hadoop distribution for instructions.
For example, with Apache Hadoop, you can run the following command to initialize HDFS:
$ hdfs namenode -format
Near the end of the output, you should see a message that HDFS has been successfully formatted. For example:
...
************************************************************/
11/10/03 09:35:14 INFO namenode.FSNamesystem:...
11/10/03 09:35:14 INFO namenode.FSNamesystem: supergroup=supergroup
11/10/03 09:35:14 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/10/03 09:35:14 INFO common.Storage: Image file of size 98 saved ...
11/10/03 09:35:14 INFO common.Storage: Storage directory /tmp/hadoop-sample/dfs/name has been successfully formatted.
11/10/03 09:35:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at sample.marklogic.com...
************************************************************/
If formatting succeeds, you can use the ls command to examine HDFS. For example:
$ hdfs dfs -ls /
Some of the sample applications store results to HDFS. You can browse HDFS and examine results from the command line or through your web browser.
In pseudo-distributed or fully-distributed configurations, HDFS output pathnames given on the command line of an example are relative to /user/your_username in HDFS by default. For example, if you run the LinkCountValue example, which saves results to HDFS, and specify the output directory as linkcountvalue, then the results are in HDFS under /user/your_username/linkcountvalue.
To access HDFS through your web browser, use the HDFS NameNode administration page. By default, this interface is available on port 50070 on the NameNode host; consult the documentation for your Hadoop distribution. Assuming localhost is the NameNode, browse to this URL and click on the Browse the file system link near the top of the page to browse HDFS:
http://localhost:50070
To browse HDFS from the command line, use a command similar to the following:
$ hdfs dfs -ls /user/your_username
For example, if you run the LinkCountValue example and specify the output directory as linkcountvalue, you see results similar to the following after the job completes:
$ hdfs dfs -ls /user/me/linkcountvalue
drwxr-xr-x   - me mygroup ... /user/me/linkcountvalue/_logs
-rw-r--r--   1 me mygroup ... /user/me/linkcountvalue/part-r-00000
The results are in the part-r-XXXXX file. To see the last few lines of the results, use a command similar to the following:
$ hdfs dfs -tail /user/me/linkcountvalue/part-r-00000
To copy the result from HDFS to your system's file system, use a command similar to the following:
$ hdfs dfs -get /user/me/linkcountvalue/part-r-00000 \
    /my/destination/linkcountvalue.txt
Some of the samples use HDFS for input. These samples require you to copy the input data to HDFS before running the sample. Place the input files under /user/your_username in HDFS using a command such as the following:
$ hdfs dfs -put ./mycontent.zip /user/me/zipcontentloader
Relative pathnames are relative to /user/your_username, so to check the file copied into HDFS above, use a command similar to the following:
$ hdfs dfs -ls /user/me/zipcontentloader
-rw-r--r--   1 me mygroup ... /user/me/zipcontentloader/mycontent.zip
When you copy files into HDFS, there must not be a pre-existing file of the same name.
This section contains detailed instructions for running each of the samples summarized in the table below.
The MarkLogic Connector for Hadoop distribution includes the following resources related to the samples:

- Sample source code, under $CONNECTOR_HOME/src.
- Compiled sample code, in $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar.
- Sample class documentation, in the package com.marklogic.mapreduce.examples in the Javadoc under $CONNECTOR_HOME/docs.
.Sample | Input | Output | Description |
---|---|---|---|
HelloWorld | MarkLogic Server | MarkLogic Server | Reads the first word from text in input XML documents, concatenates the words, then stores the results as a new text document in MarkLogic Server. |
LinkCountInDoc | MarkLogic Server | MarkLogic Server | Counts href link title attributes in documents in MarkLogic Server, then stores the count as a child node of the referenced document. |
LinkCountInProperty | MarkLogic Server | MarkLogic Server | Counts href link title attributes in documents in MarkLogic Server, then stores the count as a property of the referenced document. |
LinkCountValue | MarkLogic Server | HDFS | Counts href link title attributes in documents in MarkLogic Server, then stores the counts in HDFS text files. |
LinkCountCooccurrences | MarkLogic Server | HDFS | Counts href link title attributes in documents in MarkLogic Server using a lexicon function, then stores the counts in HDFS text files. |
LinkCount | MarkLogic Server | HDFS | Equivalent to LinkCountValue , but demonstrates using advanced input mode to provide your own input split and input queries. |
RevisionGrouper | MarkLogic Server | MarkLogic Server | Demonstrates the use of a custom output query, using KeyValueOutputFormat . |
BinaryReader | MarkLogic Server | HDFS | Demonstrates using advanced input mode with an input query optimized using the split range. |
ContentReader | MarkLogic Server | HDFS | Reads documents in a MarkLogic Server database, using an SSL-enabled connection, then writes the contents to HDFS text files. |
ContentLoader | HDFS | MarkLogic Server | Reads text files in HDFS, then stores the contents as documents in a MarkLogic Server database. |
ZipContentLoader | HDFS | MarkLogic Server | Reads text files from zip files in HDFS, then stores the contents as documents in a MarkLogic Server database. |
This example extracts the first word from all the XML documents in a MarkLogic Server database containing text nodes, sorts the words, concatenates them into a single string, and saves the result as a text document in MarkLogic Server. The example uses basic input mode with the default document selector and subdocument expression. The example uses MarkLogic Server for both input and output.
For detailed instructions on configuring and running this sample, see Running the HelloWorld Sample Application.
Though you can use the sample with any input documents, it is intended to be used with a small data set. It is not optimized for efficient resource use across large data sets. Only XML documents with text nodes contribute to the final results.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-hello-world.xml
Use the following command to run the example job, with suitable substitutions for $CONNECTOR_HOME and the connector version:
hadoop jar \
    $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
    com.marklogic.mapreduce.examples.HelloWorld -libjars $LIBJARS marklogic-hello-world.xml
To view the results, use Query Console to explore the output database. The sample creates HelloWorld.txt. If you use the input data from Configuring the Job, HelloWorld.txt should contain the phrase hello world.
This example calculates reference counts for each document in a set of Wikipedia-based documents, and then stores the reference count for each document as a new <ref-count> child node of the document. The example uses MarkLogic Server for input and output.
Before running the sample, follow the instructions in Additional Sample Data Setup.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-nodein-nodeout.xml
Use the following command to run the example job, with a suitable substitution for $CONNECTOR_HOME and the connector version:
hadoop jar \
    $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
    com.marklogic.mapreduce.examples.LinkCountInDoc \
    -libjars $LIBJARS marklogic-nodein-nodeout.xml
The intra-collection reference counts are stored as new <ref-count> elements under the root of each document. Run the following XQuery in Query Console against the hadoop-samples database to see a list of documents and their ref-count values:
xquery version "1.0-ml";
for $ref in //ref-count
return fn:concat(xdmp:node-uri($ref)," ",$ref/text())
You should see results similar to the following:
enwiki/Ayn Rand 1
enwiki/List of characters in Atlas Shrugged 4
enwiki/Academy Award for Best Art Direction 1
enwiki/Academy Award 2
enwiki/Aristotle 5
This example calculates reference counts for each document in a set of Wikipedia-based documents, and then stores the reference count for each document as a property of the document. The example uses MarkLogic Server for input and output.
Before running the sample, follow the instructions in Additional Sample Data Setup.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-textin-propout.xml
Use the following command to run the example job, with a suitable substitution for $CONNECTOR_HOME and the connector version:
hadoop jar \
    $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
    com.marklogic.mapreduce.examples.LinkCountInProperty \
    -libjars $LIBJARS marklogic-textin-propout.xml
The intra-collection reference counts are stored as new <ref-count> property elements of each document. Run the following query in Query Console against your XDBC App Server to see a list of documents and their reference counts:
xquery version "1.0-ml";
for $ref in xdmp:document-properties()//ref-count
return fn:concat(xdmp:node-uri($ref)," ",$ref/text())
You should see results similar to the following:
enwiki/Ayn Rand 1
enwiki/List of characters in Atlas Shrugged 4
enwiki/Academy Award for Best Art Direction 1
enwiki/Academy Award 2
enwiki/Aristotle 5
This example calculates reference counts for each document in a set of Wikipedia-based documents, and then stores the reference counts in HDFS. The example uses MarkLogic Server for input and HDFS for output.
Before running the sample, follow the instructions in Additional Sample Data Setup.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-textin-textout.xml
Use the following command to run the example job, with suitable substitutions for $CONNECTOR_HOME, the HDFS_OUTPUT_DIR, and the connector version. The HDFS_OUTPUT_DIR must not already exist.
hadoop jar \
    $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
    com.marklogic.mapreduce.examples.LinkCountValue \
    -libjars $LIBJARS marklogic-textin-textout.xml HDFS_OUTPUT_DIR
To view the results, examine the results in HDFS_OUTPUT_DIR, as described in Interacting with HDFS. For example, if you use /home/you/lcv for HDFS_OUTPUT_DIR:
$ hdfs dfs -ls /home/you/lcv
... part-r-00000
$ hdfs dfs -cat /home/you/lcv/part-r-00000 | grep ^Aristotle
Aristotle    5
Each topic title is followed by a reference count. The raw output differs from the results for the LinkCountInDoc and LinkCountInProperty examples because LinkCountValue generates counts for all references, rather than only for documents in the database.
This example calculates reference counts for each document in a set of Wikipedia-based documents, and then stores the reference counts in HDFS. The example uses MarkLogic Server for input and HDFS for output. This example is the same as the LinkCountValue example, but it uses advanced input mode instead of basic input mode.
Before running the sample, follow the instructions in Additional Sample Data Setup.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-advanced.xml
Use the following command to run the example job, with suitable substitutions for $CONNECTOR_HOME, the HDFS_OUTPUT_DIR, and the connector version:
hadoop jar \
    $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
    com.marklogic.mapreduce.examples.LinkCount \
    -libjars $LIBJARS marklogic-advanced.xml HDFS_OUTPUT_DIR
To view the results, examine the results in HDFS_OUTPUT_DIR, as described in Interacting with HDFS. For example, if you use /home/you/lc for HDFS_OUTPUT_DIR:
$ hdfs dfs -ls /home/you/lc
... part-r-00000
$ hdfs dfs -cat /home/you/lc/part-r-00000 | grep ^Aristotle
Aristotle    5
Each topic title is followed by a reference count. The raw output differs from the results for the LinkCountInDoc and LinkCountInProperty examples because LinkCount generates counts for all references, rather than only for documents in the database.
For details on advanced input mode, see Advanced Input Mode.
This example calculates reference counts for each document in a set of Wikipedia-based documents by using an element attribute lexicon, and then stores the reference counts in HDFS. The example uses MarkLogic Server for input and HDFS for output.
The sample uses com.marklogic.mapreduce.functions.ElemAttrValueCooccurrences (a wrapper around cts:element-attribute-value-co-occurrences) to find all href attributes that occur along with title attributes inside anchor tags. The attribute range indexes created in Additional Sample Data Setup support this operation. The map input key-value pairs are (Text, Text) pairs where the key is the href and the value is the title.
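To get a feel for the lexicon data the job consumes, you can run a query along the following lines in Query Console. This is only an illustrative sketch of the underlying lexicon call; the anchor element and attribute QNames are assumptions, so adjust them to match the sample data:

xquery version "1.0-ml";
(: illustrative sketch only; adjust the QNames to match the sample data :)
cts:element-attribute-value-co-occurrences(
  xs:QName("a"), xs:QName("href"),
  xs:QName("a"), xs:QName("title"))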
Before running the sample, follow the instructions in Additional Sample Data Setup.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-lexicon.xml
Use the following command to run the example job, with suitable substitutions for $CONNECTOR_HOME, HDFS_OUTPUT_DIR, and the connector version:
hadoop jar \
    $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
    com.marklogic.mapreduce.examples.LinkCountCooccurrences \
    -libjars $LIBJARS marklogic-lexicon.xml HDFS_OUTPUT_DIR
To view the results, examine the results in HDFS_OUTPUT_DIR, as described in Interacting with HDFS. For example, if you use /home/you/lcco for HDFS_OUTPUT_DIR:
$ hdfs dfs -ls /home/you/lcco
... part-r-00000
$ hdfs dfs -cat /home/you/lcco/part-r-00000 | grep ^Aristotle
Aristotle    5
Each topic title is followed by a reference count. The raw output differs from the results for the LinkCountInDoc and LinkCountInProperty examples because LinkCountCooccurrences generates counts for all references, rather than only for documents in the database.
This sample application demonstrates using KeyValueOutputFormat and a custom output query. The sample places each document in a set of Wikipedia articles into a collection based on the year the article was last revised. The sample uses MarkLogic Server for input and output. The job has no reduce phase.
Before running the sample, follow the instructions in Additional Sample Data Setup.
The map function receives input key-value pairs in which the value is a revision timestamp node matching the XPath expression fn:collection()//wp:revision/wp:timestamp, as specified by the expression in mapreduce.marklogic.input.subdocumentexpr. These nodes are of the form:
<timestamp>2007-09-28T08:07:26Z</timestamp>
The map function picks the year (2007) off the timestamp and generates output key-value pairs where the key is the document URI as a string and the value is the year as a string. Each pair is then passed to the output query defined in mapreduce.marklogic.output.query, which adds the document named in the key to a collection named after the year.
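A minimal sketch of what such an output query might look like appears below. It assumes the connector passes the map output key and value to the query as external string variables; the exact declarations the connector requires are in the shipped marklogic-nodein-qryout.xml:

xquery version "1.0-ml";
(: assumed external variable declarations; see marklogic-nodein-qryout.xml
   for the form the connector actually requires :)
declare variable $key as xs:string external;   (: document URI from the map output :)
declare variable $value as xs:string external; (: revision year from the map output :)

(: add the document named by the key to a collection named after the year :)
xdmp:document-add-collections($key, $value)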
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-nodein-qryout.xml
Use the following command to run the example job, with a suitable substitution for $CONNECTOR_HOME and the connector version:
hadoop jar \
    $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
    com.marklogic.mapreduce.examples.RevisionGrouper \
    -libjars $LIBJARS marklogic-nodein-qryout.xml
To view the results, use Query Console to explore the hadoop-samples database. You should see the documents are now in collections based on the year in which they were revised. Alternatively, run a query similar to the following to see a list of documents in the collection for the year 2009:
xquery version "1.0-ml";
for $d in fn:collection("2009")
return xdmp:node-uri($d)
This sample application demonstrates using the mapreduce.marklogic.input.bindsplitrange configuration property with advanced input mode. The sample extracts the first 1K bytes from each (binary) document in a database and saves the result in HDFS. The sample uses MarkLogic Server for input and HDFS for output. The sample has no reduce phase.
The input query defined in marklogic-subbinary.xml uses the splitstart and splitend external variables provided by the MarkLogic Connector for Hadoop to optimize input query performance. For details on this feature, see Optimizing Your Input Query.
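The general shape of such an input query is sketched below. The namespace and exact variable declarations shown here are assumptions, so refer to the shipped marklogic-subbinary.xml and Optimizing Your Input Query for the precise form:

xquery version "1.0-ml";
(: the namespace below is an assumption; see marklogic-subbinary.xml for the exact form :)
declare namespace mlmr = "http://marklogic.com/hadoop";
declare variable $mlmr:splitstart as xs:integer external;
declare variable $mlmr:splitend as xs:integer external;

(: return only the documents that fall within this input split :)
fn:subsequence(fn:collection(), $mlmr:splitstart,
  $mlmr:splitend - $mlmr:splitstart + 1)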
The sample requires a database containing one or more binary documents, and an XDBC App Server. Since the sample assumes all documents in the database are binary documents, you should not use the database and content set up in Additional Sample Data Setup.
Follow these steps to set up and run the BinaryReader sample application:

1. Create a database containing one or more binary documents.

2. Create an XDBC App Server and attach it to the database created in Step 1.

3. Edit marklogic-subbinary.xml to configure the job to use the App Server created in Step 2. For details, see Modify the Sample Configuration Files.

4. Run the sample using the following command, with suitable substitutions for HDFS_OUTPUT_DIR and the connector version:

   hadoop jar \
       $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
       com.marklogic.mapreduce.examples.BinaryReader \
       -libjars $LIBJARS marklogic-subbinary.xml HDFS_OUTPUT_DIR
If using standalone Hadoop, view the results in HDFS_OUTPUT_DIR. If using pseudo-distributed or fully-distributed Hadoop, view the results using the Hadoop hdfs command. For example:
$ hdfs dfs -ls HDFS_OUTPUT_DIR
This sample application writes documents in a MarkLogic Server database to the HDFS file system, using an SSL-enabled connection to MarkLogic Server. The input database is the database associated with your XDBC App Server. This sample uses MarkLogic Server for input and HDFS for output.
This sample copies the entire contents of the database to HDFS. Choose a target database accordingly, or modify the sample's configuration file to limit the selected documents.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-docin-textout.xml
Before running the sample, configure your XDBC App Server and Java environment to use SSL. For details, see Making a Secure Connection to MarkLogic Server with SSL. You might need to import the MarkLogic Server self-signed certificate into your JRE default keystore using the Java keytool utility. See the Java documentation for details on adding certificates to the default keystore.
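For example, assuming you have saved the server certificate locally as marklogic.crt (a hypothetical file name), a command along these lines imports it into the default keystore of a typical JRE; the keystore path and password vary by Java installation, with changeit being the common default:

$ keytool -importcert -trustcacerts -alias marklogic \
    -file marklogic.crt \
    -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit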
Run the sample as follows, substituting appropriate values for $CONNECTOR_HOME, HDFS_OUTPUT_DIR, and the connector version. The output directory must not already exist.
hadoop jar \
    $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
    com.marklogic.mapreduce.examples.ContentReader \
    -libjars $LIBJARS marklogic-docin-textout.xml HDFS_OUTPUT_DIR
If using standalone Hadoop, view the results in ~/HDFS_OUTPUT_DIR. If using pseudo-distributed or fully-distributed Hadoop, view the results using the Hadoop hdfs command. For example:
$ hdfs dfs -ls HDFS_OUTPUT_DIR
$ hdfs dfs -tail HDFS_OUTPUT_DIR/part-m-00000
This sample application loads files in an HDFS directory into a MarkLogic Server database as documents. The destination database is the database associated with your XDBC App Server. The sample uses HDFS for input and MarkLogic Server for output. This sample is a map-only job. That is, there is no reduce step.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-textin-docout.xml
Follow these steps to run the ContentLoader sample application:

1. Create an input directory. With standalone Hadoop, create it in the local file system; with pseudo-distributed or fully-distributed Hadoop, create it in HDFS. For example:

   $ mkdir ~/input_dir
   $ hdfs dfs -mkdir input_dir

2. Copy your input files into the input directory. For example:

   $ cp your_input_files ~/input_dir            # standalone
   $ hdfs dfs -put your_input_files input_dir   # distributed

3. If necessary, edit marklogic-textin-docout.xml to set the output content type. For example, to load binary files, add:

   <property>
     <name>mapreduce.marklogic.output.content.type</name>
     <value>binary</value>
   </property>

4. Run the sample using the following command, with suitable substitutions for $CONNECTOR_HOME, input_dir, and the connector version:

   hadoop jar \
       $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
       com.marklogic.mapreduce.examples.ContentLoader \
       -libjars $LIBJARS marklogic-textin-docout.xml input_dir
The document URIs in the database correspond to the HDFS path. For example, if one of the input documents is located in HDFS on samples.marklogic.com as /user/guest/data/file1.xml, then the document URI in the database is:
hdfs://samples.marklogic.com/user/guest/data/file1.xml
If you receive an error similar to the following, then you must change the directory creation database configuration setting to manual.
java.lang.IllegalStateException: Manual directory creation mode is required.
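You can make this change in the Admin Interface, or with a query along the following lines in Query Console. This is a sketch using the Admin API; the database name is a placeholder for the database attached to your XDBC App Server:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: set directory creation to "manual" for the output database :)
let $config := admin:get-configuration()
let $config := admin:database-set-directory-creation(
    $config, xdmp:database("your-output-database"), "manual")
return admin:save-configuration($config)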
This sample application loads the contents of zip files in an HDFS directory into a MarkLogic Server database as documents. The destination database is the database associated with your XDBC App Server. The sample uses HDFS for input and MarkLogic Server for output. This sample is a map-only job. That is, there is no reduce step.
This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.
marklogic-textin-docout.xml
Follow these steps to run the ZipContentLoader sample application:

1. Create an input directory. With standalone Hadoop, create it in the local file system; with pseudo-distributed or fully-distributed Hadoop, create it in HDFS. For example:

   $ mkdir ~/zip_input_dir
   $ hdfs dfs -mkdir zip_input_dir

2. Copy your zip files into the input directory. For example:

   $ cp your_input_files ~/zip_input_dir         # standalone
   $ hdfs dfs -put your_data.zip zip_input_dir   # distributed

3. If necessary, edit marklogic-textin-docout.xml to set the output content type. For example, to load binary files, add:

   <property>
     <name>mapreduce.marklogic.output.content.type</name>
     <value>binary</value>
   </property>

4. Run the sample using the following command, with suitable substitutions for $CONNECTOR_HOME, zip_input_dir, and the connector version:

   hadoop jar \
       $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
       com.marklogic.mapreduce.examples.ZipContentLoader \
       -libjars $LIBJARS marklogic-textin-docout.xml zip_input_dir
The document URIs in the database correspond to the paths within the zip file. For example, if the zip file contents are rooted at a folder named enwiki, and that folder contains a file named Wisconsin, then the resulting document URI is:
enwiki/Wisconsin
If you receive an error similar to the following, use the Admin Interface to change the directory creation database configuration setting to manual:
java.lang.IllegalStateException: Manual directory creation mode is required.