This chapter provides procedures for installing and configuring Apache Hadoop MapReduce and the MarkLogic Connector for Hadoop, and for running a simple MapReduce job that interacts with MarkLogic Server. For more examples, see Using the Sample Applications.
This chapter includes the following sections:
This section covers the following topics:
The MarkLogic Connector for Hadoop is a Java-only API and is available only on Linux. You can use the connector with any of the Hadoop distributions listed below. Though the connector is supported only on these distributions, it may work with others, such as an equivalent version of Apache Hadoop.
The following software is required to use the MarkLogic Connector for Hadoop:
Apache Hadoop only supports the Oracle/Sun JDK, though other JDKs may work. For details, see http://wiki.apache.org/hadoop/HadoopJavaVersions.
The user with which a MapReduce job accesses MarkLogic Server must have appropriate privileges for the content accessed by the job, such as permission to read or update documents in the target database. Specify the user in the mapreduce.marklogic.input.username and mapreduce.marklogic.output.username job configuration properties. See Configuring a MapReduce Job.
In addition, the input and output user must use one of the pre-defined roles listed below:

- hadoop-user-read: Enables use of MarkLogic Server as an input source.
- hadoop-user-write: Enables use of MarkLogic Server as an output destination.
- hadoop-user-all: Combines hadoop-user-read and hadoop-user-write.

The hadoop-internal role is for internal use only. Do not assign this role to any users. This role is used to amp special privileges within the context of certain functions of the Hadoop MapReduce Connector. Assigning this role to users gives them privileges on the system that you typically do not want them to have.
For details about roles and privileges, see the Security Guide.
This section assumes you have already installed Hadoop, according to the instructions for your distribution. Follow these instructions to install the MarkLogic Connector for Hadoop in a single node Hadoop configuration. For information about installation in a Hadoop cluster, see Making the Connector Available Across a Hadoop Cluster.
These instructions assume you have the following environment variables set:
- HADOOP_CONF_DIR: The directory containing your Hadoop configuration files. This location is dependent on your Hadoop distribution. For example, CDH uses /etc/hadoop/conf by default.
- JAVA_HOME: The root of your JRE installation.

Use the following procedure to install the MarkLogic Connector for Hadoop. You might need to modify some of the example commands, depending on your version of MarkLogic, the connector, or your Hadoop distribution.
1. Unpack the connector package to a location of your choice. For example, assuming /space/marklogic contains the connector zip file and you install the MarkLogic Connector for Hadoop in /space/marklogic/mapreduce:

   $ cd /space/marklogic
   $ mkdir mapreduce; cd mapreduce
   $ unzip ../Connector-for-Hadoop2-2.1.zip

2. Unpack the XCC for Java package to a location of your choice. The XCC installation directory, such as /space/marklogic/xcc, is referred to as $XCC_HOME in this guide. For example:

   $ cd /space/marklogic
   $ mkdir xcc; cd xcc
   $ unzip ../MarkXCC.Java-8.0.zip

3. Configure Hadoop to find the MarkLogic Connector for Hadoop libraries before you use MarkLogic Server in a MapReduce job. See Configuring Your Environment to Use the Connector.
Before using the MarkLogic Connector for Hadoop with your Hadoop installation for the first time, set the environment variables described in this section. Only HADOOP_CLASSPATH
is required, but the rest of this guide assumes you set the optional variables.
1. Set CONNECTOR_HOME in your shell environment to facilitate using the example commands in this guide. The MarkLogic Connector for Hadoop installation directory is referred to as $CONNECTOR_HOME in this guide. For example:

   $ export CONNECTOR_HOME=/space/marklogic/mapreduce
2. Set HADOOP_CLASSPATH in your shell environment to include the MarkLogic Connector for Hadoop and XCC JAR files. For example, if using MarkLogic 9, the required libraries are:

   $CONNECTOR_HOME/lib/commons-modeler-2.0.1.jar
   $CONNECTOR_HOME/lib/marklogic-mapreduce2-2.1.jar
   $XCC_HOME/lib/marklogic-xcc-8.0.jar

   For example, on Linux, use the following command (all on one line, with no whitespace):

   export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$CONNECTOR_HOME/lib/commons-modeler-2.0.1.jar:$CONNECTOR_HOME/lib/marklogic-mapreduce2-2.1.jar:$XCC_HOME/lib/marklogic-xcc-8.0.jar
The JAR file names can vary across MarkLogic Server releases.
3. Set the LIBJARS variable in your shell environment to the same JAR files you specified in HADOOP_CLASSPATH, but separated by commas. This variable is used for the value of the Hadoop -libjars option in the example commands; it tells Hadoop where to find the MarkLogic JAR files. For example, you can use the following command on Linux (all on one line, with no whitespace):

   export LIBJARS=$CONNECTOR_HOME/lib/commons-modeler-2.0.1.jar,$CONNECTOR_HOME/lib/marklogic-mapreduce2-2.1.jar,$XCC_HOME/lib/marklogic-xcc-8.0.jar
Hadoop MapReduce and the MarkLogic Connector for Hadoop are now ready for use.
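At this point you can optionally smoke-test the XCC connection outside of Hadoop. The following is a minimal sketch, not part of the distribution; the user, password, host, and port in the xcc:// URI are placeholders you must replace, and the XCC JAR must be on the classpath when you compile and run it:

   import java.net.URI;

   import com.marklogic.xcc.ContentSource;
   import com.marklogic.xcc.ContentSourceFactory;
   import com.marklogic.xcc.ResultSequence;
   import com.marklogic.xcc.Session;

   public class XccSmokeTest {
       public static void main(String[] args) throws Exception {
           // Placeholder credentials and host; substitute your own.
           ContentSource cs = ContentSourceFactory.newContentSource(
               URI.create("xcc://my-user:my-password@localhost:8000"));
           Session session = cs.newSession();
           try {
               // Run a trivial query to prove the server is reachable.
               ResultSequence rs = session.submitRequest(
                   session.newAdhocQuery("xdmp:version()"));
               System.out.println("Connected to MarkLogic " + rs.asString());
           } finally {
               session.close();
           }
       }
   }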
This section walks through configuring and running a simple HelloWorld sample job, assuming MarkLogic Server and Apache Hadoop are installed on the same single node, as described in Installing the MarkLogic Connector for Hadoop.
The following steps are covered:
The MarkLogic Connector for Hadoop requires a MarkLogic Server installation configured with an XDBC App Server. When you install MarkLogic Server, a suitable XDBC App Server attached to the Documents database comes pre-configured on port 8000.
The example commands in this guide assume you're using this port 8000 App Server and database, and therefore no additional setup is required.
However, you can choose to use a different database, or a different App Server and database:

- To use the pre-configured App Server on port 8000 with a different database, set the mapreduce.marklogic.output.databasename configuration property when you follow the steps in Configuring the Job. (A similar property exists for overriding the default database when using MarkLogic for input.)

This section covers loading the sample data in two ways: using Query Console to load the data with simple XQuery, or using the MarkLogic Content Pump (mlcp) command.
MarkLogic Content Pump (mlcp) is a command line tool for transferring content into or out of MarkLogic Server, or copying content between MarkLogic Server instances.
Before running this procedure, you should have mlcp installed and the mlcp bin/ directory on your path; for details, see Installation and Configuration in the mlcp User Guide.
Follow these instructions to initialize the input database using MarkLogic Content Pump (mlcp).
1. Create a directory to use as your work area and change to it. For example:

   mkdir /space/examples/hello
   cd /space/examples/hello

2. Create a data directory to hold the input data files. For example:

   mkdir data

3. Create a file named hello.xml in the data directory, with the following contents:

   <data><child>hello mom</child></data>

   For example, run the following command:

   cat > data/hello.xml
   <data><child>hello mom</child></data>
   ^D

4. Create a file named world.xml in the data directory, with the following contents:

   <data><child>world event</child></data>

   For example, run the following command:

   cat > data/world.xml
   <data><child>world event</child></data>
   ^D
5. Use mlcp to load the files into the database, using the -output_uri_replace option to strip off the directory prefix from the database document URIs. For example:

   $ mlcp.sh import -username user -password password -host localhost \
       -port 8000 -input_file_path /space/examples/hello/data \
       -output_uri_replace "/space/examples/hello/data/,''"

   When the command completes, the database contains the documents hello.xml and world.xml.

You can also use mlcp to load files from HDFS by specifying an HDFS path for -input_file_path. For example, if your files are in HDFS under /user/me/hello/data, then you could use the following command:
$ mlcp.sh import -username user -password password -host localhost \
    -port 8000 -input_file_path hdfs:/user/me/hello/data \
    -output_uri_replace "/user/me/hello/data/,''"
Follow these instructions to initialize the input database with the sample documents using Query Console. For details about Query Console, see the Query Console User Guide.
To load the database with the sample data:
1. Launch Query Console in your browser. For example, if MarkLogic Server is installed on myhost, visit this URL in the browser:

   http://myhost:8000/qconsole

2. Create a new query in Query Console and replace the default contents with the following:

   xquery version "1.0-ml";
   let $hello := <data><child>hello mom</child></data>
   let $world := <data><child>world event</child></data>
   return (
     xdmp:document-insert("hello.xml", $hello),
     xdmp:document-insert("world.xml", $world)
   )
3. In the Content Source dropdown, select the input XDBC App Server you configured for input in Selecting the App Server and Database.

4. Select Text as the output format and click Run to execute the query.

5. Optionally, explore the database to confirm the presence of hello.xml and world.xml.

Before running the HelloWorld sample job, set the connector configuration properties that identify the MarkLogic Server user and instance for input and output.
Although the input and output MarkLogic Server instances and users can be different, this example configures the job to use the same host, port, and database for both input and output.
Configuration also includes an input and an output user name and password. Choose (or create) a MarkLogic user with sufficient privileges to access your XDBC App Server, and read and insert documents in the attached database. If using a non-admin user, assign the user to the hadoop-user-all
role. For details, see Security Requirements for MapReduce Jobs.
1. Copy the marklogic-hello-world.xml configuration file from $CONNECTOR_HOME/conf to your work area. For example:

   $ cp $CONNECTOR_HOME/conf/marklogic-hello-world.xml /space/examples/hello

2. Edit your local copy of marklogic-hello-world.xml to configure your input and output host name, port, user name, and password. Set the following parameters to match your environment:

   mapreduce.marklogic.input.username
   mapreduce.marklogic.input.password
   mapreduce.marklogic.input.host
   mapreduce.marklogic.input.port
   mapreduce.marklogic.output.username
   mapreduce.marklogic.output.password
   mapreduce.marklogic.output.host
   mapreduce.marklogic.output.port
The configured input user must have sufficient privileges to access the XDBC App Server identified by the input host/port and to read documents from the input database.
The configured output user must have sufficient privileges to access the XDBC App Server identified by the output host/port and to insert documents in the output database.
For example, if your MarkLogic installation is on localhost and you use the pre-configured App Server on port 8000 with the username and password my-user and my-password for input, then your input connection related property settings should be similar to the following after editing:
<property>
  <name>mapreduce.marklogic.input.username</name>
  <value>my-user</value>
</property>
<property>
  <name>mapreduce.marklogic.input.password</name>
  <value>my-password</value>
</property>
<property>
  <name>mapreduce.marklogic.input.host</name>
  <value>localhost</value>
</property>
<property>
  <name>mapreduce.marklogic.input.port</name>
  <value>8000</value>
</property>
Your output connection related property settings should have similar values.
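If you later write your own driver, you can also set the same connection properties programmatically instead of editing an XML file. The following is a minimal sketch, not taken from the shipped samples; the property names come from this guide, while the HelloWorldConfig class name and all values are placeholders:

   import org.apache.hadoop.conf.Configuration;

   public class HelloWorldConfig {
       public static Configuration build() {
           Configuration conf = new Configuration();
           // Equivalent to editing marklogic-hello-world.xml; values are placeholders.
           conf.set("mapreduce.marklogic.input.username", "my-user");
           conf.set("mapreduce.marklogic.input.password", "my-password");
           conf.set("mapreduce.marklogic.input.host", "localhost");
           conf.set("mapreduce.marklogic.input.port", "8000");
           conf.set("mapreduce.marklogic.output.username", "my-user");
           conf.set("mapreduce.marklogic.output.password", "my-password");
           conf.set("mapreduce.marklogic.output.host", "localhost");
           conf.set("mapreduce.marklogic.output.port", "8000");
           // Alternatively, load the settings from the edited configuration file:
           // conf.addResource(new org.apache.hadoop.fs.Path("marklogic-hello-world.xml"));
           return conf;
       }
   }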
The HelloWorld
sample reads the first word of text from the input documents, concatenates the words into a string, and saves the result as HelloWorld.txt
. Assuming the database contains only the documents created in Loading the Sample Data, the output document contains the phrase hello world. If your database contains additional documents, you get different results.
To view the sample code, see $CONNECTOR_HOME/src/com/marklogic/mapreduce/examples.
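For orientation before you open the source, a connector-based driver is wired up roughly as shown below. This is an illustrative sketch, not the shipped HelloWorld code; DocumentInputFormat and ContentOutputFormat are classes from the connector API, while the mapper and reducer (commented out) stand in for the sample's real classes:

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.mapreduce.Job;

   import com.marklogic.mapreduce.ContentOutputFormat;
   import com.marklogic.mapreduce.DocumentInputFormat;

   public class SketchDriver {
       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           // Connection properties, as edited in Configuring the Job
           // (the file must be on the classpath for addResource to find it).
           conf.addResource("marklogic-hello-world.xml");
           Job job = Job.getInstance(conf, "hello world sketch");
           job.setJarByClass(SketchDriver.class);
           // Read documents from MarkLogic and write results back as documents.
           job.setInputFormatClass(DocumentInputFormat.class);
           job.setOutputFormatClass(ContentOutputFormat.class);
           // job.setMapperClass(MyMapper.class);   // hypothetical mapper
           // job.setReducerClass(MyReducer.class); // hypothetical reducer
           System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }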
Use the following procedure to run the example MapReduce job:
1. Change to your work area directory. For example:

   cd /space/examples/hello

2. Start the job, ensuring the hadoop command is in your path. For example:

   hadoop jar \
     $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
     com.marklogic.mapreduce.examples.HelloWorld -libjars $LIBJARS \
     -conf marklogic-hello-world.xml
The -conf command line option tells Hadoop where to get application-specific configuration information. You can also add a configuration directory to HADOOP_CLASSPATH.
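Hadoop recognizes generic options such as -conf and -libjars when a driver routes its arguments through Hadoop's generic option handling. A common way to do this in your own driver (a sketch of the standard Tool/ToolRunner convention, not necessarily how the shipped samples are written) is:

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.conf.Configured;
   import org.apache.hadoop.util.Tool;
   import org.apache.hadoop.util.ToolRunner;

   public class SketchTool extends Configured implements Tool {
       @Override
       public int run(String[] args) throws Exception {
           Configuration conf = getConf(); // already reflects -conf and -D settings
           // ... build and submit the job using conf ...
           return 0;
       }

       public static void main(String[] args) throws Exception {
           // ToolRunner strips the generic options before calling run().
           System.exit(ToolRunner.run(new Configuration(), new SketchTool(), args));
       }
   }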
As the job runs, Hadoop reports the job progress to stdout. If the sample job does not run or does not produce the expected results, see Troubleshooting and Debugging.
Near the end of the job output, you should see text similar to the following. Notice there are 2 map input records (hello.xml
and world.xml
), 2 map output records (the first word from each input record), and 1 reduce output record (HelloWorld.txt
).
timestamp INFO mapreduce.Job:  map 100% reduce 100%
timestamp INFO mapreduce.Job: Job jobId completed successfully
timestamp mapreduce.Job: Counters: 33
    File System Counters
    ...
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Map output bytes=20
        Map output materialized bytes=30
        Input split bytes=91
        Combine input records=0
        Combine output records=0
        Reduce input groups=1
        Reduce shuffle bytes=30
        Reduce input records=2
        Reduce output records=1
Use Query Console to explore the output database and examine the output document, HelloWorld.txt
. The document should contain the phrase hello world.
If you do not see the expected output, see the tips in Troubleshooting and Debugging.
When you submit a MapReduce job to run on an Apache Hadoop cluster, the job resources must be accessible by the master Job Tracker node and all worker nodes. Job resources include the job JAR file, configuration files, and all dependent libraries. When you use the MarkLogic Connector for Hadoop in your job, this includes the connector and XCC JAR files.
You must always have the job resources available on the Hadoop node where you launch the job. Depending on the method you use to make the job resources available across the cluster, dependent JAR files, such as the MarkLogic Connector for Hadoop libraries, must be on the HADOOP_CLASSPATH on the node where you launch the job, as described in Configuring Your Environment to Use the Connector.
Hadoop offers many options for making job resources available to the worker nodes, including:

- Using the -libjars Hadoop command line option and parsing the options in your main class using org.apache.hadoop.util.GenericOptionsParser.

The best solution depends upon the needs of your application and environment. See the Apache Hadoop documentation for more details on making resources available across a Hadoop cluster. This guide uses -libjars.
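For reference, explicit parsing with GenericOptionsParser looks roughly like the following. This is a minimal sketch using the standard Hadoop API; the SketchMain class name is a placeholder:

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.util.GenericOptionsParser;

   public class SketchMain {
       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           // Consumes -libjars, -conf, -D, etc., applying them to conf;
           // whatever is left over belongs to the application.
           String[] appArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
           // ... use conf and appArgs to set up the job ...
       }
   }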
The MarkLogic Connector for Hadoop is developed and maintained as an open source project on GitHub. To access the sources or contribute to the project, navigate to the following URL in your browser:
http://github.com/marklogic/marklogic-contentpump
The GitHub project includes both the connector and the mlcp command line tool.
The MarkLogic Connector for Hadoop distribution has the following layout:
| Document or Directory | Description |
|---|---|
| conf/ | The XML config files for the sample applications. For details, see Using the Sample Applications. |
| docs/ | The Javadoc for the connector, in both expanded HTML and compressed zip format. |
| lib/ | The connector and connector examples JAR files, marklogic-mapreduce2-version.jar and marklogic-mapreduce-examples-version.jar. Note that the JAR file names include the version number, so the names in your installation might be slightly different. |
| src/ | The source code for the sample applications. |
| sample-data/ | The data used by several of the examples. For details, see Using the Sample Applications. |