This chapter provides procedures for installing and configuring Apache Hadoop MapReduce and the MarkLogic Connector for Hadoop, and for running a simple MapReduce job that interacts with MarkLogic Server. For more examples, see Using the Sample Applications.
The MarkLogic Connector for Hadoop is a Java-only API and is available only on Linux. The connector is supported on the Hadoop distributions listed below, but it may also work with other distributions, such as an equivalent version of Apache Hadoop.
Apache Hadoop only supports the Oracle/Sun JDK, though other JDKs may work. For details, see http://wiki.apache.org/hadoop/HadoopJavaVersions.
The user with which a MapReduce job accesses MarkLogic Server must have appropriate privileges for the content accessed by the job, such as permission to read or update documents in the target database. Specify the user in the
mapreduce.marklogic.input.username and mapreduce.marklogic.output.username job configuration properties. See Configuring a MapReduce Job.
|Role|Description|
|---|---|
|hadoop-user-read|Enables use of MarkLogic Server as an input source for a MapReduce job. This role does not grant any other privileges.|
|hadoop-user-write|Enables use of MarkLogic Server as an output destination for a MapReduce job. This role does not grant any other privileges.|
|hadoop-user-all|Combines the privileges of hadoop-user-read and hadoop-user-write.|
hadoop-internal role is for internal use only. Do not assign this role to any users. This role is used to amp special privileges within the context of certain functions of the Hadoop MapReduce Connector. Assigning this role to users gives them privileges on the system that you typically do not want them to have.
For details about roles and privileges, see the Understanding and Using Security Guide.
This section assumes you have already installed Hadoop according to the instructions for your distribution. Follow these instructions to install the MarkLogic Connector for Hadoop in a single-node Hadoop configuration. For information about installation in a Hadoop cluster, see Making the Connector Available Across a Hadoop Cluster.
HADOOP_CONF_DIR: The directory containing your Hadoop Configuration files. This location is dependent on your Hadoop distribution. For example, CDH uses
JAVA_HOME: The root of your JRE installation.
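For instance, the two variables might be set as follows. Both paths are assumptions: the configuration path matches CDH's default layout, and the JAVA_HOME shown is one common Oracle JDK install location; substitute your distribution's actual locations.

```shell
# Paths are assumptions -- substitute your distribution's actual locations.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
```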
Use the following procedure to install the MarkLogic Connector for Hadoop. You might need to modify some of the example commands, depending on your version of MarkLogic, the connector, or your Hadoop distribution.
/space/marklogic contains the connector zip file and you install the MarkLogic Connector for Hadoop in
/space/marklogic/xcc, which is referred to as
$XCC_HOME in this guide.
Hadoop must be configured to find the MarkLogic Connector for Hadoop libraries before you can use MarkLogic Server in a MapReduce job. See Configuring Your Environment to Use the Connector.
Before using the MarkLogic Connector for Hadoop with your Hadoop installation for the first time, set the environment variables described in this section. Only
HADOOP_CLASSPATH is required, but the rest of this guide assumes you set the optional variables.
Set CONNECTOR_HOME in your shell environment to facilitate using the example commands in this guide. The MarkLogic Connector for Hadoop installation directory is referred to as
$CONNECTOR_HOME throughout this guide. For example:
Set HADOOP_CLASSPATH in your shell environment to include the MarkLogic Connector for Hadoop and XCC JAR files. For example, if using MarkLogic 8, the required libraries are:
Set the LIBJARS variable in your shell environment to the same JAR files you specified in
HADOOP_CLASSPATH, but separated by commas. This variable supplies the value of the Hadoop
-libjars option in the example commands; it tells Hadoop where to find the MarkLogic JAR files.
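The environment setup above can be sketched as shell commands. The install locations, JAR file names, and versions below are assumptions; substitute the directories and JARs shipped with your connector and XCC installations:

```shell
# Assumed install locations and JAR names/versions -- adjust to your setup.
export CONNECTOR_HOME=/space/marklogic/mapreduce
export XCC_HOME=/space/marklogic/xcc

# Colon-separated classpath with the connector and XCC JARs (file names
# here are examples; use the JARs from your actual installations).
export HADOOP_CLASSPATH=$CONNECTOR_HOME/lib/marklogic-mapreduce2-2.1.jar:$XCC_HOME/lib/marklogic-xcc-8.0.jar

# LIBJARS holds the same list, comma-separated, for Hadoop's -libjars option.
export LIBJARS=$(printf '%s' "$HADOOP_CLASSPATH" | tr ':' ',')
```

Deriving LIBJARS from HADOOP_CLASSPATH with tr keeps the two lists from drifting apart when you upgrade a JAR.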
This section walks through configuring and running a simple HelloWorld sample job, assuming MarkLogic Server and Apache Hadoop are installed on the same single node, as described in Installing the MarkLogic Connector for Hadoop.
The MarkLogic Connector for Hadoop requires a MarkLogic Server installation configured with an XDBC App Server. When you install MarkLogic Server, a suitable XDBC App Server attached to the Documents database comes pre-configured on port 8000.
com.marklogic.output.databasename configuration property when you follow the steps in Configuring the Job. (A similar property exists for overriding the default database when using MarkLogic for input.)
-output_uri_replace option to strip the directory prefix from the database document URI. For example:
world.xml in the database.
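A hedged sketch of such an mlcp command follows; the host, port, credentials, and data directory are assumptions, and mlcp must be installed. The command is assembled into a shell variable and printed so you can review it before pasting it into your shell:

```shell
# Hypothetical mlcp import: the -output_uri_replace pattern strips the
# local directory prefix so documents get URIs like world.xml rather than
# the full filesystem path. Connection details are assumptions.
cmd="mlcp.sh import -host localhost -port 8000 \\
  -username my-user -password my-password \\
  -input_file_path /space/marklogic/hello/data \\
  -output_uri_replace \"/space/marklogic/hello/data/,''\""
echo "$cmd"
```

The -output_uri_replace value is a regex,'replacement' pair; here the replacement is the empty string ''.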
You can also use mlcp to load files from HDFS by specifying an HDFS path for -input_file_path. For example, if your files are in HDFS under
/user/me/hello/data, then you could use the following command:
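Under the same assumptions (hypothetical host, port, and credentials; the hdfs:// namenode address is also an assumption), a sketch of the HDFS variant, again printed for review rather than executed:

```shell
# Hypothetical mlcp import reading from HDFS. The hdfs:// authority
# (namenode host and port) is an assumption; use your cluster's address.
cmd="mlcp.sh import -host localhost -port 8000 \\
  -username my-user -password my-password \\
  -input_file_path hdfs://localhost:9000/user/me/hello/data"
echo "$cmd"
```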
Follow these instructions to initialize the input database with the sample documents using Query Console. For details about Query Console, see the Query Console User Guide.
For example, if MarkLogic Server is installed on myhost, visit this URL in the browser:
In the Content Source dropdown, select the XDBC App Server you configured for input in Selecting the App Server and Database.
Select Text as the output format and click Run to execute the query.
world.xml in the database.
Configuration also includes an input and an output user name and password. Choose (or create) a MarkLogic user with sufficient privileges to access your XDBC App Server and to read and insert documents in the attached database. If using a non-admin user, assign the user to the
hadoop-user-all role. For details, see Security Requirements for MapReduce Jobs.
Copy the marklogic-hello-world.xml configuration file from
$CONNECTOR_HOME/conf to your work area. For example:
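As one possible layout, the copy might look like the following; the ~/marklogic work-area path is an arbitrary assumption, and $CONNECTOR_HOME must point at your connector installation:

```shell
# Copy the sample config into a private work area so you can edit it.
# ~/marklogic is an arbitrary choice; any writable directory works.
mkdir -p ~/marklogic
cp "${CONNECTOR_HOME:-/space/marklogic/mapreduce}/conf/marklogic-hello-world.xml" \
   ~/marklogic/ || echo "adjust CONNECTOR_HOME to match your installation"
```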
Edit marklogic-hello-world.xml to configure your input and output host name, port, user name, and password. Set the following parameters to match your environment:
For example, if your MarkLogic installation is on localhost and you use the pre-configured App Server on port 8000 with the username and password 'my-user' and 'my-password' for input, then your input connection related property settings should be similar to the following after editing:
<property>
  <name>mapreduce.marklogic.input.username</name>
  <value>my-user</value>
</property>
<property>
  <name>mapreduce.marklogic.input.password</name>
  <value>my-password</value>
</property>
<property>
  <name>mapreduce.marklogic.input.host</name>
  <value>localhost</value>
</property>
<property>
  <name>mapreduce.marklogic.input.port</name>
  <value>8000</value>
</property>
The HelloWorld sample reads the first word of text from each input document, concatenates the words into a string, and saves the result as
HelloWorld.txt. Assuming the database contains only the documents created in Loading the Sample Data, the output document contains the phrase 'hello world'. If your database contains additional documents, you get different results.
hadoop command is in your path.
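The job launch can be sketched as follows. The examples JAR name and version are assumptions (check $CONNECTOR_HOME/lib for the actual file name); the command is stored in a variable and printed so you can inspect it before running it from the directory containing your edited config file:

```shell
# Hypothetical invocation of the HelloWorld sample. The examples JAR name
# and version are assumptions; -conf points at your edited config file,
# and -libjars distributes the connector and XCC JARs to the cluster.
cmd='hadoop jar $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-2.1.jar \
  com.marklogic.mapreduce.examples.HelloWorld \
  -libjars $LIBJARS -conf marklogic-hello-world.xml'
echo "$cmd"
```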
As the job runs, Hadoop reports the job progress to stdout. If the sample job does not run or does not produce the expected results, see Troubleshooting and Debugging.
Near the end of the job output, you should see text similar to the following. Notice there are 2 map input records (
hello.xml and world.xml), 2 map output records (the first word from each input record), and 1 reduce output record.
timestamp INFO mapreduce.Job:  map 100% reduce 100%
timestamp INFO mapreduce.Job: Job jobId completed successfully
timestamp mapreduce.Job: Counters: 33
  File System Counters
  ...
  Map-Reduce Framework
    Map input records=2
    Map output records=2
    Map output bytes=20
    Map output materialized bytes=30
    Input split bytes=91
    Combine input records=0
    Combine output records=0
    Reduce input groups=1
    Reduce shuffle bytes=30
    Reduce input records=2
    Reduce output records=1
If you do not see the expected output, see the tips in Troubleshooting and Debugging.
When you submit a MapReduce job to run on an Apache Hadoop cluster, the job resources must be accessible by the master Job Tracker node and all worker nodes. Job resources include the job JAR file, configuration files, and all dependent libraries. When you use the MarkLogic Connector for Hadoop in your job, this includes the connector and XCC JAR files.
You must always have the job resources available on the Hadoop node where you launch the job. Depending on the method you use to make the job resources available across the cluster, dependent JAR files, such as the MarkLogic Connector for Hadoop libraries, must be on the HADOOP_CLASSPATH on the node where you launch the job, as described in Configuring Your Environment to Use the Connector.
-libjars Hadoop command line option and parsing the options in your main class using GenericOptionsParser
The best solution depends on the needs of your application and environment. See the Apache Hadoop documentation for more details on making resources available across a Hadoop cluster. This guide uses the -libjars option.
|Document or Directory|Description|
|---|---|
|…|The XML config files for the sample applications. For details, see Using the Sample Applications.|
|…|The Javadoc for the connector in both expanded HTML and compressed zip format.|
|…|The connector and connector examples JAR files.|
|…|The source code for the sample applications.|
|…|The data used by several of the examples. For details, see Using the Sample Applications.|