MarkLogic 9 Product Documentation
MarkLogic Connector for Hadoop Developer's Guide
— Chapter 7

Using the Sample Applications

This chapter covers the following topics related to using the sample applications:

  • Set Up for All Samples
  • Additional Sample Data Setup
  • Interacting with HDFS
  • Sample Applications

Set Up for All Samples

The following topics apply to preparing to run all the sample applications:

  1. Install Required Software
  2. Configure Your Environment
  3. Copy the Sample Configuration Files
  4. Modify the Sample Configuration Files

The LinkCount samples, such as LinkCountInDoc and LinkCountValue, require additional preparation. See Additional Sample Data Setup.

For details about the individual samples, see Sample Applications.

Install Required Software

Install and configure MarkLogic Server, Hadoop MapReduce, and the MarkLogic Connector for Hadoop. For instructions, see Getting Started with the MarkLogic Connector for Hadoop.

The samples require at least one MarkLogic Server database and XDBC App Server. The examples in this chapter assume you're using the XDBC App Server on port 8000.

The LinkCount family of samples requires a specific database configuration and data set; see Additional Sample Data Setup. The other samples can be run against any XDBC App Server and database.

Multi-host Configuration Considerations

Getting Started with the MarkLogic Connector for Hadoop describes setting up a single-host configuration, where MarkLogic Server, Hadoop MapReduce, and the MarkLogic Connector for Hadoop are installed on the same host, and Hadoop MapReduce is configured for standalone operation. A multi-host configuration, with Hadoop MapReduce configured for pseudo-distributed or fully-distributed operation, more accurately represents a production deployment.

If you choose to use a multi-host, distributed configuration, be aware of the following:

  • The MarkLogic Server host configured for the job must be reachable by hostname from the Hadoop MapReduce worker nodes.
  • The MarkLogic Connector for Hadoop must be installed on the Hadoop MapReduce host on which you run the sample jobs.
  • Normally, you can use different MarkLogic Server instances for input and output, but the LinkCount samples expect the same database for both input and output.

Some of the samples use HDFS for input or output. If Hadoop is configured for pseudo- or fully-distributed operation, HDFS must be initialized before running the samples.

To check whether HDFS is initialized, run the following command. If HDFS is initialized, the command completes without error. For example:

$ hdfs dfs -ls /
drwxr-xr-x   - marklogic\me mygroup 0 2011-07-19 10:48 /tmp
drwxr-xr-x   - marklogic\me mygroup 0 2011-07-19 10:51 /user

If the command fails, HDFS might not be initialized. See Initializing HDFS.

Configure Your Environment

Before you begin, you should have the hadoop and java commands on your path. You should also set the environment variables covered in Configuring Your Environment to Use the Connector.

Copy the Sample Configuration Files

The sample applications include MapReduce configuration files containing MarkLogic Connector for Hadoop settings. To run the examples, you will have to modify these files. Therefore, you should copy the configuration files to a local directory of your choosing.

For example, to copy the configuration files to /space/examples/conf, use the following command:

cp $CONNECTOR_HOME/conf/*.xml /space/examples/conf

Place the directory containing your copy of the configuration files on HADOOP_CLASSPATH so that each sample job can find its configuration file. For example:

export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/space/examples/conf

Modify the Sample Configuration Files

For each sample you plan to run, modify the MarkLogic Connector for Hadoop sample configuration file in your Hadoop configuration directory to match your MarkLogic Server configuration.

The configuration file associated with each sample is listed below.

Sample                    Configuration File
HelloWorld                marklogic-hello-world.xml
LinkCountInDoc            marklogic-nodein-nodeout.xml
LinkCountInProperty       marklogic-textin-propout.xml
LinkCountValue            marklogic-textin-textout.xml
LinkCountCooccurrences    marklogic-lexicon.xml
LinkCount                 marklogic-advanced.xml
RevisionGrouper           marklogic-nodein-qryout.xml
BinaryReader              marklogic-subbinary.xml
ContentReader             marklogic-docin-textout.xml
ContentLoader             marklogic-textin-docout.xml
ZipContentLoader          marklogic-textin-docout.xml

The configuration properties requiring modification vary from sample to sample. For example, a sample which uses MarkLogic Server for input and HDFS for output will not include mapreduce.marklogic.output.* properties.

If the sample uses MarkLogic Server for input, modify at least the following config properties. For details, see Identifying the Input MarkLogic Server Instance.

  • mapreduce.marklogic.input.username: A MarkLogic user with privileges to read the input database.
  • mapreduce.marklogic.input.password: The password for the input user.
  • mapreduce.marklogic.input.host: localhost, or the host where your input MarkLogic instance is installed.
  • mapreduce.marklogic.input.port: 8000, or another port on which an XDBC App Server is listening.
  • mapreduce.marklogic.input.databasename: hadoop-samples. You will need to add this property to the configuration file.

If the sample uses MarkLogic Server for output, modify at least the following config properties. For details, see Identifying the Output MarkLogic Server Instance.

  • mapreduce.marklogic.output.username: A MarkLogic user with privileges to write to the output database.
  • mapreduce.marklogic.output.password: The password for the output user.
  • mapreduce.marklogic.output.host: localhost, or the host where your output MarkLogic instance is installed.
  • mapreduce.marklogic.output.port: 8000, or another port on which an XDBC App Server is listening.
  • mapreduce.marklogic.output.databasename: hadoop-samples. You will need to add this property to the configuration file.
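
For example, the input connection settings in a sample configuration file might look similar to the following. The property names are the ones listed above; the values shown are placeholders, so substitute the host, port, credentials, and database name for your environment. The output properties follow the same pattern with the mapreduce.marklogic.output.* names.

<!-- Placeholder values; adjust for your environment. -->
<property>
  <name>mapreduce.marklogic.input.username</name>
  <value>my-input-user</value>
</property>
<property>
  <name>mapreduce.marklogic.input.password</name>
  <value>my-password</value>
</property>
<property>
  <name>mapreduce.marklogic.input.host</name>
  <value>localhost</value>
</property>
<property>
  <name>mapreduce.marklogic.input.port</name>
  <value>8000</value>
</property>
<property>
  <name>mapreduce.marklogic.input.databasename</name>
  <value>hadoop-samples</value>
</property>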

Some samples might require additional customization. For details on a specific sample, see Sample Applications.

Additional Sample Data Setup

The following samples require a special database configuration and input data set. If you do not plan to run these samples, you can skip this section.

  • The LinkCount* samples (LinkCountInDoc, LinkCountValue, etc.)
  • RevisionGrouper

This section walks you through creating the MarkLogic Server environment required by these samples.

Creating the Database

Use the following information to create a database named hadoop-samples with 2 forests and 2 attribute range indexes. You can use a different database name. Use the defaults for any configuration parameters not mentioned in this example.

For detailed instructions, see Creating and Configuring Forests and Databases and Defining Attribute Range Indexes in the Administrator's Guide.

Configuration Parameter         Setting
database name                   hadoop-samples
forest names                    hadoop-samples-1, hadoop-samples-2

attribute range index 1
  scalar type                   string
  parent namespace uri          http://www.mediawiki.org/xml/export-0.4/
  parent localname              a
  localname                     href
  collation                     Unicode Codepoint
  range value positions         true

attribute range index 2
  scalar type                   string
  parent namespace uri          http://www.mediawiki.org/xml/export-0.4/
  parent localname              a
  localname                     title
  collation                     Unicode Codepoint
  range value positions         true

Creating the XDBC App Server

You can skip this step if you use the pre-configured XDBC App Server on port 8000.

Use the following information to create an XDBC App Server and attach it to the hadoop-samples database created in the previous section. You can use a different name and port.

For detailed instructions, see Creating and Configuring App Servers in the Administrator's Guide.

Configuration Parameter    Setting
xdbc server name           hadoop-samples-xdbc
root                       (any)
port                       9002
database                   hadoop-samples

Loading the Data

Load the data from $CONNECTOR_HOME/sample-data into the hadoop-samples database with a URI prefix of enwiki/. The instructions in this section use MarkLogic Content Pump (mlcp) to load the data, but you can choose a different method.

  1. If you do not already have an installation of mlcp, download and install it. For details, see Installation and Configuration in the mlcp User Guide.
  2. Put the mlcp.sh command on your path. For example:
    export PATH=${PATH}:MLCP_INSTALL_DIR/bin
  3. Run the following command to load the sample data. Substitute the values of the -username, -password, -host, and -port options to match your environment.
    mlcp.sh import -host localhost -port 8000 -database hadoop-samples \
      -username user -password password -mode local \
      -input_file_path $CONNECTOR_HOME/sample-data/ -document_type xml \
      -output_uri_replace "$CONNECTOR_HOME/sample-data,'enwiki'"
  4. Optionally, use Query Console to explore the hadoop-samples database and observe that the database contains 93 documents, all with an enwiki/ URI prefix.
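
For example, a query similar to the following, run in Query Console against the hadoop-samples database, reports the document count and shows a few of the loaded URIs. This is just a quick verification query, not part of the samples:

xquery version "1.0-ml";
(: Expect a count of 93; each URI should begin with enwiki/ :)
fn:count(fn:collection()),
fn:subsequence(for $d in fn:collection() return xdmp:node-uri($d), 1, 5)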

Interacting with HDFS

Some of the samples use HDFS for input or output. This section briefly summarizes how to copy data into or retrieve data from HDFS when using Hadoop in pseudo-distributed or fully-distributed configurations.

If you use Hadoop MapReduce standalone, you can skip this section. Standalone Hadoop is the configuration created in Getting Started with the MarkLogic Connector for Hadoop. In a standalone configuration, HDFS uses the local file system directly. You do not need to initialize HDFS, and you can use normal Unix commands to work with the input and output files. You may still use HDFS commands to examine the file system.

This section covers the following topics related to pseudo-distributed and fully-distributed HDFS operation:

  • Initializing HDFS
  • Accessing Results Saved to HDFS
  • Placing Content in HDFS to Use as Input

Use the following command to see all available HDFS commands, or consult the documentation for your Hadoop distribution.

$ hdfs dfs -help

Initializing HDFS

If you use Hadoop MapReduce in pseudo-distributed or fully-distributed mode, HDFS must be formatted before you can run the samples. If your HDFS installation is not already initialized, consult the documentation for your Hadoop distribution for instructions.

For example, with Apache Hadoop, you can run the following command to initialize HDFS:

$ hdfs namenode -format

Near the end of the output, you should see a message that HDFS has been successfully formatted. For example:

...
************************************************************/
11/10/03 09:35:14 INFO namenode.FSNamesystem:...
11/10/03 09:35:14 INFO namenode.FSNamesystem: supergroup=supergroup
11/10/03 09:35:14 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/10/03 09:35:14 INFO common.Storage: Image file of size 98 saved ...
11/10/03 09:35:14 INFO common.Storage: Storage directory /tmp/hadoop-sample/dfs/name has been successfully formatted.
11/10/03 09:35:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at sample.marklogic.com...
************************************************************/

If formatting succeeds, you can successfully use the ls command to examine HDFS. For example:

$ hdfs dfs -ls /

Accessing Results Saved to HDFS

Some of the sample applications store results to HDFS. You can browse HDFS and examine results from the command line or through your web browser.

In pseudo-distributed or fully-distributed configurations, HDFS output pathnames given on the command line of an example are relative to /user/your_username in HDFS by default. For example, if you run the LinkCountValue example, which saves results to HDFS, and specify the output directory as linkcountvalue, then the results are in HDFS under /user/your_username/linkcountvalue.

To access HDFS through your web browser, use the HDFS NameNode administration page. By default, this interface is available on port 50070 on the NameNode host; consult the documentation for your Hadoop distribution. Assuming localhost is the NameNode, browse to this URL and click on the Browse the file system link near the top of the page to browse HDFS:

http://localhost:50070

To browse HDFS from the command line, use a command similar to the following:

$ hdfs dfs -ls /user/your_username

For example, if you run the LinkCountValue example and specify the output directory as linkcountvalue, you would see results similar to the following after the job completes:

$ hdfs dfs -ls /user/me/linkcountvalue
drwxr-xr-x   - me mygroup ... /user/me/linkcountvalue/_logs
-rw-r--r--   1 me mygroup ... /user/me/linkcountvalue/part-r-00000

The results are in the part-r-XXXXX file. To see the last few lines of the results, use a command similar to the following:

$ hdfs dfs -tail /user/me/linkcountvalue/part-r-00000

To copy the result from HDFS to your system's file system, use a command similar to the following:

$ hdfs dfs -get /user/me/linkcountvalue/part-r-00000 \
     /my/destination/linkcountvalue.txt

Placing Content in HDFS to Use as Input

Some of the samples use HDFS for input. These samples require you to copy the input data to HDFS before running the sample. Place the input files under /user/your_username in HDFS using a command such as the following:

$ hdfs dfs -put ./mycontent.zip /user/me/zipcontentloader

Relative pathnames are relative to /user/your_username, so to check the file copied into HDFS above, use a command similar to the following:

$ hdfs dfs -ls /user/me/zipcontentloader
-rw-r--r--   1 me mygroup ... /user/me/zipcontentloader/mycontent.zip

When you copy files into HDFS, there must not be a pre-existing file of the same name.
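
If a file of the same name already exists in the target directory, remove it first. This is ordinary HDFS housekeeping, not part of the samples. For example:

$ hdfs dfs -rm /user/me/zipcontentloader/mycontent.zip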

Sample Applications

This section contains detailed instructions for running each of the samples summarized below.

The MarkLogic Connector for Hadoop distribution includes the following resources related to the samples:

  • Source code, in $CONNECTOR_HOME/src.
  • Compiled code, in $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar.
  • Javadoc. See the package com.marklogic.mapreduce.examples in the Javadoc under $CONNECTOR_HOME/docs.

The sample applications are:

HelloWorld (input: MarkLogic Server; output: MarkLogic Server)
    Reads the first word from text in input XML documents, concatenates the words, then stores the result as a new text document in MarkLogic Server.

LinkCountInDoc (input: MarkLogic Server; output: MarkLogic Server)
    Counts href link title attributes in documents in MarkLogic Server, then stores the count as a child node of the referenced document.

LinkCountInProperty (input: MarkLogic Server; output: MarkLogic Server)
    Counts href link title attributes in documents in MarkLogic Server, then stores the count as a property of the referenced document.

LinkCountValue (input: MarkLogic Server; output: HDFS)
    Counts href link title attributes in documents in MarkLogic Server, then stores the counts in HDFS text files.

LinkCountCooccurrences (input: MarkLogic Server; output: HDFS)
    Counts href link title attributes in documents in MarkLogic Server using a lexicon function, then stores the counts in HDFS text files.

LinkCount (input: MarkLogic Server; output: HDFS)
    Equivalent to LinkCountValue, but demonstrates using advanced input mode to provide your own input split and input queries.

RevisionGrouper (input: MarkLogic Server; output: MarkLogic Server)
    Demonstrates the use of a custom output query, using KeyValueOutputFormat.

BinaryReader (input: MarkLogic Server; output: HDFS)
    Demonstrates using advanced input mode with an input query optimized using the split range.

ContentReader (input: MarkLogic Server; output: HDFS)
    Reads documents in a MarkLogic Server database, using an SSL-enabled connection, then writes the contents to HDFS text files.

ContentLoader (input: HDFS; output: MarkLogic Server)
    Reads text files in HDFS, then stores the contents as documents in a MarkLogic Server database.

ZipContentLoader (input: HDFS; output: MarkLogic Server)
    Reads text files from zip files in HDFS, then stores the contents as documents in a MarkLogic Server database.

HelloWorld

This example extracts the first word from all the XML documents in a MarkLogic Server database containing text nodes, sorts the words, concatenates them into a single string, and saves the result as a text document in MarkLogic Server. The example uses basic input mode with the default document selector and subdocument expression. The example uses MarkLogic Server for both input and output.

For detailed instructions on configuring and running this sample, see Running the HelloWorld Sample Application.

Though you can use the sample with any input documents, it is intended to be used with a small data set. It is not optimized for efficient resource use across large data sets. Only XML documents with text nodes contribute to the final results.

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-hello-world.xml

Use the following command to run the example job, with suitable substitution for $CONNECTOR_HOME and the connector version:

hadoop jar \
  $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
  com.marklogic.mapreduce.examples.HelloWorld \
  -libjars $LIBJARS marklogic-hello-world.xml

To view the results, use Query Console to explore the output database. The sample creates HelloWorld.txt. If you use the input data from Configuring the Job, HelloWorld.txt should contain the phrase hello world.
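
For example, the following Query Console query retrieves the output document, assuming the output document URI is HelloWorld.txt as described above:

xquery version "1.0-ml";
fn:doc("HelloWorld.txt")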

LinkCountInDoc

This example calculates reference counts for each document in a set of Wikipedia-based documents, and then stores the reference count for each document as a new <ref-count> child node of the document. The example uses MarkLogic Server for input and output.

Before running the sample, follow the instructions in Additional Sample Data Setup.

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-nodein-nodeout.xml

Use the following command to run the example job, with a suitable substitution for $CONNECTOR_HOME and the connector version:

hadoop jar \
  $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
  com.marklogic.mapreduce.examples.LinkCountInDoc \
  -libjars $LIBJARS marklogic-nodein-nodeout.xml

The intra-collection reference counts are stored as new <ref-count> elements under the root of each document. Run the following XQuery in Query Console against the hadoop-samples database to see a list of documents and their ref-count values:

xquery version "1.0-ml";

for $ref in //ref-count
return fn:concat(xdmp:node-uri($ref)," ",$ref/text())

You should see results similar to the following:

enwiki/Ayn Rand 1
enwiki/List of characters in Atlas Shrugged 4
enwiki/Academy Award for Best Art Direction 1
enwiki/Academy Award 2
enwiki/Aristotle 5

LinkCountInProperty

This example calculates reference counts for each document in a set of Wikipedia-based documents, and then stores the reference count for each document as a property of the document. The example uses MarkLogic Server for input and output.

Before running the sample, follow the instructions in Additional Sample Data Setup.

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-textin-propout.xml

Use the following command to run the example job, with a suitable substitution for $CONNECTOR_HOME and the connector version:

hadoop jar \
  $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
  com.marklogic.mapreduce.examples.LinkCountInProperty \
  -libjars $LIBJARS marklogic-textin-propout.xml

The intra-collection reference counts are stored as new <ref-count> property elements of each document. Run the following query in Query Console against the hadoop-samples database to see a list of documents and their reference counts:

xquery version "1.0-ml";

for $ref in xdmp:document-properties()//ref-count
return fn:concat(xdmp:node-uri($ref)," ",$ref/text())

You should see results similar to the following:

enwiki/Ayn Rand 1
enwiki/List of characters in Atlas Shrugged 4
enwiki/Academy Award for Best Art Direction 1
enwiki/Academy Award 2
enwiki/Aristotle 5

LinkCountValue

This example calculates reference counts for each document in a set of Wikipedia-based documents, and then stores the reference counts in HDFS. The example uses MarkLogic Server for input and HDFS for output.

Before running the sample, follow the instructions in Additional Sample Data Setup.

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-textin-textout.xml

Use the following command to run the example job, with suitable substitutions for $CONNECTOR_HOME, HDFS_OUTPUT_DIR, and the connector version. The HDFS_OUTPUT_DIR must not already exist:

hadoop jar \
  $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
  com.marklogic.mapreduce.examples.LinkCountValue \
  -libjars $LIBJARS marklogic-textin-textout.xml HDFS_OUTPUT_DIR

To view the results, examine the files in HDFS_OUTPUT_DIR, as described in Interacting with HDFS. For example, if you use /home/you/lcv for HDFS_OUTPUT_DIR:

$ hdfs dfs -ls /home/you/lcv
... part-r-00000
$ hdfs dfs -cat /home/you/lcv/part-r-00000 | grep ^Aristotle
Aristotle    5

Each topic title is followed by a reference count. The raw output differs from the results for the LinkCountInDoc and LinkCountInProperty examples because LinkCountValue generates counts for all references, rather than only for documents in the database.

LinkCount

This example calculates reference counts for each document in a set of Wikipedia-based documents, and then stores the reference counts in HDFS. The example uses MarkLogic Server for input and HDFS for output. This example is the same as the LinkCountValue example, but it uses advanced input mode instead of basic input mode.

Before running the sample, follow the instructions in Additional Sample Data Setup.

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-advanced.xml

Use the following command to run the example job, with suitable substitutions for $CONNECTOR_HOME, the HDFS_OUTPUT_DIR, and the connector version:

hadoop jar \
  $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
  com.marklogic.mapreduce.examples.LinkCount \
  -libjars $LIBJARS marklogic-advanced.xml HDFS_OUTPUT_DIR

To view the results, examine the files in HDFS_OUTPUT_DIR, as described in Interacting with HDFS. For example, if you use /home/you/lc for HDFS_OUTPUT_DIR:

$ hdfs dfs -ls /home/you/lc
... part-r-00000
$ hdfs dfs -cat /home/you/lc/part-r-00000 | grep ^Aristotle
Aristotle    5

Each topic title is followed by a reference count. The raw output differs from the results for the LinkCountInDoc and LinkCountInProperty examples because LinkCount generates counts for all references, rather than only for documents in the database.

For details on advanced input mode, see Advanced Input Mode.

LinkCountCooccurrences

This example calculates reference counts for each document in a set of Wikipedia-based documents by using an element attribute lexicon, and then stores the reference counts in HDFS. The example uses MarkLogic Server for input and HDFS for output.

The sample uses com.marklogic.mapreduce.functions.ElemAttrValueCooccurrences (a wrapper around cts:element-attribute-value-co-occurrences) to find all href attributes that occur along with title attributes inside anchor tags. The attribute range indexes created in Additional Sample Data Setup support this operation. The map input key-value pairs are (Text, Text) pairs where the key is the href and the value is the title.

Before running the sample, follow the instructions in Additional Sample Data Setup.

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-lexicon.xml

Use the following command to run the example job, with suitable substitutions for $CONNECTOR_HOME, HDFS_OUTPUT_DIR, and the connector version:

hadoop jar \
  $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
  com.marklogic.mapreduce.examples.LinkCountCooccurrences \
  -libjars $LIBJARS marklogic-lexicon.xml HDFS_OUTPUT_DIR

To view the results, examine the files in HDFS_OUTPUT_DIR, as described in Interacting with HDFS. For example, if you use /home/you/lcco for HDFS_OUTPUT_DIR:

$ hdfs dfs -ls /home/you/lcco
... part-r-00000
$ hdfs dfs -cat /home/you/lcco/part-r-00000 | grep ^Aristotle
Aristotle    5

Each topic title is followed by a reference count. The raw output differs from the results for the LinkCountInDoc and LinkCountInProperty examples because LinkCountCooccurrences generates counts for all references, rather than only for documents in the database.

RevisionGrouper

This sample application demonstrates using KeyValueOutputFormat and a custom output query. The sample places each document in a set of Wikipedia articles into a collection based on the year the article was last revised. The sample uses MarkLogic Server for input and output. The job has no reduce phase.

Before running the sample, follow the instructions in Additional Sample Data Setup.

The map function receives input key-value pairs whose values are the revision timestamp nodes matching the XPath expression fn:collection()//wp:revision/wp:timestamp, as specified by the expression in mapreduce.marklogic.input.subdocumentexpr. These nodes are of the form:

<timestamp>2007-09-28T08:07:26Z</timestamp>

The map function picks the year (2007) off the timestamp and generates output key-value pairs where the key is the document URI as a string and the value is the year as a string. Each pair is then passed to the output query defined in mapreduce.marklogic.output.query, which adds the document named in the key to a collection named after the year.
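
The actual output query is defined in the shipped marklogic-nodein-qryout.xml. As a rough sketch only, the core of such a query could look like the following; the external variable declarations here are placeholders for whatever the connector and the shipped configuration file actually declare:

xquery version "1.0-ml";
(: Sketch only; see marklogic-nodein-qryout.xml for the real output query
   and the exact external variable declarations used by the connector. :)
declare variable $key as xs:string external;   (: document URI from the map output key :)
declare variable $value as xs:string external; (: revision year from the map output value :)
xdmp:document-add-collections($key, $value)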

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-nodein-qryout.xml

Use the following command to run the example job, with a suitable substitution for $CONNECTOR_HOME and the connector version:

hadoop jar \
  $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
  com.marklogic.mapreduce.examples.RevisionGrouper \
  -libjars $LIBJARS marklogic-nodein-qryout.xml

To view the results, use Query Console to explore the hadoop-samples database. You should see the documents are now in collections based on the year in which they were revised. Alternatively, run a query similar to the following to see a list of documents in the collection for the year 2009:

xquery version "1.0-ml";
for $d in fn:collection("2009")
return xdmp:node-uri($d)

BinaryReader

This sample application demonstrates using the mapreduce.marklogic.output.bindsplitrange configuration property with advanced input mode. The sample extracts the first 1K bytes from each (binary) document in a database and saves the result in HDFS. The sample uses MarkLogic Server for input and HDFS for output. The sample has no reduce phase.

The input query defined in marklogic-subbinary.xml uses the splitstart and splitend external variables provided by the MarkLogic Connector for Hadoop to optimize input query performance. For details on this feature, see Optimizing Your Input Query.

The sample requires a database containing one or more binary documents, and an XDBC App Server. Since the sample assumes all documents in the database are binary documents, you should not use the database and content set up in Additional Sample Data Setup.

Follow these steps to set up the BinaryReader sample application:

  1. Create a MarkLogic Server database to hold the input data.
  2. Create an XDBC App Server and attach it to the database created in Step 1. You may use any root.
  3. Edit the configuration file marklogic-subbinary.xml to configure the job to use the App Server created in Step 2. For details, see Modify the Sample Configuration Files.
  4. Run the sample using the following command, substituting an appropriate value for HDFS_OUTPUT_DIR and the connector version:
    hadoop jar \
      $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
      com.marklogic.mapreduce.examples.BinaryReader \
      -libjars $LIBJARS marklogic-subbinary.xml HDFS_OUTPUT_DIR

If using standalone Hadoop, view the results in HDFS_OUTPUT_DIR. If using pseudo-distributed or fully-distributed Hadoop, view the results using the Hadoop hdfs command. For example:

$ hdfs dfs -ls HDFS_OUTPUT_DIR

ContentReader

This sample application writes documents in a MarkLogic Server database to the HDFS file system, using an SSL-enabled connection to MarkLogic Server. The input database is the database associated with your XDBC App Server. This sample uses MarkLogic Server for input and HDFS for output.

This sample copies the entire contents of the database to HDFS. Choose a target database accordingly, or modify the sample's configuration file to limit the selected documents.
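
For example, one way to limit the input is to add a document selector to the configuration file. The property shown below is part of the connector's basic input mode configuration; the collection name is a placeholder, so check the connector's configuration property reference before relying on this exact snippet:

<property>
  <name>mapreduce.marklogic.input.documentselector</name>
  <value>fn:collection("my-input-collection")</value>
</property>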

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-docin-textout.xml

Before running the sample, configure your XDBC App Server and Java environment to use SSL. For details, see Making a Secure Connection to MarkLogic Server with SSL. You might need to import the MarkLogic Server self-signed certificate into your JRE default keystore using the Java keytool utility. See the Java documentation for details on adding certificates to the default keystore.

Run the sample as follows, substituting appropriate values for $CONNECTOR_HOME, HDFS_OUTPUT_DIR, and the connector version. The output directory must not already exist:

hadoop jar \
  $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
  com.marklogic.mapreduce.examples.ContentReader \
  -libjars $LIBJARS marklogic-docin-textout.xml HDFS_OUTPUT_DIR

If using standalone Hadoop, view the results in ~/HDFS_OUTPUT_DIR. If using pseudo-distributed or fully-distributed Hadoop, view the results using the Hadoop hdfs command. For example:

$ hdfs dfs -ls HDFS_OUTPUT_DIR
$ hdfs dfs -tail HDFS_OUTPUT_DIR/part-m-00000

ContentLoader

This sample application loads files in an HDFS directory into a MarkLogic Server database as documents. The destination database is the database associated with your XDBC App Server. The sample uses HDFS for input and MarkLogic Server for output. This sample is a map-only job. That is, there is no reduce step.

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-textin-docout.xml

Follow these steps to run the ContentLoader sample application:

  1. Select the text, XML, or binary files to be loaded into the database.
  2. If using Hadoop standalone, create an input directory in your home directory to hold the input files. For example:
    $ mkdir ~/input_dir
  3. If using Hadoop pseudo- or fully-distributed, create an input directory in HDFS to hold the input files. For example:
    $ hdfs dfs -mkdir input_dir
  4. Copy the input files into the input directory created in Step 2 or Step 3. For example:
    $ cp your_input_files ~/input_dir              # standalone
    $ hdfs dfs -put your_input_files input_dir   # distributed
  5. If your input content is not XML, edit marklogic-textin-docout.xml to set the output content type. For example, to load binary files, add:
    <property>
      <name>mapreduce.marklogic.output.content.type</name>
      <value>binary</value>
    </property>
  6. Run the sample application, substituting appropriate paths for $CONNECTOR_HOME, input_dir, and the connector version:
    hadoop jar \
      $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
      com.marklogic.mapreduce.examples.ContentLoader \
      -libjars $LIBJARS marklogic-textin-docout.xml input_dir
  7. Using Query Console, explore the database associated with your XDBC App Server and observe that the input files appear as documents in the database.

The document URIs in the database correspond to the HDFS path. For example, if one of the input documents is located in HDFS on samples.marklogic.com as /user/guest/data/file1.xml, then the document URI in the database is:

hdfs://samples.marklogic.com/user/guest/data/file1.xml
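
You can confirm that a particular file loaded by fetching it by URI in Query Console. For example, using the hypothetical URI above:

xquery version "1.0-ml";
fn:doc("hdfs://samples.marklogic.com/user/guest/data/file1.xml")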

If you receive an error similar to the following, then you must change the directory creation database configuration setting to manual.

java.lang.IllegalStateException: Manual directory creation mode is required.

ZipContentLoader

This sample application loads the contents of zip files in an HDFS directory into a MarkLogic Server database as documents. The destination database is the database associated with your XDBC App Server. The sample uses HDFS for input and MarkLogic Server for output. This sample is a map-only job. That is, there is no reduce step.

This example uses the following configuration file. You should have a copy of this config file in your working directory, modified as described in Modify the Sample Configuration Files.

marklogic-textin-docout.xml

Follow these steps to run the ZipContentLoader sample application:

  1. Create one or more zip files containing text, XML, or binary files.
  2. If using Hadoop standalone, create an input directory in your home directory to hold the zip input files. For example:
    $ mkdir ~/zip_input_dir
  3. If using Hadoop pseudo- or fully-distributed, create an input directory in HDFS to hold the input files. For example:
    $ hdfs dfs -mkdir zip_input_dir
  4. Copy the input zip files into the input directory created in Step 2 or Step 3. For example:
    $ cp your_input_files ~/zip_input_dir           # standalone
    $ hdfs dfs -put your_data.zip zip_input_dir   # distributed
  5. If your zip file content is not XML, set the output content type in marklogic-textin-docout.xml. For example, to load binary files, add:
    <property>
      <name>mapreduce.marklogic.output.content.type</name>
      <value>binary</value>
    </property>
  6. Run the sample application, substituting appropriate paths for $CONNECTOR_HOME, zip_input_dir, and the connector version:
    hadoop jar \
      $CONNECTOR_HOME/lib/marklogic-mapreduce-examples-version.jar \
      com.marklogic.mapreduce.examples.ZipContentLoader \
      -libjars $LIBJARS marklogic-textin-docout.xml zip_input_dir
  7. Using Query Console, explore the database associated with your XDBC App Server and observe that the zip file contents appear as documents in the database.

The document URIs in the database correspond to the paths within the zip file. For example, if the zip file contents are rooted at a folder named enwiki, and that folder contains a file named Wisconsin, then the resulting document URI is:

enwiki/Wisconsin

If you receive an error similar to the following, use the Admin Interface to change the directory creation database configuration setting to manual:

java.lang.IllegalStateException: Manual directory creation mode is required.