Apache Hadoop only supports the Oracle/Sun JRE, though other JREs may work. For details, see http://wiki.apache.org/hadoop/HadoopJavaVersions.
mlcp-version, where version is the mlcp version. For example, assuming
/space/marklogic contains the zip file for mlcp version 1.3, then the following commands install mlcp under
java command on your path. For example:
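A minimal sketch, assuming a POSIX shell, of checking whether the java command is reachable through your path. The suggestion to use JAVA_HOME is an assumption about where your JRE is installed, not something mlcp requires.

```shell
# Look up java on the current PATH without failing if it is absent.
JAVA_BIN="$(command -v java || true)"

if [ -n "$JAVA_BIN" ]; then
  echo "java found at: $JAVA_BIN"
else
  # Assumption: JAVA_HOME points at your JRE installation.
  echo "java not found; try: export PATH=\"\$PATH:\$JAVA_HOME/bin\""
fi
```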
You might need to configure your MarkLogic cluster before using mlcp for the first time. For details, see Configuring Your MarkLogic Cluster.
When you use mlcp with MarkLogic 8 or later on the default port (8000), no special cluster configuration is necessary. Port 8000 includes a pre-configured XDBC App Server. The default database associated with port 8000 is the Documents database. To use mlcp with a different database and port 8000, use the
-output_database options. For example:
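As an illustration, the following sketch builds an mlcp copy command that writes to a non-default database through port 8000 by naming it with -output_database. The host names, credentials, and database name are placeholders, and the command is only printed so you can review it before running it yourself.

```shell
# Dry run: build the command as a string and print it for review.
# Host names, credentials, and the database name are placeholders.
MLCP_CMD="mlcp.sh copy -mode local \
-input_host src-host -input_port 8000 \
-input_username user -input_password passwd \
-output_host dst-host -output_port 8000 \
-output_username user -output_password passwd \
-output_database mydatabase"

echo "$MLCP_CMD"
```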
When using MarkLogic 7 or earlier on any port, or when using MarkLogic 8 or later with a port other than 8000, you must make an XDBC App Server available to each host that has at least one forest attached to the database(s) used by your job. Hosts within a group share the same App Server configuration, but hosts in different groups do not.
Therefore, if all your forest hosts are in a single group, you only need to configure one XDBC App Server. If your forests are on hosts in multiple groups, then you must configure an XDBC App Server listening on the same port in each group.
For example, the cluster shown below is properly configured to use Database A as an mlcp input or output source. Database A has 3 forests, located on 3 hosts in 2 different groups. Therefore, both Group 1 and Group 2 must make Database A accessible via XDBC on port 9001.
If you use MarkLogic 8 or later and port 8000 instead of port 9001, then you would not need to explicitly create any XDBC App Servers to support the above database configuration because both groups automatically have an XDBC App Server on port 8000. You might need to explicitly specify the database name (Database A) in your mlcp command, though, if it is not the default database associated with port 8000.
When you use mlcp, you supply the name of one or more users with which to interact with MarkLogic Server. If a user does not have admin privileges, then that user must have at least the privileges listed in the table below.
Additional privileges may be required. These roles only enable use of MarkLogic Server as a data source or destination. For example, these roles do not grant read or update permissions to the database.
|import||Applies to the user name specified with |
|export||Applies to the user name specified with |
By default, mlcp requires a username and password to be included in the command line options for each job. You can avoid passing a cleartext password between your mlcp client host and MarkLogic Server by using Kerberos for authentication. For details, see Using mlcp With Kerberos.
Distributed mode enables mlcp to distribute its workload across a Hadoop cluster. Using mlcp in distributed mode requires a Hadoop installation. For information on supported versions, see Required Software.
Some versions of Hadoop and HDFS have problems with pathnames that contain spaces, so it is recommended that you do not use mlcp in distributed mode with input or output file pathnames that contain whitespace.
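One way to guard against this is to check a pathname for whitespace before starting a distributed job. The sketch below is a plain POSIX shell check, not an mlcp feature; the variable names and the sample path are hypothetical.

```shell
# Hypothetical input path; replace with the path you plan to pass to mlcp.
INPUT_PATH="/space/data/my files"

# Flag the path if it contains any whitespace character.
case "$INPUT_PATH" in
  *[[:space:]]*) PATH_OK=no ;;
  *)             PATH_OK=yes ;;
esac

if [ "$PATH_OK" = no ]; then
  echo "Path contains whitespace; avoid it in distributed mode: $INPUT_PATH"
fi
```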
You must tell mlcp where to find the Hadoop configuration files on the host where you run mlcp. Hadoop does not need to be installed on this host, but the Hadoop configuration files must be reachable.
-hadoop_conf_dir. For example:
HADOOP_CONF_DIR. For example:
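The sketch below shows both approaches side by side: passing the directory with -hadoop_conf_dir, and setting the HADOOP_CONF_DIR environment variable instead. The configuration directory, host, credentials, and input path are assumptions chosen for illustration, and the mlcp commands are printed rather than executed.

```shell
# Assumption: the Hadoop configuration files live in this directory.
HADOOP_CONF="/etc/hadoop/conf"

# Placeholder connection details and input path.
BASE_CMD="mlcp.sh import -host localhost -port 8000 -username user -password passwd -input_file_path /space/data -mode distributed"

# Approach 1: name the directory on the mlcp command line.
CMD_WITH_OPTION="$BASE_CMD -hadoop_conf_dir $HADOOP_CONF"

# Approach 2: export the environment variable and omit the option.
export HADOOP_CONF_DIR="$HADOOP_CONF"
CMD_WITH_ENV="$BASE_CMD"

echo "$CMD_WITH_OPTION"
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR $CMD_WITH_ENV"
```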
Use the following mlcp options to pass Hadoop-specific configuration information through mlcp to your Hadoop installation. You can use these options to control mlcp's use of Hadoop in distributed mode.
-conf conf_filename : Pass in a Hadoop configuration properties file.
=value : Pass one Hadoop configuration property setting.
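For illustration, the following sketch combines a configuration file passed with -conf and a single property passed in the Hadoop style -D property=value. The file name, the property and its value, and the connection details are all placeholders. The Hadoop options are placed directly after the command name here; consult the mlcp option reference for any ordering requirements. The command is printed, not executed.

```shell
# Placeholders: configuration file name and a sample Hadoop property.
HADOOP_CONF_FILE="my-hadoop-conf.xml"
HADOOP_PROP="mapreduce.job.name=mlcp-import"

# Dry run: assemble the command with the Hadoop options first.
MLCP_CMD="mlcp.sh import -conf $HADOOP_CONF_FILE -D $HADOOP_PROP \
-host localhost -port 8000 -username user -password passwd \
-input_file_path /space/data -mode distributed"

echo "$MLCP_CMD"
```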
When you use distributed mode for import, the user your Hadoop tasks run as must have permission to access the directories or files specified by -input_file_path. Similarly, when you use distributed mode for export or extract, the user must have permission to create directories and files in the directory specified by
To use MapR as mlcp's Hadoop distribution, you must download the
-bin bundle instead of the standard mlcp bundle. For example, download mlcp-mapr-8.0-bin.zip from developer.marklogic.com.
-mapr/. Modify version to match your MapR version.
Before you can use Kerberos with mlcp, you must configure your MarkLogic installation to enable external security, as described in External Authentication (LDAP and Kerberos) in the Understanding and Using Security Guide.
Before you can use Kerberos for authentication, you must create at least one MarkLogic user with which mlcp can use Kerberos authentication to connect to MarkLogic Server, as described in Assigning an External Name to a User in the Understanding and Using Security Guide.
For example, if you're using mlcp to import documents into a database, then the user must have update privileges on the target database, as well as the minimum privileges required by mlcp. For details on the minimum privileges required by mlcp, see Security Considerations.
The mlcp tool communicates with MarkLogic through an XDBC App Server. Configure your XDBC App Server to use Kerberos for external security, as described in Configuring an App Server for External Authentication in the Understanding and Using Security Guide.
For example, if you create a configuration named 'kerb-conf', then configure your XDBC App Server with the following values for the 'authentication', 'internal security', and 'external security' configuration settings in the Admin Interface:
You can use an existing XDBC App Server or create a new one. To create a new XDBC App Server, use the Admin Interface, the Admin API, or the REST Management API. For details, see Procedures for Creating and Managing XDBC Servers in the Administrator's Guide.
Configure the App Server to use 'kerberos-ticket' authentication and the Kerberos external security configuration object you created following the instructions in Creating an External Authentication Configuration Object in the Understanding and Using Security Guide.
When you install MarkLogic, an XDBC App Server and other services are available on port 8000. Changing the security configuration for the App Server on port 8000 affects all the MarkLogic services available through this port, including the HTTP App Server and REST Client API instance.
kinit or a similar program on your mlcp client host to create and cache a Kerberos Ticket to Get Tickets (TGT) for a principal you assigned to a MarkLogic user.
-password option from the environment in which you cached the TGT.
For example, suppose you configured an XDBC App Server on port 9010 of host 'ml-host' to use 'kerberos-ticket' authentication. Further, suppose you associated the Kerberos principal name 'kuser' with the user 'mluser'. Then the following commands result in mlcp authenticating with Kerberos as user 'kuser', and importing documents into the database as 'mluser'.
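A sketch of the command sequence for this scenario follows. It uses the host, port, and principal from the example above; the input path is a placeholder. The mlcp command omits -password as described above, and this sketch also omits -username on the assumption that the identity comes from the cached ticket. Both commands are printed rather than executed, because kinit prompts for a password and mlcp needs a live server.

```shell
# Step 1: cache a TGT for the Kerberos principal (prompts for its password).
KINIT_CMD="kinit kuser"

# Step 2: run mlcp from the same environment, without -password.
# Assumption: /space/examples is a placeholder input directory.
MLCP_CMD="mlcp.sh import -host ml-host -port 9010 -input_file_path /space/examples -mode local"

# Dry run: print the commands in the order you would run them.
printf '%s\n' "$KINIT_CMD" "$MLCP_CMD"
```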