MarkLogic 9 Product Documentation
mlcp User Guide
— Chapter 3

Getting Started With mlcp

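This chapter is a short introduction to mlcp in which you import documents into a database and then export them back out as files. It covers the following topics:

  Prepare to Run the Examples
  Optional: Create an Options File
  Load Documents
  Export Documents
  Understanding mlcp Output
  Stopping an mlcp Job Prematurely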

Prepare to Run the Examples

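This section leads you through creating a work area and sample data. When you have finished the examples in this chapter, the work area has the following file system layout. You create the import directory in this section; mlcp creates the export directory later, during the export step.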

gs/
  import/
    one.xml
    two.json
  export/

Follow this procedure to set up the example work area:

  1. Download and install mlcp according to the instructions in Installation and Configuration.
  2. Ensure the mlcp bin directory and the java command are on your path. For example, the following command adds the mlcp bin directory to your path if mlcp is installed in MLCP_INSTALL_DIR:
    Linux: export PATH=${PATH}:MLCP_INSTALL_DIR/bin
    Windows: set PATH=%PATH%;MLCP_INSTALL_DIR\bin
  3. Create a directory to serve as your work area and change directories to this work area. For example:
    mkdir gs
    cd gs
  4. Create a sub-directory to hold the sample input data. Do not create the export directory; it must not exist when you run the export example. For example:
    mkdir import
  5. Create the sample input files in the import/ directory.
    1. Use the following commands on Linux:
      echo '<data>1</data>' > import/one.xml
      echo '{"two": 2}' > import/two.json
    2. Use the following commands on Windows:
      echo ^<data^>1^</data^> > import\one.xml
      echo {"two":2} > import\two.json
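On Linux, steps 3 through 5 can also be run in one pass. The following is a minimal sketch, not part of the mlcp distribution; WORK_ROOT is a hypothetical variable introduced here to choose where the gs directory is created, and it defaults to the current directory.

```shell
#!/bin/sh
# Sketch: build the example work area in one step (Linux shells).
set -eu

# WORK_ROOT is a hypothetical variable; it defaults to the current directory.
WORK_ROOT="${WORK_ROOT:-.}"

# Create gs/ and gs/import/ together. The export directory is deliberately
# not created, because it must not exist before the export example runs.
mkdir -p "$WORK_ROOT/gs/import"

# Write the two sample input files.
printf '%s\n' '<data>1</data>' > "$WORK_ROOT/gs/import/one.xml"
printf '%s\n' '{"two": 2}'     > "$WORK_ROOT/gs/import/two.json"

# List what was created so you can compare it with the layout above.
find "$WORK_ROOT/gs" | sort
```

After the script runs, change to the gs directory before continuing with the remaining examples.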

Optional: Create an Options File

You can encapsulate mlcp command line options in an options file; for details, see Options File Syntax. An options file makes it easy to re-use commonly used options. It can also help you avoid problems caused by the shell interpreting quotes on the command line.

The examples use an options file to save MarkLogic connection related options so that you can easily re-use them across multiple commands. This section describes how to create this file.

If you prefer to pass the connection options directly on the command line instead, add -username, -password, -host, and possibly -port options to the example mlcp commands in place of -options_file.
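As an illustration of the inline form, the following sketch assembles an import command that uses -username, -password, -host, and -port in place of -options_file. The credential values are placeholders, and the sketch only prints the command so that you can review it before running anything.

```shell
# Sketch: an import command with inline connection options instead of
# -options_file. All values are placeholders; adjust them to your environment.
set -- mlcp.sh import \
  -username your_username -password your_password \
  -host localhost -port 8000 \
  -input_file_path import

# Print the assembled command for review. To actually run it, replace this
# line with: "$@"
echo "$*"
```

Keep in mind that a password passed on the command line may be visible in your shell history.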

Use the following procedure to create the example options file.

  1. If you are not already at the top level of your work area (the gs folder created in Prepare to Run the Examples), change directory to this location. For example:
    cd gs
  2. Create a file named conn.txt with the following contents. Each line is either an option name or a value for the preceding option.
    -username
    your_username
    -password
    your_password
    -host
    localhost
    -port
    8000
  3. Edit conn.txt and modify the values of the -username and -password options to match your environment.
  4. Optionally, modify the -host and/or -port option values. The host and port must identify a MarkLogic Server App Server that supports the XDBC protocol. MarkLogic Server comes with an App Server pre-configured on port 8000 that supports XDBC, attached to the Documents database. You can choose a different App Server.

You should now have the following file structure:

gs/
  conn.txt
  import/
    one.xml
    two.json
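On Linux, you can also create conn.txt from the shell. The sketch below writes the same placeholder contents and then checks that the file alternates between option names and values. The check is an assumption that fits this particular file only; it is not a general validator for mlcp options files.

```shell
# Sketch: write conn.txt and confirm each option name is followed by a value.
# The credentials are the placeholders from the procedure, not real values.
cat > conn.txt <<'EOF'
-username
your_username
-password
your_password
-host
localhost
-port
8000
EOF

# In this file, odd-numbered lines should be option names (leading "-")
# and even-numbered lines should be their values.
awk 'NR % 2 == 1 && $0 !~ /^-/ {
       printf "line %d: expected an option name, got \"%s\"\n", NR, $0
       bad = 1
     }
     END {
       if (NR % 2 == 1) { print "last option has no value"; bad = 1 }
       exit bad + 0
     }' conn.txt && echo "conn.txt looks well-formed"
```

Remember to replace your_username and your_password before you use the file.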

Load Documents

Load documents into a MarkLogic Server database using the mlcp import command. The examples in this section load documents from flat files into the default database associated with the App Server on port 8000 (the Documents database).

Other input options include compressed files, delimited text files, aggregate XML data, line-delimited JSON data, and Hadoop sequence files; for details, see Importing Content Into MarkLogic Server. You can also load documents into a different database using the -database option.

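Specify the input location as the value of -input_file_path. To load a single file, specify the path to the file. To load all the files in a directory, as the examples in this chapter do, specify the path to the directory. For example: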

-input_file_path import

When you load documents, a default URI is generated based on the type of input data. For details, see Controlling Database URIs During Ingestion.

We will import documents from flat files, so the default URI is the absolute pathname of the input file. For example, if your work area is /space/gs on Linux or C:\gs on Windows, then the default URI when you import documents from gs/import is as follows:

Linux: /space/gs/import/filename
Windows: /c:/gs/import/filename

You can use the -output_uri_replace option to strip off the portion of the URI that comes from the path steps before gs. The option argument is of the form pattern,replacement_text. For example, given the default URIs shown above, we'll add the following option to create URIs that begin with /gs:

Linux: -output_uri_replace "/space,''"
Windows: -output_uri_replace "/c:,''"
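To preview the URIs that result from stripping the prefix, you can imitate the substitution in the shell. This is only a sketch: strip_prefix is a hypothetical helper that removes a literal leading prefix, which matches the two examples above but does not reproduce how mlcp itself evaluates the pattern.

```shell
# Sketch: preview how removing a leading path prefix changes a default URI.
# strip_prefix is a hypothetical helper, not part of mlcp.
strip_prefix() {
  # $1 = default URI, $2 = literal prefix to remove from the start of the URI
  printf '%s\n' "${1#"$2"}"
}

strip_prefix /space/gs/import/one.xml /space   # Linux example
strip_prefix /c:/gs/import/two.json /c:        # Windows example
```

Both calls print a URI that begins with /gs/import/, which is the form used in the rest of this chapter.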

Run the following command from the root of your work area (gs) to load all the files in the import directory. Modify the argument to -output_uri_replace to match your environment.

Linux: 
  mlcp.sh import -options_file conn.txt \
    -output_uri_replace "/space,''" -input_file_path import

Windows:
  mlcp.bat import -options_file conn.txt ^
    -output_uri_replace "/c:,''" -input_file_path import

The output from mlcp should look similar to the following (but with a timestamp prefix on each line). The OUTPUT_RECORDS_COMMITTED: 2 line indicates that mlcp loaded two files. For more details, see Understanding mlcp Output.

INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of 
  the  inserted documents will be determined by the MIME  type specification 
  configured on MarkLogic Server.
INFO input.FileInputFormat: Total input paths to process : 2
INFO contentpump.LocalJobRunner:  completed 100%
INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
INFO contentpump.LocalJobRunner: Total execution time: 0 sec

Optionally, use Query Console's Explore feature to examine the contents of the Documents database and see that the documents were created. You should see documents with the following URIs:

/gs/import/one.xml
/gs/import/two.json

You can also create documents from files in a compressed file and from other types of input archives. For details, see Importing Content Into MarkLogic Server.

Export Documents

Use the mlcp export command to export documents from a MarkLogic Server database into files on your filesystem. You can export documents to several formats, including files, compressed files, and database archives. For details, see Exporting Content from MarkLogic Server.

You can identify the documents to export in several ways, including by URI, by directory, by collection, and by XPath expression. This example uses a directory filter. Recall that the input documents were loaded with URIs of the form /gs/import/filename. Therefore we can easily extract the files by database directory using -directory_filter /gs/import/.

This example exports documents from the default database associated with the App Server on port 8000. Use the -database option to export documents from a different database.

Use the following procedure to export the documents inserted in Load Documents.

  1. If you are not already at the top level of your work area (the gs folder created in Prepare to Run the Examples), change directory to this location. For example:
    cd gs
  2. Extract the previously inserted documents into a directory named export. The export directory must not already exist.
    Linux:
      mlcp.sh export -options_file conn.txt -output_file_path export \
        -directory_filter /gs/import/
    
    Windows:
      mlcp.bat export -options_file conn.txt -output_file_path export ^
        -directory_filter /gs/import/

You should see output similar to the following, but with a timestamp prefix on each line. The OUTPUT_RECORDS: 2 line indicates mlcp exported 2 files.

INFO mapreduce.MarkLogicInputFormat: Fetched 1 forest splits.
INFO mapreduce.MarkLogicInputFormat: Made 1 splits.
INFO contentpump.LocalJobRunner:  completed 100%
INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: Total execution time: 0 sec

The exported documents are in gs/export. A filesystem directory is created for each directory step in the original document URI. Therefore, you should now have the following directory structure:

gs/
  export/
    gs/
      import/
        one.xml
        two.json
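To confirm the result on Linux, you can check that each input file has a counterpart in the export tree. The following sketch uses a hypothetical helper, check_export, that compares file names only. It does not compare contents, because this chapter does not state that exported files are byte-for-byte identical to the original inputs.

```shell
# Sketch: confirm that every file under import/ has a counterpart under
# export/gs/import/. check_export is a hypothetical helper, not part of mlcp.
check_export() {
  src=$1; dst=$2; missing=0
  for f in "$src"/*; do
    [ -f "$f" ] || continue          # skip sub-directories and empty globs
    name=$(basename "$f")
    if [ -f "$dst/$name" ]; then
      echo "ok: $name"
    else
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"                  # non-zero if any file is missing
}

# Run from the root of the work area (gs) after the export completes.
if [ -d import ] && [ -d export/gs/import ]; then
  check_export import export/gs/import
fi
```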

Understanding mlcp Output

The output from mlcp varies depending on the operation (import, export, copy, extract), but usually looks similar to the following (with a timestamp prefix on each line). The following example is output from an import job.

INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of 
  the  inserted documents will be determined by the MIME  type specification 
  configured on MarkLogic Server.
INFO input.FileInputFormat: Total input paths to process : 2
INFO contentpump.LocalJobRunner:  completed 100%
INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
INFO contentpump.LocalJobRunner: Total execution time: 0 sec

The following table summarizes the purpose of key pieces of information reported by mlcp:

Message
  Description

Content type is set to format X.
  Import only. This indicates the type of documents mlcp will create. The default is MIXED, which means mlcp will base the type on the input file suffix. For details, see How mlcp Determines Document Type.

Total input paths to process : N
  Import only. Found N candidate input sources. If this number is 0, then the pathname you supplied to -input_file_path does not contain any data that meets your import criteria. If you're unable to diagnose the cause, refer to Troubleshooting.

INPUT_RECORDS: N
  The number of inputs mlcp actually tried to process. For an import operation, this is the number of documents mlcp attempted to create. For an export operation, this is the number of documents mlcp attempted to export. If there are errors, this number may not correspond to the actual number of documents imported, exported, copied, or extracted.

  This number can be larger or smaller than the total input paths. For example, if you import from a compressed file that includes directories, the directories count towards total input paths, but mlcp only attempts to create documents from the file entries, so total input paths is larger than INPUT_RECORDS.

  Similarly, if you're loading aggregate XML files and splitting them into multiple documents, then total input paths reflects the number of aggregate files, while INPUT_RECORDS reflects the number of documents created from the aggregates, so total input paths is smaller than INPUT_RECORDS.

ESTIMATED_INPUT_RECORDS: N
  Export and copy only. The estimated number of input records, based on job parameters such as -document_selector and -input_query. This number will be larger than INPUT_RECORDS if errors occur while fetching documents from MarkLogic or when the database is configured to use fragment roots. For example, if the source database contains N documents matching the job parameters, but a host in the cluster becomes unavailable during the job, then the actual number of documents mlcp attempts to process can be some M < N. In such a case, ESTIMATED_INPUT_RECORDS reflects N, while INPUT_RECORDS reflects M.

OUTPUT_RECORDS: N
  On import, the number of documents (records) sent to MarkLogic for insertion into the database. This number can be smaller than INPUT_RECORDS if errors are detected on the client that cause a record to be skipped.

  On export, the number of output files mlcp successfully created.

OUTPUT_RECORDS_COMMITTED: N
  Import only. The number of documents committed to the database. This number can be larger or smaller than OUTPUT_RECORDS. For example, it will be smaller if an error is detected on MarkLogic Server or larger if a server-side transformation creates multiple documents from a single input document.

OUTPUT_RECORDS_FAILED: N
  Import only. The number of documents (records) rejected by MarkLogic Server. This number does not include failures detected by mlcp on the client.
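If you save the output of an import job to a file, you can extract the counters described above with standard tools. The following is a sketch under stated assumptions: mlcp.log is a hypothetical file name, its contents are copied from the sample output in this chapter, and counter is a helper defined here, not an mlcp feature.

```shell
# Sketch: pull the counters out of saved mlcp output and flag rejected records.
# mlcp.log is a hypothetical file name; the lines come from the sample output.
cat > mlcp.log <<'EOF'
INFO input.FileInputFormat: Total input paths to process : 2
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
EOF

# counter NAME FILE prints the last value reported for NAME.
# Fields are matched from the end of the line, so a timestamp prefix is harmless.
counter() {
  awk -v name="$1" 'NF >= 2 && $(NF-1) == (name ":") { value = $NF }
                    END { print value }' "$2"
}

input=$(counter INPUT_RECORDS mlcp.log)
committed=$(counter OUTPUT_RECORDS_COMMITTED mlcp.log)
failed=$(counter OUTPUT_RECORDS_FAILED mlcp.log)

echo "input=$input committed=$committed failed=$failed"
if [ "$failed" -gt 0 ]; then
  echo "MarkLogic Server rejected $failed record(s)"
fi
```

Because OUTPUT_RECORDS_COMMITTED can legitimately differ from OUTPUT_RECORDS, the sketch treats only a non-zero OUTPUT_RECORDS_FAILED as a sign of trouble.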

Stopping an mlcp Job Prematurely

Note that if you stop a job prematurely, some work might continue.

When you use mlcp in distributed mode, mlcp distributes its work across a Hadoop cluster. Interrupting the local mlcp client does not cause work to stop on the Hadoop cluster. In local mode, an interrupted job shuts down gracefully as long as it can finish within 30 seconds. If that time period expires, mlcp prints a warning.
