This chapter is a short introduction to mlcp in which you import documents into a database and then export them back out as files.
This section leads you through creating a work area and sample data with the following file system layout:
gs/
  import/
    one.xml
    two.json
  export/
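If you prefer to script the setup rather than type the commands by hand, the import side of the layout above could be created with a small shell function like the following. This is a minimal sketch: the function name make_work_area and the contents written to one.xml and two.json are placeholders chosen for illustration, not the sample data from this guide. It leaves out export/ on purpose, because the export step later in this chapter requires that directory not to exist yet.

```shell
# Sketch: create the import side of the example work area.
# The file contents are illustrative placeholders (assumptions),
# not the sample data from the original guide.
make_work_area() {
  root="${1:-gs}"                       # work area root; defaults to ./gs
  mkdir -p "$root/import" || return 1

  # Placeholder XML document.
  printf '%s\n' '<doc>placeholder XML content</doc>' > "$root/import/one.xml"

  # Placeholder JSON document.
  printf '%s\n' '{"key": "placeholder JSON content"}' > "$root/import/two.json"

  # export/ is deliberately not created here: the export example
  # expects the output directory to be absent.
}
```

Calling make_work_area with no argument creates gs/ under the current directory; passing a path creates the work area there instead.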
Follow this procedure to set up the example work area:
1. Ensure the mlcp bin directory and the java commands are on your path. For example, the following command places the mlcp bin directory on your path if mlcp is installed in MLCP_INSTALL_DIR:
   Linux: export PATH=${PATH}:MLCP_INSTALL_DIR/bin
   Windows: set PATH=%PATH%;MLCP_INSTALL_DIR\bin
2. Create a directory to serve as your work area and change to it:
   mkdir gs
   cd gs
3. Create a sub-directory to hold the sample input data:
   mkdir import
4. Create the sample input files, one.xml and two.json, in the import/ directory.
You can encapsulate mlcp command line options in an options file; for details, see Options File Syntax. An options file is convenient for reusing commonly used options. It can also help you avoid the shell's interpretation of quotes on the command line.
The examples use an options file to save MarkLogic connection-related options so that you can easily reuse them across multiple commands. This section describes how to create this file.
If you prefer to pass the connection options directly on the command line instead, add -username, -password, -host, and possibly -port options to the example mlcp commands in place of -options_file.
Use the following procedure to create the example options file.
1. If you are not already at the top level of your work area, change to the gs folder created in Prepare to Run the Examples. For example:
   cd gs
2. Create a file named conn.txt with the following contents. Each line is either an option name or a value for the preceding option.
   -username
   your_username
   -password
   your_password
   -host
   localhost
   -port
   8000
3. Edit conn.txt and modify the values of the -username and -password options to match your environment.
4. If necessary, modify the -host and/or -port option values. The host and port must identify a MarkLogic Server App Server that supports the XDBC protocol. MarkLogic Server comes with an App Server pre-configured on port 8000 that supports XDBC, attached to the Documents database. You can choose a different App Server.
You should now have the following file structure:
gs/
  conn.txt
  import/
    one.xml
    two.json
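The options file can also be written from the shell. The sketch below assumes the one-token-per-line layout described above and reuses the same placeholder credentials; the function name write_conn_file and the chmod step are additions made here for illustration, not part of mlcp.

```shell
# Sketch: write the connection options file, one option name or value per line.
# your_username / your_password are placeholders to replace afterwards.
write_conn_file() {
  conn_path="${1:-conn.txt}"            # target file; defaults to ./conn.txt
  cat > "$conn_path" <<'EOF'
-username
your_username
-password
your_password
-host
localhost
-port
8000
EOF
  # The file holds a password, so limit access to the current user
  # (a precaution added in this sketch).
  chmod 600 "$conn_path"
}
```

Run write_conn_file from the gs directory, then edit the placeholder values before using the file with -options_file.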
Load documents into a MarkLogic Server database using the mlcp import command. The examples in this section load documents from flat files into the default database associated with the App Server on port 8000 (the Documents database).
Other input options include compressed files, delimited text files, aggregate XML data, line-delimited JSON data, and Hadoop sequence files; for details, see Importing Content Into MarkLogic Server. You can also load documents into a different database using the -database option.
To load a single file, specify the path to the file as the value of -input_file_path. To load all the files in a directory, specify the path to the directory instead. For example:
-input_file_path import
When you load documents, a default URI is generated based on the type of input data. For details, see Controlling Database URIs During Ingestion.
We will import documents from flat files, so the default URI is the absolute pathname of the input file. For example, if your work area is /space/gs on Linux or C:\gs on Windows, then the default URI when you import documents from gs/import is as follows:
Linux: /space/gs/import/filename
Windows: /c:/gs/import/filename
You can use the -output_uri_replace option to strip off the portion of the URI that comes from the path steps before gs. The option argument is of the form pattern,replacement_text. For example, given the default URIs shown above, we'll add the following option to create URIs that begin with /gs:
Linux: -output_uri_replace "/space,''"
Windows: -output_uri_replace "/c:,''"
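To see what URIs to expect before running an import, you can preview the prefix removal locally. The helper below is only an approximation written for this walkthrough: preview_uri is a hypothetical name, it strips a literal leading prefix using shell parameter expansion, and it does not reproduce how mlcp itself matches the pattern.

```shell
# Sketch: preview the URI that results from removing a leading path prefix,
# mimicking -output_uri_replace "PREFIX,''" for the simple cases shown above.
# Literal prefix match only; mlcp performs the real rewriting.
preview_uri() {
  prefix="$1"    # text to remove from the front, e.g. /space or /c:
  uri="$2"       # default URI, e.g. /space/gs/import/one.xml
  printf '%s\n' "${uri#"$prefix"}"
}
```

For example, preview_uri /space /space/gs/import/one.xml prints /gs/import/one.xml, which matches the URI form used in the rest of this chapter.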
Run the following command from the root of your work area (gs) to load all the files in the import directory. Modify the argument to -output_uri_replace to match your environment.
Linux:
  mlcp.sh import -options_file conn.txt \
    -output_uri_replace "/space,''" -input_file_path import
Windows:
  mlcp.bat import -options_file conn.txt ^
    -output_uri_replace "/c:,''" -input_file_path import
The output from mlcp should look similar to the following (but with a timestamp prefix on each line). OUTPUT_RECORDS_COMMITTED: 2 indicates mlcp loaded two files. For more details, see Understanding mlcp Output.
INFO contentpump.LocalJobRunner: Content type is set to MIXED. The format of the inserted documents will be determined by the MIME type specification configured on MarkLogic Server.
INFO input.FileInputFormat: Total input paths to process : 2
INFO contentpump.LocalJobRunner: completed 100%
INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
INFO contentpump.LocalJobRunner: Total execution time: 0 sec
Optionally, use Query Console's Explore feature to examine the contents of the Documents database and see that the documents were created. You should see documents with the following URIs:
/gs/import/one.xml
/gs/import/two.json
You can also create documents from files in a compressed file and from other types of input archives. For details, see Importing Content Into MarkLogic Server.
Use the mlcp export command to export documents from a MarkLogic Server database into files on your filesystem. You can export documents to several formats, including files, compressed files, and database archives. For details, see Exporting Content from MarkLogic Server.
You can identify the documents to export in several ways, including by URI, by directory, by collection, and by XPath expression. This example uses a directory filter. Recall that the input documents were loaded with URIs of the form /gs/import/filename. Therefore we can easily extract the files by database directory using -directory_filter /gs/import/.
This example exports documents from the default database associated with the App Server on port 8000. Use the -database option to export documents from a different database.
Use the following procedure to export the documents inserted in Load Documents.
1. If you are not already at the top level of your work area, change to the gs folder created in Prepare to Run the Examples. For example:
   cd gs
2. Export the documents to a directory named export. The export directory must not already exist.
   Linux:
     mlcp.sh export -options_file conn.txt -output_file_path export \
       -directory_filter /gs/import/
   Windows:
     mlcp.bat export -options_file conn.txt -output_file_path export ^
       -directory_filter /gs/import/
You should see output similar to the following, but with a timestamp prefix on each line. The OUTPUT_RECORDS: 2 line indicates mlcp exported 2 files.
INFO mapreduce.MarkLogicInputFormat: Fetched 1 forest splits.
INFO mapreduce.MarkLogicInputFormat: Made 1 splits.
INFO contentpump.LocalJobRunner: completed 100%
INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: Total execution time: 0 sec
The exported documents are in gs/export. A filesystem directory is created for each directory step in the original document URI. Therefore, you should now have the following directory structure:
gs/
  export/
    gs/
      import/
        one.xml
        two.json
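One way to confirm the export landed where expected is to map each document URI to a path under the export directory and test for the file. The following is a rough sketch for the simple case described above, where each URI path step becomes a directory; exported_path and check_exported are hypothetical helper names, not mlcp features.

```shell
# Sketch: compute where a document with a given URI should appear on disk,
# assuming one filesystem directory per URI path step.
exported_path() {
  printf '%s%s\n' "${1%/}" "$2"    # $1 = export dir, $2 = document URI
}

# Report each expected file as ok or missing; return non-zero if any is missing.
check_exported() {
  dir="$1"; shift
  status=0
  for u in "$@"; do
    p="$(exported_path "$dir" "$u")"
    if [ -f "$p" ]; then
      echo "ok      $p"
    else
      echo "missing $p"
      status=1
    fi
  done
  return "$status"
}
```

From the gs directory, check_exported export /gs/import/one.xml /gs/import/two.json would report on both sample documents.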
The output from mlcp varies depending on the operation (import, export, copy, extract), but usually looks similar to the following (with a timestamp prefix on each line). The following example is output from an import job.
INFO contentpump.LocalJobRunner: Content type is set to MIXED. The format of the inserted documents will be determined by the MIME type specification configured on MarkLogic Server.
INFO input.FileInputFormat: Total input paths to process : 2
INFO contentpump.LocalJobRunner: completed 100%
INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.ContentPumpStats:
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
INFO contentpump.LocalJobRunner: Total execution time: 0 sec
The following table summarizes the purpose of key pieces of information reported by mlcp:
Message | Description |
---|---|
Content type is set to format X. | Import only. This indicates the type of documents mlcp will create. The default is MIXED, which means mlcp will base the type on the input file suffix. For details, see How mlcp Determines Document Type. |
Total input paths to process : N | Import only. Found N candidate input sources. If this number is 0, then the pathname you supplied to -input_file_path does not contain any data that meets your import criteria. If you're unable to diagnose the cause, refer to Troubleshooting. |
INPUT_RECORDS: N | The number of inputs mlcp actually tried to process. For an import operation, this is the number of documents mlcp attempted to create. For an export operation, this is the number of documents mlcp attempted to export. If there are errors, this number may not correspond to the actual number of documents imported, exported, copied, or extracted. This number can be larger or smaller than the total input paths. For example, if you import from a compressed file that includes directories, the directories count towards total input paths, but mlcp will only attempt to create documents from the file entries, so total paths will be larger than the attempted records. Similarly, if you're loading aggregate XML files and splitting them into multiple documents, then total input paths reflects the number of aggregate files, while the attempted records reflects the number of documents created from the aggregates, so total paths is less than attempted records. |
ESTIMATED_INPUT_RECORDS: N | Export and copy only. The estimated number of input records, based on job parameters such as -document_selector and -input_query. This number will be larger than INPUT_RECORDS if errors occur while fetching documents from MarkLogic or when the database is configured to use fragment roots. For example, if the source database contains N documents matching the job parameters, but a host in the cluster becomes unavailable during the job, then the actual number of documents mlcp attempts to process can be some M < N. In such a case, ESTIMATED_INPUT_RECORDS reflects N, while INPUT_RECORDS reflects M. |
OUTPUT_RECORDS: N | On import, the number of documents (records) sent to MarkLogic for insertion into the database. This number can be smaller than INPUT_RECORDS if errors are detected on the client that cause a record to be skipped. On export, the number of output files mlcp successfully created. |
OUTPUT_RECORDS_COMMITTED: N | Import only. The number of documents committed to the database. This number can be larger or smaller than OUTPUT_RECORDS. For example, it will be smaller if an error is detected on MarkLogic Server or larger if a server-side transformation creates multiple documents from a single input document. |
OUTPUT_RECORDS_FAILED: N | Import only. The number of documents (records) rejected by MarkLogic Server. This number does not include failures detected by mlcp on the client. |
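When mlcp runs from a script, it can help to read these counters programmatically. The sketch below assumes you have saved the mlcp output to a file and that each counter appears as NAME: N in whitespace-separated fields, as in the sample output above; mlcp_counter and import_had_no_failures are hypothetical helper names, not part of mlcp.

```shell
# Sketch: print the value of one counter from saved mlcp output.
# Matches the field "NAME:" exactly, so INPUT_RECORDS does not pick up
# ESTIMATED_INPUT_RECORDS. If the counter appears more than once,
# the last value wins.
mlcp_counter() {
  awk -v want="$1:" '
    { for (i = 1; i < NF; i++) if ($i == want) value = $(i + 1) }
    END { if (value != "") print value }
  ' "$2"
}

# Succeed only if the saved import output reports zero documents rejected
# by the server. Records skipped on the client are not counted here;
# compare INPUT_RECORDS with OUTPUT_RECORDS to spot those.
import_had_no_failures() {
  [ "$(mlcp_counter OUTPUT_RECORDS_FAILED "$1")" = "0" ]
}

# Example use, assuming the output was captured to import.log:
#   mlcp_counter OUTPUT_RECORDS_COMMITTED import.log
#   import_had_no_failures import.log || echo "some documents were rejected"
```

Because OUTPUT_RECORDS_COMMITTED can legitimately be larger or smaller than OUTPUT_RECORDS, the sketch checks only the failure counter rather than requiring the two to be equal.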
Note that if you stop a job prematurely, some work might continue.
When you use mlcp in distributed mode, mlcp distributes its work across a Hadoop cluster. Interrupting the local mlcp client does not cause work to stop on the Hadoop cluster. In local mode, an interrupted job will shut down gracefully as long as it can finish within 30 seconds. If that time period expires, mlcp prints a warning.