Using MarkLogic Content Pump (mlcp)

Load Documents

Load documents into a MarkLogic Server database using the mlcp import command. The examples in this section load documents from flat files into the default database associated with the App Server on port 8000 (the Documents database).

Other input options include compressed files, delimited text files, aggregate XML data, and line-delimited JSON data. See Importing Content into MarkLogic Server for details. You can also load documents into a different database using the -database option.

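To load a single file, specify the path to the file as the value of -input_file_path. To load all the files in a directory, specify the path to the directory instead. For example, the following option selects the import directory: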

-input_file_path import

When you load documents, a default URI is generated based on the type of input data. For details, see Controlling Database URIs During Ingestion.

We will import documents from flat files, so the default URI is the absolute pathname of the input file. For example, if your work area is /space/gs on Linux or C:\gs on Windows, then the default URI when you import documents from gs/import is as follows:

Linux: /space/gs/import/filename
Windows: /c:/gs/import/filename

You can use the -output_uri_replace option to strip off the portion of the URI that comes from the path steps before “gs”. The option argument is of the form “pattern,replacement_text”. For example, given the default URIs shown above, we’ll add the following option to create URIs that begin with “/gs”:

Linux: -output_uri_replace "/space,''"
Windows: -output_uri_replace "/c:,''"
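
To preview what this option does to a URI before running an import, you can approximate the substitution outside mlcp. The following is an illustrative sketch only: it uses sed as a stand-in for the pattern matching that mlcp performs itself, the variable names are made up for the example, and one.xml is just a sample file name.

```shell
# Default URIs, as generated from the absolute path of a file under gs/import.
linux_uri="/space/gs/import/one.xml"
windows_uri="/c:/gs/import/one.xml"

# Replace the leading path steps with an empty string, mirroring
# "/space,''" and "/c:,''". sed is only a stand-in here; mlcp applies
# the pattern itself during import.
new_linux_uri=$(printf '%s\n' "$linux_uri" | sed 's|/space||')
new_windows_uri=$(printf '%s\n' "$windows_uri" | sed 's|/c:||')

echo "$new_linux_uri"
echo "$new_windows_uri"
```

Both lines print /gs/import/one.xml, which is the form of URI the import in this section is expected to create.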

Run the following command from the root of your work area (gs) to load all the files in the import directory. Modify the argument to -output_uri_replace to match your environment.

Linux: 
  mlcp.sh import -options_file conn.txt \
    -output_uri_replace "/space,''" -input_file_path import
Windows:
  mlcp.bat import -options_file conn.txt ^
    -output_uri_replace "/c:,''" -input_file_path import
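
As noted earlier, the same import can target a database other than the default by adding the -database option. The sketch below shows where that option might go in the Linux form of the command. The database name my-database is hypothetical, and the script only prints the command so that you can review it before running it yourself.

```shell
# Hypothetical database name; replace it with a database that exists
# on your MarkLogic Server host.
target_db="my-database"

# Compose the import command. It is printed rather than executed.
cmd="mlcp.sh import -options_file conn.txt -database $target_db -output_uri_replace \"/space,''\" -input_file_path import"
echo "$cmd"
```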

The output from mlcp should look similar to the following (but with a timestamp prefix on each line). “OUTPUT_RECORDS_COMMITTED: 2” indicates mlcp loaded two files. For more details, see Understand mlcp Output.

INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of 
  the  inserted documents will be determined by the MIME  type specification 
  configured on MarkLogic Server.
INFO input.FileInputFormat: Total input paths to process : 2
INFO contentpump.LocalJobRunner:  completed 100%
INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
INFO contentpump.LocalJobRunner: Total execution time: 0 sec
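
If you run mlcp from a script, you may prefer to check the counters programmatically rather than by eye. The following is a minimal sketch, assuming you have captured the mlcp output in a file. The file name mlcp.log is hypothetical, and the sample lines written to it are copied from the output above so that the snippet runs on its own.

```shell
# Sample log lines copied from the output above; in practice this file
# would hold the output captured from your own mlcp run.
cat > mlcp.log <<'EOF'
INFO contentpump.LocalJobRunner: INPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 2
INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
EOF

# Extract the number that follows each counter name. The leading ".*"
# also skips any timestamp prefix on the line.
committed=$(sed -n 's/.*OUTPUT_RECORDS_COMMITTED: *\([0-9][0-9]*\).*/\1/p' mlcp.log)
failed=$(sed -n 's/.*OUTPUT_RECORDS_FAILED: *\([0-9][0-9]*\).*/\1/p' mlcp.log)

if [ "$failed" -eq 0 ]; then
  echo "Loaded $committed documents with no failures"
else
  echo "$failed documents failed to load" >&2
fi
```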

Optionally, use Query Console’s Explore feature to examine the contents of the Documents database and see that the documents were created. You should see documents with the following URIs:

/gs/import/one.xml
/gs/import/two.json

You can also create documents from the files inside a compressed file and from other types of input archives. For details, see Importing Content into MarkLogic Server.