Ingest Using MLCP
MarkLogic Content Pump (MLCP) is a standalone Java utility provided by MarkLogic. To ingest using MLCP, you need two flows:
- one containing only the ingestion step, run using MLCP, and
- one with the mapping and mastering (combined or split) steps, run using other tools.
Before you begin
You need:
- MLCP
- A MarkLogic Data Hub project
Procedure
- Create a flow.
- Create an ingestion step.
Note: The parameter values you pass on the MLCP command line override the settings in the ingestion step.
- Open a command-line window, and navigate to your project root directory.
- At your project root, run the mlcp command with the -transform* parameters.
Note: MLCP must generate unique URIs. Setting -generate_uri "true" is recommended. Otherwise, specify -uri_id with a column that contains unique values; see the sketch after the commands below.
- Copy and paste the following code to the command line, customize it, and run it.
On Unix:

mlcp.sh import \
  -host "localhost" -port "8010" \
  -username "flow-operator-user-account" -password "*****" \
  -input_file_path "/path/to/your/input/directory" \
  -input_file_type "delimited_text" \
  -output_collections "ingestion-only,input" \
  -output_permissions "data-hub-common,read,data-hub-common,update" \
  -generate_uri "true" \
  -output_uri_replace ".*input,'/ingestion-flow/json/'" \
  -document_type "json" \
  -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs" \
  -transform_param "flow-name=MyFlow,step=1"

On Windows:

mlcp.bat import ^
  -host "localhost" -port "8010" ^
  -username "flow-operator-user-account" -password "*****" ^
  -input_file_path "/path/to/your/input/directory" ^
  -input_file_type "delimited_text" ^
  -output_collections "ingestion-only,input" ^
  -output_permissions "data-hub-common,read,data-hub-common,update" ^
  -generate_uri "true" ^
  -output_uri_replace ".*input,'/ingestion-flow/json/'" ^
  -document_type "json" ^
  -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs" ^
  -transform_param "flow-name=MyFlow,step=1"
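If your delimited source files already contain a column of unique values, a minimal variant of the Unix command above uses -uri_id instead of -generate_uri, as noted earlier. The column name customer_id is hypothetical; substitute a column from your own data.

# Sketch: derive each record's URI from an existing unique column.
# "customer_id" is a hypothetical column name in the delimited source files.
mlcp.sh import \
  -host "localhost" -port "8010" \
  -username "flow-operator-user-account" -password "*****" \
  -input_file_path "/path/to/your/input/directory" \
  -input_file_type "delimited_text" \
  -output_collections "ingestion-only,input" \
  -output_permissions "data-hub-common,read,data-hub-common,update" \
  -uri_id "customer_id" \
  -output_uri_replace ".*input,'/ingestion-flow/json/'" \
  -document_type "json" \
  -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs" \
  -transform_param "flow-name=MyFlow,step=1"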
-host
- The host for the MarkLogic Server instance.
-port
- The port for the MarkLogic Server instance.
-input_file_path
- The location of your source files.
-input_file_type
- The format of your source files: Text, JSON, XML, Binary, or Delimited Text. See MLCP: Supported Input Format Summary.
-generate_uri
- If true, MLCP generates a unique URI for each new record.
-output_collections
- A comma-separated string containing the collection tags to associate with the new records.
-output_permissions
- A comma-separated list of role,capability pairs that set the permissions on the ingested records; see the sketch after this entry.
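A minimal sketch of the pair format; data-hub-common is taken from the commands above, while pii-reader is a hypothetical role added only to illustrate a third pair.

# Each pair is role,capability; capabilities include read, insert, update, and execute.
# "pii-reader" is a hypothetical role used only for illustration.
-output_permissions "data-hub-common,read,data-hub-common,update,pii-reader,read"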
-output_uri_replace
- A comma-separated list of replacements used to customize the URIs of the ingested records. The list is comprised of regular expression patterns and their replacement strings in the format pattern,'string',pattern,'string',.... The replacement strings must be enclosed in single quotes. Java's regular expression language is supported.
For example, if the original URI is in the form "/foo/bar/filename", you can customize it to be "/mydir/filename" using the following comma-separated list: /foo/bar,'/mydir'
If the input file type is delimited text (CSV), the substitution is based on the absolute path of the parent folder; otherwise, on the absolute path of the input file. For example, if the Windows path is c:\path\filename, the substitution is based on /c/path/filename. See the before/after sketch following this entry. Learn more: Transforming the Default URI.
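A minimal before/after sketch of the /foo/bar,'/mydir' replacement described above; the file name customers.json is illustrative.

# Illustrative only: -output_uri_replace "/foo/bar,'/mydir'"
# Original URI:   /foo/bar/customers.json
# Rewritten URI:  /mydir/customers.json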
-document_type
- The format of the processed record: Text, JSON, XML, or Binary. Valid values: json, xml, text, binary. If you specify a document type that is different from the input file type, MLCP handles the transformation.
-transform_module
- Must be the URI /data-hub/5/transforms/mlcp-flow-transform.sjs. This module is added to the MODULES database when Data Hub is installed.
-transform_param
- A comma-delimited list of key-value pairs to be passed to the /data-hub/5/transforms/mlcp-flow-transform.sjs module. Valid keys:
flow-name. The human-friendly name of the flow.
step. The number that represents the position of the step in the flow sequence.
job-id. (Optional) A unique job ID to associate with the flow run. Use this option when the flow run is part of a larger process (for example, a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID is assigned.
options. (Optional) Additional options to pass to the flow. Must be a JSON object; see the sketch after this list.
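A minimal sketch of passing the optional keys; the job-id value and the sourceName option are hypothetical, and the escaping required for the embedded JSON depends on your shell.

# Hypothetical values shown for illustration; adjust quoting for your shell.
-transform_param "flow-name=MyFlow,step=1,job-id=my-nifi-job-123,options={\"sourceName\":\"mySource\"}"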