Ingest Using MLCP

MarkLogic Content Pump (MLCP) is a standalone Java utility provided by MarkLogic.

You can use MLCP to ingest data into MarkLogic Server. You will need two flows:
  • one containing only the ingestion step, and
  • one with mapping and mastering (combined or split) steps.

Before you begin

You need:

  • MLCP
  • A MarkLogic Data Hub project

Procedure

  1. Create a flow.
  2. Create an ingestion step.
    Note: Parameter values that you pass to MLCP on the command line override the settings in the ingestion step.
  3. Configure the ingestion step.

    After choosing the settings for your ingestion step, copy the contents of the MLCP Command field to your clipboard.

  4. Open a command-line window, and navigate to your project root directory.
  5. At your project root, run the mlcp command with the -transform_module and -transform_param parameters.
    • Using QuickStart: Paste the MLCP Command code to the command line and run it.
    • Using Gradle: Copy the following code for your operating system (mlcp.sh for Linux or macOS, mlcp.bat for Windows), paste it to the command line, customize it, and run it.
         mlcp.sh import \
          -host "localhost" -port "8010"  \
          -username "flow-operator-user-account" -password "*****"  \
          -input_file_path "/path/to/your/input/directory"  \
          -input_file_type "delimited_text"  \
          -output_collections "ingestion-only,input"  \
          -output_permissions "rest-reader,read,rest-writer,update"  \
          -generate_uri "true"  \
          -output_uri_replace ".*input,'/ingestion-flow/json/'"  \
          -document_type "json"  \
          -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs"  \
          -transform_param "flow-name=MyFlow,step=1"
      
         mlcp.bat import ^
          -host "localhost" -port "8010"  ^
          -username "flow-operator-user-account" -password "*****"  ^
          -input_file_path "/path/to/your/input/directory"  ^
          -input_file_type "delimited_text"  ^
          -output_collections "ingestion-only,input"  ^
          -output_permissions "rest-reader,read,rest-writer,update"  ^
          -generate_uri "true"  ^
          -output_uri_replace ".*input,'/ingestion-flow/json/'"  ^
          -document_type "json"  ^
          -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs"  ^
          -transform_param "flow-name=MyFlow,step=1"
      
    -host
    The host for the MarkLogic Server instance.
    -port
    The port for the MarkLogic Server instance.
    -input_file_path
    The location of your source files.
    -input_file_type
    The format of your source files: Text, JSON, XML, Binary, or Delimited Text. See MLCP: Supported Input Format Summary.
    -generate_uri
    If true, MLCP automatically generates a unique URI for each new record.
    -output_collections
    A comma-separated list of the collections to associate with the new records.
    -output_permissions
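    A comma-separated list of roles and the capabilities granted to them on the ingested data, in the format role,capability,role,capability,.... For example, rest-reader,read,rest-writer,update.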
    -output_uri_replace

    A comma-separated list of replacements used to customize the URIs of the ingested records. The list consists of regular expression patterns and their replacement strings in the format pattern,'string',pattern,'string',.... The replacement strings must be enclosed in single quotes.

    See Transforming the Default URI.

    -document_type
    The format of the processed record. Valid values:
    • json
    • xml
    • text
    • binary

    If you specify a document type that is different from the input file type, MLCP handles the transformation.

    -transform_module
    Must be the URI /data-hub/5/transforms/mlcp-flow-transform.sjs. This module is added to the MODULES database when Data Hub is installed.
    -transform_param

    A comma-separated list of key-value pairs to be passed to the /data-hub/5/transforms/mlcp-flow-transform.sjs module. Valid keys:

    • flow-name. The human-friendly name of the flow.
    • step. The number that represents the position of the step in the flow's sequence.
    • job-id. (Optional) A unique job ID to associate with the flow run. This option can be used if the flow run is part of a larger process (e.g., a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID will be assigned.
    • options. (Optional) Additional options to pass to the flow. Must be a JSON object.
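
If you run MLCP from a script, it may help to assemble the -transform_param value in one place. The following is a minimal sketch, not part of Data Hub or MLCP: the build_transform_param function and the nightly-load-20240101 job ID are hypothetical names chosen for illustration, and the sketch covers only the flow-name, step, and job-id keys described above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Data Hub or MLCP):
# builds the comma-separated value for -transform_param.
# Usage: build_transform_param FLOW_NAME STEP_NUMBER [JOB_ID]
build_transform_param() {
  flow_name=$1
  step_number=$2
  job_id=${3:-}

  # step is the step's position in the flow, so reject empty values,
  # anything containing a non-digit character, and a bare 0.
  case $step_number in
    ''|*[!0-9]*|0)
      echo "step must be a number greater than 0" >&2
      return 1
      ;;
  esac

  param="flow-name=${flow_name},step=${step_number}"

  # job-id is optional. When it is omitted, Data Hub assigns its own job ID.
  if [ -n "$job_id" ]; then
    param="${param},job-id=${job_id}"
  fi

  printf '%s\n' "$param"
}

# Example: tag the run with an ID supplied by an external scheduler
# (the ID value here is an assumption made for illustration).
TRANSFORM_PARAM=$(build_transform_param "MyFlow" 1 "nightly-load-20240101")
echo "-transform_param \"${TRANSFORM_PARAM}\""
```

You could then pass "${TRANSFORM_PARAM}" as the value of -transform_param in the mlcp.sh command shown earlier. The job ID you supply must not match an existing Data Hub job ID.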

What to do next

Create another flow, either using QuickStart or using Gradle. Then add mapping and mastering (combined or split) steps to enhance the ingested data.