Ingest Using MLCP
MarkLogic Content Pump (MLCP) is a standalone Java utility provided by MarkLogic.
To ingest with MLCP, you typically use two flows:
- one containing only the ingestion step using MLCP, and
- one with mapping and mastering (combined or split) steps using other tools.
Before you begin
You need:
- MLCP
- A MarkLogic Data Hub project
Procedure
- Create a flow.
- Create an ingestion step.
Note: Parameter values that you pass to MLCP on the command line override the settings in the ingestion step.
- Configure the ingestion step, and copy the MLCP Command code.
After choosing the settings for your ingestion step, copy the contents of the MLCP Command field to your clipboard.
- Open a command-line window, and navigate to your MLCP project root directory.
- At your project root, run the mlcp command with the -transform* parameters.
  - Using QuickStart: Paste the MLCP Command code to the command line and run it.
  - Using Gradle: Copy and paste the following code to the command line, customize it, and run it.
mlcp.sh import \
  -host "localhost" -port "8010" \
  -username "flow-operator-user-account" -password "*****" \
  -input_file_path "/path/to/your/input/directory" \
  -input_file_type "delimited_text" \
  -output_collections "ingestion-only,input" \
  -output_permissions "data-hub-common,read,data-hub-common,update" \
  -generate_uri "true" \
  -output_uri_replace ".*input,'/ingestion-flow/json/'" \
  -document_type "json" \
  -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs" \
  -transform_param "flow-name=MyFlow,step=1"

mlcp.bat import ^
  -host "localhost" -port "8010" ^
  -username "flow-operator-user-account" -password "*****" ^
  -input_file_path "/path/to/your/input/directory" ^
  -input_file_type "delimited_text" ^
  -output_collections "ingestion-only,input" ^
  -output_permissions "data-hub-common,read,data-hub-common,update" ^
  -generate_uri "true" ^
  -output_uri_replace ".*input,'/ingestion-flow/json/'" ^
  -document_type "json" ^
  -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs" ^
  -transform_param "flow-name=MyFlow,step=1"
-host: The host for the MarkLogic Server instance.
-port: The port for the MarkLogic Server instance.
-input_file_path: The location of your source files.
-input_file_type: The format of your source files: Text, JSON, XML, Binary, or Delimited Text. See MLCP: Supported Input Format Summary.
-generate_uri: If true, a custom URI is associated with each new record.
-output_collections: A comma-separated string containing the collection tags to associate with the new records.
-output_permissions: A comma-separated list of role,capability pairs that control access to the ingested data; for example, data-hub-common,read,data-hub-common,update.
-output_uri_replace: A comma-separated list of replacements used to customize the URIs of the ingested records.
  The list consists of regular expression patterns and their replacement strings in the format pattern,'string',pattern,'string',.... The replacement strings must be enclosed in single quotes.
  For example, if the original URI is in the form "/foo/bar/filename", you can customize it to be "/mydir/filename" using the following comma-separated list:
  /foo/bar,'/mydir'
  Java's regular expression language is supported.
  If Source File Type is set to CSV, the substitution is based on the absolute path of the parent folder; otherwise, the absolute path of the input file. For example, if the Windows path is c:\path\filename, the substitution is based on /c/path/filename.
  Learn more: Transforming the Default URI.
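The /foo/bar example for -output_uri_replace can be previewed locally before running MLCP. The sketch below is only an approximation: it uses sed with POSIX extended regular expressions instead of Java's regex engine, so it should match MLCP's behavior for a simple literal pattern like this one but not necessarily for more advanced syntax. The function name preview_uri_replace is hypothetical and is not part of MLCP or Data Hub.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MLCP): preview one pattern,'string' pair
# from -output_uri_replace against a sample URI.
# Assumption: sed -E (POSIX ERE) stands in for Java regex, so only simple
# patterns are expected to behave the same way.
preview_uri_replace() {
  local uri="$1" pattern="$2" replacement="$3"
  # '#' is the sed delimiter so that slashes in paths need no escaping.
  # Only the first match is replaced, which is enough for this preview.
  printf '%s\n' "$uri" | sed -E "s#${pattern}#${replacement}#"
}

# The pair /foo/bar,'/mydir' applied to the sample URI from the text.
preview_uri_replace "/foo/bar/filename" "/foo/bar" "/mydir"
```

Running the script prints /mydir/filename, the customized URI described in the example.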
-document_type: The format of the processed record: Text, JSON, XML, or Binary.
  Valid values: json, xml, text, binary
  If you specify a document type that is different from the input file type, MLCP handles the transformation.
-transform_module: Must be the URI /data-hub/5/transforms/mlcp-flow-transform.sjs. This module is added to the MODULES database when Data Hub is installed.
-transform_param: A comma-delimited list of key-value pairs to be passed to the /data-hub/5/transforms/mlcp-flow-transform.sjs module. Valid keys:
  - flow-name: The human-friendly name of the flow.
  - step: The number which represents the order of the step in the sequence.
  - job-id: (Optional) A unique job ID to associate with the flow run. This option can be used if the flow run is part of a larger process (e.g., a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID will be assigned.
  - options: (Optional) Additional options to pass to the flow. Must be a JSON object.
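When the MLCP call is scripted, the -transform_param value can be assembled from its keys rather than typed by hand. The following is a minimal sketch, assuming a flow named MyFlow as in the command above; the job ID nifi-run-0001 is a made-up placeholder for an ID supplied by an external orchestrator.

```shell
#!/usr/bin/env bash
# Sketch: assemble the value for -transform_param from its parts.
FLOW_NAME="MyFlow"        # flow-name, as in the command above
STEP="1"                  # position of the ingestion step in the flow
JOB_ID="nifi-run-0001"    # hypothetical external job ID; set to "" to let Data Hub assign one

TRANSFORM_PARAM="flow-name=${FLOW_NAME},step=${STEP}"
# job-id is optional, so append it only when a value was supplied.
if [ -n "${JOB_ID}" ]; then
  TRANSFORM_PARAM="${TRANSFORM_PARAM},job-id=${JOB_ID}"
fi

# Print the argument pair to paste into the mlcp command.
printf -- '-transform_param "%s"\n' "${TRANSFORM_PARAM}"
```

The options key is left out of this sketch because its JSON value needs quoting that depends on the shell in use.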