Ingest Using MLCP
MarkLogic Content Pump (MLCP) is a standalone Java utility provided by MarkLogic. To ingest using MLCP, you need two flows:
- one containing only the ingestion step, run using MLCP, and
- one with the mapping and mastering (combined or split) steps, run using other tools.
Before you begin
You need:
- MLCP
- A MarkLogic Data Hub project
Procedure
- Create a flow.
- Create an ingestion step.
Note: The parameter values you pass on the MLCP command line override the settings in the ingestion step.
- Open a command-line window, and navigate to your project root directory.
- At your project root, run the mlcp command with the -transform* parameters.
Note: MLCP must generate unique URIs. Setting -generate_uri "true" is recommended. Otherwise, specify -uri_id with a column that contains unique values; see the sketch after the commands below.
- Copy and paste the following code to the command line, customize it, and run it.
On Unix:

mlcp.sh import \
  -host "localhost" -port "8010" \
  -username "flow-operator-user-account" -password "*****" \
  -input_file_path "/path/to/your/input/directory" \
  -input_file_type "delimited_text" \
  -output_collections "ingestion-only,input" \
  -output_permissions "data-hub-common,read,data-hub-common,update" \
  -generate_uri "true" \
  -output_uri_replace ".*input,'/ingestion-flow/json/'" \
  -document_type "json" \
  -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs" \
  -transform_param "flow-name=MyFlow,step=1"

On Windows:

mlcp.bat import ^
  -host "localhost" -port "8010" ^
  -username "flow-operator-user-account" -password "*****" ^
  -input_file_path "/path/to/your/input/directory" ^
  -input_file_type "delimited_text" ^
  -output_collections "ingestion-only,input" ^
  -output_permissions "data-hub-common,read,data-hub-common,update" ^
  -generate_uri "true" ^
  -output_uri_replace ".*input,'/ingestion-flow/json/'" ^
  -document_type "json" ^
  -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs" ^
  -transform_param "flow-name=MyFlow,step=1"
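If your delimited source files already contain a column of unique values, a minimal variant of the Unix command above uses -uri_id instead of -generate_uri, as noted earlier. The column name customer_id is hypothetical; substitute a column from your own data.

# Sketch: derive each record's URI from an existing unique column.
# "customer_id" is a hypothetical column name in the delimited source files.
mlcp.sh import \
  -host "localhost" -port "8010" \
  -username "flow-operator-user-account" -password "*****" \
  -input_file_path "/path/to/your/input/directory" \
  -input_file_type "delimited_text" \
  -output_collections "ingestion-only,input" \
  -output_permissions "data-hub-common,read,data-hub-common,update" \
  -uri_id "customer_id" \
  -output_uri_replace ".*input,'/ingestion-flow/json/'" \
  -document_type "json" \
  -transform_module "/data-hub/5/transforms/mlcp-flow-transform.sjs" \
  -transform_param "flow-name=MyFlow,step=1"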
-host
- The host for the MarkLogic Server instance.
-port
- The port for the MarkLogic Server instance.
-input_file_path
- The location of your source files.
-input_file_type
- The format of your source files: Text, JSON, XML, Binary, or Delimited Text. See MLCP: Supported Input Format Summary.
-generate_uri
- If true, MLCP generates a unique URI for each new record.
-output_collections
- A comma-separated string containing the collection tags to associate with the new records.
-output_permissions
- A comma-separated list of role,capability pairs that set the permissions on the ingested records; see the sketch after this entry.
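A minimal sketch of the pair format; data-hub-common is taken from the commands above, while pii-reader is a hypothetical role added only to illustrate a third pair.

# Each pair is role,capability; capabilities include read, insert, update, and execute.
# "pii-reader" is a hypothetical role used only for illustration.
-output_permissions "data-hub-common,read,data-hub-common,update,pii-reader,read"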
-output_uri_replace
- A comma-separated list of replacements used to customize the URIs of the ingested records. The list is comprised of regular expression patterns and their replacement strings in the format pattern,'string',pattern,'string',.... The replacement strings must be enclosed in single quotes. Java's regular expression language is supported.
For example, if the original URI is in the form "/foo/bar/filename", you can customize it to be "/mydir/filename" using the following comma-separated list: /foo/bar,'/mydir'
If the input file type is delimited text (CSV), the substitution is based on the absolute path of the parent folder; otherwise, on the absolute path of the input file. For example, if the Windows path is c:\path\filename, the substitution is based on /c/path/filename. See the before/after sketch following this entry. Learn more: Transforming the Default URI.
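A minimal before/after sketch of the /foo/bar,'/mydir' replacement described above; the file name customers.json is illustrative.

# Illustrative only: -output_uri_replace "/foo/bar,'/mydir'"
# Original URI:   /foo/bar/customers.json
# Rewritten URI:  /mydir/customers.json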
-document_type
- The format of the processed record: Text, JSON, XML, or Binary. Valid values: json, xml, text, binary. If you specify a document type that is different from the input file type, MLCP handles the transformation.
-transform_module
- Must be the URI /data-hub/5/transforms/mlcp-flow-transform.sjs. This module is added to the MODULES database when Data Hub is installed.
-transform_param
- A comma-delimited list of key-value pairs to be passed to the /data-hub/5/transforms/mlcp-flow-transform.sjs module. Valid keys:
flow-name. The human-friendly name of the flow.
step. The number that represents the position of the step in the flow sequence.
job-id. (Optional) A unique job ID to associate with the flow run. Use this option when the flow run is part of a larger process (for example, a process orchestrated by NiFi with its own job/process ID). Must not be the same as an existing Data Hub job ID. If not provided, a unique Data Hub job ID is assigned.
options. (Optional) Additional options to pass to the flow. Must be a JSON object; see the sketch after this list.
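A minimal sketch of passing the optional keys; the job-id value and the sourceName option are hypothetical, and the escaping required for the embedded JSON depends on your shell.

# Hypothetical values shown for illustration; adjust quoting for your shell.
-transform_param "flow-name=MyFlow,step=1,job-id=my-nifi-job-123,options={\"sourceName\":\"mySource\"}"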