MarkLogic Data Hub Service Tools

MarkLogic Data Hub Service tools that you can use with your service.

MarkLogic Connectors

MarkLogic Connector for AWS Glue

Version: 1.0.0

Version Compatibility

MarkLogic Connector for AWS Glue version: 1.0.0
  • Environment: DHS
    Note: 10.0-5 up to the latest 10.x release is required to use export functionality.
  • Environment: On-Premises
    Note: 10.0-5 up to the latest 10.x release is required to use export functionality.

The MarkLogic Connector for AWS Glue can be used within AWS Glue to ingest data into, and export data from, MarkLogic Server databases. The connector assumes you are hosting your data in MarkLogic Data Hub Service and is configured using AWS Glue Studio. It can be used in AWS Glue scripts written in either Python or Scala, and its options can be stored in AWS Secrets Manager instead of in your AWS Glue scripts. The connector can read from any data source that AWS Glue supports and write to MarkLogic (the data target); it always writes documents in JSON format. The MarkLogic Connector for AWS Glue is available on the AWS Marketplace.

Note: MarkLogic recommends storing the mlUsername and mlPassword options in AWS Secrets Manager.
Note: If hosting your data in your own on-premises servers, you must set the values as indicated in the ingest and export options.
Note: If you do not configure the connector using AWS Glue Studio, you must add your own JAR file as a dependent JAR for your AWS Glue script.

You only need to subscribe to the MarkLogic Connector for AWS Glue once per AWS account. If you change regions, you must create a new connector and connection, but you do not need to resubscribe to the MarkLogic Connector for AWS Glue.

If connecting to a service on private endpoints (a VPC peered network):

  • An Amazon VPC endpoint is required.
  • The connector, connection, and AWS Glue job must be in the same region as your client-side VPC.
  • You must provide additional information about your client-side VPC when you create the connection: the VPC ID (vpc-*), one private subnet CIDR (subnet-*), and one security group (sg-*).
  • You must create an inbound rule for the security group you provided. The inbound rule must have the following options:
    • Type: All Traffic
    • Protocol: All
    • Port Range: All
    • Source: The ID of the security group (sg-*)
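The inbound rule above can be sketched with boto3. This is illustrative only: the security group ID is a placeholder, and the actual API call is commented out because it requires AWS credentials. `IpProtocol` of `"-1"` is how the EC2 API expresses "All Traffic" with all protocols and ports.

```python
security_group_id = "sg-0123456789abcdef0"  # placeholder: the security group you provided

ingress_params = {
    "GroupId": security_group_id,
    "IpPermissions": [
        {
            "IpProtocol": "-1",  # Type: All Traffic, Protocol: All, Port Range: All
            "UserIdGroupPairs": [
                {"GroupId": security_group_id}  # Source: the same security group's ID
            ],
        }
    ],
}

# Applying the rule requires AWS credentials, e.g.:
# import boto3
# boto3.client("ec2").authorize_security_group_ingress(**ingress_params)
```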

The Data Hub properties can also be set when using the connector. For the full list of Data Hub properties, see Data Hub Properties.

Note: By default, the MarkLogic Connector for AWS Glue assumes it is connecting to MarkLogic Data Hub Service. The connector assumes that basic authentication and Secure Socket Layer (SSL)/Transport Layer Security (TLS) are used, mlIsProvisionedEnvironment is true, mlStagingPort is 8010, and hubDhs is true.

Ingest and Export options can be set when using the connector. For the full list of ingest and export options, see Connector Options: Ingest and Export.

Important: When using the connector with MarkLogic Data Hub Service, you must set all required ingest and export options. If you stored any options in AWS Secrets Manager, you must set the secretId option.
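One way to set up the recommended Secrets Manager storage is sketched below. The secret name, username, and password are placeholders; the key names (mlUsername, mlPassword) match the connector options above, and the secret creation itself is commented out because it requires AWS credentials.

```python
import json

# Hypothetical secret name; pass this to the connector as the secretId option.
secret_id = "dhs-glue-connector-credentials"

# Key-value pairs the connector reads from the secret instead of the script.
secret_string = json.dumps({
    "mlUsername": "my-service-user",      # placeholder
    "mlPassword": "my-service-password",  # placeholder
})

# Creating the secret requires AWS credentials, e.g.:
# import boto3
# boto3.client("secretsmanager").create_secret(Name=secret_id, SecretString=secret_string)
```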

MarkLogic Connector for Apache Spark

Version: 1.0.0

Version Compatibility

MarkLogic Connector for Apache Spark version: 1.0.0
  • Environment: DHS
    Note: 10.0-5 up to the latest 10.x release is required to use export functionality.
  • Environment: On-Premises
    Note:
      • If using Linux, 10.0-5 up to the latest 10.x release is required to use export functionality.
      • If using Windows or Mac, 10.0-5 or 10.0-5.1 is required to use export functionality.

The MarkLogic Connector for Apache Spark can be used to ingest data into, and export data from, MarkLogic Server databases. The connector assumes you are hosting your data in MarkLogic Data Hub Service. It can be used in AWS Glue scripts written in either Python or Scala, can read from any data source that Apache Spark supports, and writes to MarkLogic (the data target) always as JSON documents. To begin using the connector, download the JAR file and copy it to your Spark environment. For instructions on configuring and using the MarkLogic Connector for Apache Spark in AWS Glue, see Getting Started with the MarkLogic Connector for Apache Spark.

Note: For an example MarkLogic Connector for Apache Spark project in a local Spark environment, see Spark Test Project.
Note: If hosting your data in your own on-premises servers, you must set the values as indicated in the ingest and export options.

The Data Hub properties can also be set when using the connector. For the full list of Data Hub properties, see Data Hub Properties.

Note: By default, the MarkLogic Connector for Apache Spark assumes it is connecting to MarkLogic Data Hub Service. The connector assumes that basic authentication and Secure Socket Layer (SSL)/Transport Layer Security (TLS) are used, mlIsProvisionedEnvironment is true, mlStagingPort is 8010, and hubDhs is true.

Ingest and Export options can be set when using the connector. For the full list of ingest and export options, see Connector Options: Ingest and Export.

Important: When using the connector with MarkLogic Data Hub Service, you must set all required ingest and export options.
Example: Using Ingest Options
 writer = [DataFrame].write.format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")

// Required options
.option("mlHost", [Host])
.option("mlUsername", [Username])
.option("mlPassword", [Password])

// Set the following option to false if you are trying on localhost
// .option("hubDhs", "false")

// Optional options
.option("mlStagingPort", [mlstagingport])
.option("permissions", [permissions])
.option("additionalexternalmetadata", [additionalexternalmetadata])
.option("collections", [collections])
.option("uriprefix", [uriprefix])
.option("uritemplate", [uritemplate])
.option("sourcename", [sourcename])
.option("sourcetype", [sourcetype])
.option("writerecordsendpointparams", [writerecordsendpointparams])
.option("initializewriteapipath", [initializewriteapipath])
.option("finalizewriteapipath", [finalizewriteapipath])

.save();
Example: Using Export Options
 SparkSession session = SparkSession.builder().getOrCreate();
SQLContext sqlContext = new SQLContext(session);

Dataset<Row> rows = sqlContext.read()

.format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")

// Required options
.option("mlHost", [Host])
.option("mlUsername", [Username])
.option("mlPassword", [Password])
.option("view", [View])

// Set the following option to false if you are trying on localhost
// .option("hubDhs", "false")

// Optional options
.option("schema", [schema])
.option("sqlcondition", [sqlcondition])
.option("selectedcolumns", [selectedcolumns])
.option("serializedplan", [serializedplan])
.option("sparkschema", [sparkschema])
.option("numpartitions", [numpartitions])
.option("optimizationlevel", [optimizationlevel])
.option("readrowsendpointparams", [readrowsendpointparams])
.option("initializereadapipath", [initializereadapipath])

.load();
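Since the connector can also be used from Python scripts, the Java examples above can be approximated in PySpark. This is a sketch: the host, credentials, and collection names are placeholders, and the Spark calls themselves are commented out because they require a live Spark session with the connector JAR on the classpath. Collecting the options into a dict uses the standard DataFrameReader/DataFrameWriter `.options(**dict)` API and makes it easy to move values into AWS Secrets Manager later.

```python
# Placeholder connection options for an ingest job (not real credentials).
ingest_options = {
    "mlHost": "example.host.a.marklogicsvc.com",  # placeholder hostname
    "mlUsername": "my-service-user",              # placeholder
    "mlPassword": "my-service-password",          # placeholder
    # "hubDhs": "false",  # uncomment for localhost / on-premises environments
    "collections": "customers,2021-import",
    "uriprefix": "/customer/",
}

# With a SparkSession available and the connector JAR on the classpath:
# df.write \
#     .format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource") \
#     .options(**ingest_options) \
#     .save()
```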

Connector Options: Ingest and Export

Ingest Options

The following options are used to ingest data into a data target.

Note: If the job fails, the job record status is finished_with_errors. If the job is aborted, the job record status is canceled.

Ingest options are available for use with all MarkLogic connectors unless stated otherwise.

REQUIRED Ingest Options
Option Description
mlHost The hostname of your service. Example: [*********].[***********].a.marklogicsvc.com
Tip: To locate the hostname of your service, you can click one of the endpoints and copy the hostname from your web browser. For details, see Access Private Endpoints or Access Public Endpoints.

If connecting to an on-premises environment, enter the MarkLogic Server host where Data Hub is installed (or will be installed).

  • mlUsername
  • mlPassword
The username and password associated with a service user account assigned data-hub-operator or any service role that inherits it. To learn more, see MarkLogic Data Hub Users and Roles.
className The name of the class in the MarkLogic Connector to which AWS Glue passes the dataframe.

The value must be:

 com.marklogic.hub.spark.sql.sources.v2.DefaultSource

For AWS Glue environment only.

Important: The className option is required for any connector used in the AWS Glue environment.
format The name of the Java class in the MarkLogic Connector from which Spark reads the data.

The value must be:

 com.marklogic.hub.spark.sql.sources.v2.DefaultSource

For Apache Spark connector only.

secretId The secret name of the key-value pairs stored in AWS Secrets Manager. The secretId option is only required if you stored options in AWS Secrets Manager.
Note: If you stored options in AWS Secrets Manager, do not also set those options in your AWS Glue scripts.

For AWS Glue connector only.

OPTIONAL Ingest Options
Option Description
hubDhs The default value is true.

If connecting to an on-premises environment, must be set to false.

Important: The value of the hubDhs option also sets the value of the hubSsl option. For example, if hubDhs is false, then hubSsl is also false. If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
hubSsl The default value is true.

If your Data Hub project does not require SSL/TLS, must be set to false.

Important: If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
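The hubDhs/hubSsl interplay described above (hubSsl follows hubDhs unless hubSsl is set explicitly) can be sketched as a small helper. This function is illustrative only and is not part of the connector API.

```python
def effective_ssl(options: dict) -> bool:
    """Return the effective hubSsl value for a set of connector options."""
    hub_dhs = str(options.get("hubDhs", "true")).lower() == "true"
    if "hubSsl" in options:
        # An explicit hubSsl overrides the value implied by hubDhs.
        return str(options["hubSsl"]).lower() == "true"
    # Otherwise hubSsl takes the same value as hubDhs.
    return hub_dhs

print(effective_ssl({}))                                     # defaults: True
print(effective_ssl({"hubDhs": "false"}))                    # follows hubDhs: False
print(effective_ssl({"hubDhs": "false", "hubSsl": "true"}))  # explicit override: True
```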
mlStagingPort The connector ingests data using the app server associated with the value of the mlStagingPort option. The default value is 8010, which is the port used by the STAGING app server. To ingest data into a different database, set mlStagingPort to the port of the app server associated with the desired target database. For the full list of app servers and port numbers, see Provisioned App Servers.
permissions The roles and capabilities used to set document permissions for each ingested document. Example: [Role1],[Capability1],[Role2],[Capability2]

The default document permissions are data-hub-operator,read,data-hub-operator,update. For a list of acceptable roles, see MarkLogic Data Hub Users and Roles.

The capabilities you can set for the permissions option are read, update, node-update, insert, and execute. To learn more about capabilities, see Document Permissions in the Security Guide.
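A permissions string of the [Role1],[Capability1],[Role2],[Capability2] form can be assembled and sanity-checked as sketched below. The helper is hypothetical (not part of the connector); the capability set comes from the list above.

```python
# Capabilities the permissions option accepts, per the documentation.
ALLOWED_CAPABILITIES = {"read", "update", "node-update", "insert", "execute"}

def permissions_option(pairs):
    """Build a role,capability,role,capability,... string from (role, capability) pairs."""
    for role, capability in pairs:
        if capability not in ALLOWED_CAPABILITIES:
            raise ValueError(f"unsupported capability: {capability}")
    return ",".join(f"{role},{capability}" for role, capability in pairs)

# The documented default permissions, built from pairs:
default_perms = permissions_option([
    ("data-hub-operator", "read"),
    ("data-hub-operator", "update"),
])
print(default_perms)  # data-hub-operator,read,data-hub-operator,update
```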

additionalexternalmetadata The metadata of the ingested documents. The value must be a stringified JSON object or array. The values of this option will be added as child elements to the externalMetadata property in the job record.
collections The collections to which the ingested documents will be attached.
Important: If you enter more than one collection, you must separate each collection using a comma. Example: [Collection1],[Collection2]
uriprefix The URI prefix of the documents ingested into MarkLogic. Use this option if you want random URIs with a prefix that is easy to find. The URIs are in the following format: [uriprefix][UUID].json.
Note: If you do not set the uriprefix or uritemplate options, the URIs will be in the following format: [UUID].json.
uritemplate The custom URI of the documents ingested into MarkLogic. Use this option if you want predictable URIs for your ingested documents based on column values defined in your schema. Example: /Customer/{DepartmentName}/{CustomerId}.json
Important: If you set both the uriprefix and uritemplate options, the uritemplate option will have priority. The URI of the documents will follow the format defined using the uritemplate option.
Note: If you do not set the uriprefix or uritemplate options, the URIs will be in the following format: [UUID].json.
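The three URI behaviors described above (uritemplate, uriprefix, and the default) can be sketched in Python. The column names, template, and prefix are illustrative placeholders, not connector internals.

```python
import uuid

# A row of ingested data with the columns the template references.
row = {"DepartmentName": "Sales", "CustomerId": 42}

# uritemplate: predictable URIs built from column values.
uritemplate = "/Customer/{DepartmentName}/{CustomerId}.json"
templated_uri = uritemplate.format(**row)
print(templated_uri)  # /Customer/Sales/42.json

# uriprefix: an easy-to-find prefix followed by a random UUID.
uriprefix = "/customer/"
prefixed_uri = f"{uriprefix}{uuid.uuid4()}.json"

# Neither option set: [UUID].json
default_uri = f"{uuid.uuid4()}.json"
```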
sourcename The custom source name for each ingested document. Use this option to describe the source of your ingested documents. The value of the sourcename option will be added to documents as the key datahubSourceName. Example: CRM System
sourcetype The custom source type for each ingested document. Use this option to describe the data in your ingested documents. The value of the sourcetype option will be added to documents as the key datahubSourceType. Example: Customer Record
writerecordsendpointparams The JSON object that calls the custom endpoint module. This option enables you to use a custom endpoint module that writes one or more JSON documents to MarkLogic. The connector will call the endpoint module associated with the custom API definition, instead of calling the default writeRecords endpoint module. The custom endpoint module must be installed in the data-hub-MODULES database. The value must be a JSON object.

The JSON object can have the following keys:

  • apiPath: This key is required. The URI of the custom API definition installed in the data-hub-MODULES database. Example: /marklogic-glue-connector/customBulkIngester.api
  • endpointState: This key is optional. The properties that change with each call to the endpoint module. Content added to this key will be passed to the endpoint module associated with the apiPath key.
  • endpointConstants: This key is optional. The properties that remain the same over the series of calls to the endpoint module. Content added to this key will be passed to the endpoint module associated with the apiPath key.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Important: If you set the writerecordsendpointparams option, the permissions, collections, uriprefix, uritemplate, sourcename, and sourcetype options will not work as expected. Do not set these options if you set the writerecordsendpointparams option.
Important: If the endpointState and/or endpointConstants keys are set but the apiPath key is not set, an exception will be thrown.
Note: To learn more about using this option, see Data Services for IO.
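A writerecordsendpointparams value might be built as sketched below. The apiPath and endpointConstants content are hypothetical; the key names and the stringified-JSON requirement come from the description above, and the apiPath check mirrors the documented exception.

```python
import json

write_params = {
    "apiPath": "/marklogic-glue-connector/customBulkIngester.api",  # required key
    "endpointConstants": {"batchLabel": "2021-06-import"},          # optional key
    # "endpointState": {...},                                       # optional key
}

if "apiPath" not in write_params:
    # Mirrors the documented behavior: endpointState/endpointConstants
    # without apiPath causes an exception.
    raise ValueError("apiPath is required")

# The option value must be a stringified JSON object:
option_value = json.dumps(write_params)
```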
initializewriteapipath The URI of the custom initializeWrite.api endpoint module. The option enables you to use a custom endpoint module (initializeWrite.api and initializeWrite.sjs) for creating a job record in the data-hub-JOBS database before writing documents to MarkLogic. The connector will call the endpoint module associated with the custom API definition (initializeWrite.api), instead of calling the default initializeWrite endpoint module. The value must be the URI of initializeWrite.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/initializeWrite.api

The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be:

 initializeWrite.api
 initializeWrite.sjs
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.
finalizewriteapipath The URI of the custom finalizeWrite.api endpoint module. The option enables you to use a custom endpoint module (finalizeWrite.api and finalizeWrite.sjs) for updating a job record created in the data-hub-JOBS database. The connector will call the endpoint module associated with the custom API definition (finalizeWrite.api), instead of calling the default finalizeWrite endpoint module. The value must be the URI of finalizeWrite.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/finalizeWrite.api

The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be:

 finalizeWrite.api
 finalizeWrite.sjs
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.

Export Options

The following options are used to export data from a data source.

Export options are available for use with all MarkLogic connectors unless stated otherwise.

REQUIRED Export Options
Option Description
mlHost The hostname of your service. Example: [*********].[***********].a.marklogicsvc.com
Tip: To locate the hostname of your service, you can click one of the endpoints and copy the hostname from your web browser. For details, see Access Private Endpoints or Access Public Endpoints.

If connecting to an on-premises environment, enter the MarkLogic Server host where Data Hub is installed (or will be installed).

  • mlUsername
  • mlPassword
The username and password associated with a service user account assigned data-hub-operator or any service role that inherits it. To learn more, see MarkLogic Data Hub Users and Roles.
className The name of the class in the MarkLogic Connector to which AWS Glue passes the dataframe.

The value must be:

 com.marklogic.hub.spark.sql.sources.v2.DefaultSource

For AWS Glue environment only.

Important: The className option is required for any connector used in the AWS Glue environment.
format The name of the Java class in the MarkLogic Connector from which Spark reads the data.

The value must be:

 com.marklogic.hub.spark.sql.sources.v2.DefaultSource

For Apache Spark connector only.

secretId The secret name of the key-value pairs stored in AWS Secrets Manager. The secretId option is only required if you stored options in AWS Secrets Manager.
Note: If you stored options in AWS Secrets Manager, do not also set those options in your AWS Glue scripts.

For AWS Glue connector only.

view The Template Driven Extraction (TDE) view name from where the rows are retrieved.

The view document must be installed in the schemas database.

To learn more about TDE, see Template Driven Extraction (TDE) in the Application Developer's Guide.

Important: This option is not required if you set the serializedplan or readrowsendpointparams option.
OPTIONAL Export Options
Option Description
hubDhs The default value is true.

If connecting to an on-premises environment, must be set to false.

Important: The value of the hubDhs option also sets the value of the hubSsl option. For example, if hubDhs is false, then hubSsl is also false. If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
hubSsl The default value is true.

If your Data Hub project does not require SSL/TLS, must be set to false.

Important: If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
mlFinalPort The connector exports data using the app server associated with the value of the mlFinalPort option. The default value is 8011, which is the port used by the FINAL app server. To export data from a different database, set mlFinalPort to the port of the app server associated with the desired database. For the full list of app servers and port numbers, see Provisioned App Servers.
schema The schema of the data you want to read from MarkLogic. The value must be a string. Example: Customer

The schema must be installed in the schemas database associated with the port number declared using the mlFinalPort option.

For details about which app servers and databases are associated, see Servers and Databases.

sqlcondition SQL conditions for filtering data by column names. The value must be a string. Example: customerId > 3 and customerId < 7
selectedcolumns The columns that are shown in the output. The value must be a string. Example: customerId
Important: If the value is not a column, an exception will be thrown.
serializedplan The MarkLogic Optic API plan. You can set this option instead of setting the view, schema, sqlcondition, and selectedcolumns options. The value must be a valid JSON file. Example: CustomerIdLessThanFive.json

To learn more about MarkLogic Optic API, see Optic API for Multi-Model Data Access in the Application Developer's Guide.

Note: If you set the serializedplan option, do not also set any option whose value is already declared in the serialized plan; otherwise, an exception will be thrown. For example, if the serialized plan declares a view, do not also set the view option.
Note: If you do not set the serializedplan option, the endpoint module will dynamically build a MarkLogic Optic API plan.
sparkschema The Spark schema of the data you want to read from MarkLogic. The value must be a stringified JSON object.
Note: If you do not set the sparkschema option, the endpoint module will dynamically build a Spark schema based on the MarkLogic Optic API plan.
numpartitions The number of partitions created in your Apache Spark cluster. This option determines the number of partition readers that will query MarkLogic. The more partitions reading from MarkLogic, the less time is needed to read all of the data. The value must be an integer that is greater than or equal to 1.
optimizationlevel The optimization level you want to use for the query. The value must be 0, 1, or 2.
Important: For the MarkLogic Connector for AWS Glue only, the value of the option:
  • must be a string if using Python.
  • can be a string or integer if using Scala.
readrowsendpointparams The JSON object that calls the custom read endpoint module. This option enables you to use a custom endpoint module (API and SJS files) that reads one or more JSON documents from MarkLogic. The connector will call the endpoint module associated with the custom API definition, instead of calling the default readRecords endpoint module. The custom endpoint module must be installed in the data-hub-MODULES database. The value must be a JSON object.

The JSON object can have the following keys:

  • apiPath: This key is required. The URI of the custom API definition installed in the data-hub-MODULES database. Example: /marklogic-spark-connector/customExporter.api
  • endpointState: This key is optional. The properties that change with each call to the endpoint module. Content added to this key will be passed to the endpoint module associated with the apiPath key.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Important: If the endpointState key is set but the apiPath key is not set, an exception will be thrown.
Note: To learn more about using this option, see Data Services for IO.
initializereadapipath The URI of the custom initializeRead.api endpoint module. The option enables you to use a custom endpoint module (initializeRead.api and initializeRead.sjs) for determining the schema. The connector will call the endpoint module associated with the custom API definition (initializeRead.api), instead of calling the default initializeRead endpoint module. The value must be the URI of initializeRead.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/initializeRead.api

The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be:

 initializeRead.api
 initializeRead.sjs
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.