MarkLogic Data Hub Service Tools

MarkLogic Data Hub Service tools that you can use with your service.

MarkLogic Connectors

MarkLogic Connector for AWS Glue

Version: 1.0.0

Version Compatibility

MarkLogic Connector for AWS Glue version: 1.0.0
  • Environment: DHS
    Note: 10.0-5 up to the latest 10.x release is required to use export functionality.
  • Environment: On-Premises
    Note: 10.0-5 up to the latest 10.x release is required to use export functionality.

The MarkLogic Connector for AWS Glue can be used within AWS Glue to ingest data into, and export data from, MarkLogic Server databases. The connector assumes you are hosting your data in MarkLogic Data Hub Service and is configured using AWS Glue Studio. It can be used in AWS Glue scripts written in either Python or Scala, and its options can be stored in AWS Secrets Manager instead of in your AWS Glue scripts. The connector can read from any data source that AWS Glue supports and write to MarkLogic (the data target); it always writes documents in JSON format. The MarkLogic Connector for AWS Glue is available on the AWS Marketplace.

Note: MarkLogic recommends storing the mlUsername and mlPassword options in AWS Secrets Manager.
Note: If hosting your data in your own on-premises servers, you must set the values as indicated in the ingest and export options.
Note: If you do not configure the connector using AWS Glue Studio, you must add your own JAR file as a dependent JAR for your AWS Glue script.

You only need to subscribe to the MarkLogic Connector for AWS Glue once per AWS account. If you change regions, you must create a new connector and connection, but you do not need to resubscribe to the MarkLogic Connector for AWS Glue.

If connecting to a service on private endpoints (a VPC peered network):

  • An Amazon VPC endpoint is required.
  • The connector, connection, and AWS Glue job must be in the same region as your client-side VPC.
  • You must provide additional information about your client-side VPC when you create the connection: the VPC ID (vpc-*), one private subnet CIDR (subnet-*), and one security group (sg-*).
  • You must create an inbound rule for the security group you provided. The inbound rule must have the following options:
    • Type: All Traffic
    • Protocol: All
    • Port Range: All
    • Source: The ID of the security group (sg-*)
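The inbound rule above can be sketched with boto3. This is illustrative only: the security group ID is a placeholder, and the actual API call is commented out because it requires AWS credentials. `IpProtocol` of `"-1"` is how the EC2 API expresses "All Traffic" with all protocols and ports.

```python
security_group_id = "sg-0123456789abcdef0"  # placeholder: the security group you provided

ingress_params = {
    "GroupId": security_group_id,
    "IpPermissions": [
        {
            "IpProtocol": "-1",  # Type: All Traffic, Protocol: All, Port Range: All
            "UserIdGroupPairs": [
                {"GroupId": security_group_id}  # Source: the same security group's ID
            ],
        }
    ],
}

# Applying the rule requires AWS credentials, e.g.:
# import boto3
# boto3.client("ec2").authorize_security_group_ingress(**ingress_params)
```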

The Data Hub properties can also be set when using the connector. For the full list of Data Hub properties, see Data Hub Properties.

Note: By default, the MarkLogic Connector for AWS Glue assumes it is connecting to MarkLogic Data Hub Service. The connector assumes that basic authentication and Secure Socket Layer (SSL)/Transport Layer Security (TLS) are used, mlIsProvisionedEnvironment is true, mlStagingPort is 8010, and hubDhs is true.

Ingest and Export options can be set when using the connector. For the full list of ingest and export options, see Connector Options: Ingest and Export.

Important: When using the connector with MarkLogic Data Hub Service, you must set all required ingest and export options. If you stored any options in AWS Secrets Manager, you must set the secretId option.
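One way to set up the recommended Secrets Manager storage is sketched below. The secret name, username, and password are placeholders; the key names (mlUsername, mlPassword) match the connector options above, and the secret creation itself is commented out because it requires AWS credentials.

```python
import json

# Hypothetical secret name; pass this to the connector as the secretId option.
secret_id = "dhs-glue-connector-credentials"

# Key-value pairs the connector reads from the secret instead of the script.
secret_string = json.dumps({
    "mlUsername": "my-service-user",      # placeholder
    "mlPassword": "my-service-password",  # placeholder
})

# Creating the secret requires AWS credentials, e.g.:
# import boto3
# boto3.client("secretsmanager").create_secret(Name=secret_id, SecretString=secret_string)
```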

MarkLogic Connector for Apache Spark

Version: 1.0.0

Version Compatibility

MarkLogic Connector for Apache Spark version: 1.0.0
  • Environment: DHS
    Note: 10.0-5 up to the latest 10.x release is required to use export functionality.
  • Environment: On-Premises
    Note:
      • If using Linux, 10.0-5 up to the latest 10.x release is required to use export functionality.
      • If using Windows or Mac, 10.0-5 or 10.0-5.1 is required to use export functionality.

The MarkLogic Connector for Apache Spark can be used to ingest data into, and export data from, MarkLogic Server databases. The connector assumes you are hosting your data in MarkLogic Data Hub Service. It can be used in AWS Glue scripts written in either Python or Scala, can read from any data source that Apache Spark supports, and writes to MarkLogic (the data target) always as JSON documents. To begin using the connector, download the JAR file and copy it to your Spark environment. For instructions on configuring and using the MarkLogic Connector for Apache Spark in AWS Glue, see Getting Started with the MarkLogic Connector for Apache Spark.

Note: For an example MarkLogic Connector for Apache Spark project in a local Spark environment, see Spark Test Project.
Note: If hosting your data in your own on-premises servers, you must set the values as indicated in the ingest and export options.

The Data Hub properties can also be set when using the connector. For the full list of Data Hub properties, see Data Hub Properties.

Note: By default, the MarkLogic Connector for Apache Spark assumes it is connecting to MarkLogic Data Hub Service. The connector assumes that basic authentication and Secure Socket Layer (SSL)/Transport Layer Security (TLS) are used, mlIsProvisionedEnvironment is true, mlStagingPort is 8010, and hubDhs is true.

Ingest and Export options can be set when using the connector. For the full list of ingest and export options, see Connector Options: Ingest and Export.

Important: When using the connector with MarkLogic Data Hub Service, you must set all required ingest and export options.
Example: Using Ingest Options
 writer = [DataFrame].write.format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")

// Required options
.option("mlHost", [Host])
.option("mlUsername", [Username])
.option("mlPassword", [Password])

// Set the following option to false if you are trying on localhost
// .option("hubDhs", "false")

// Optional options
.option("mlStagingPort", [mlstagingport])
.option("permissions", [permissions])
.option("additionalexternalmetadata", [additionalexternalmetadata])
.option("collections", [collections])
.option("uriprefix", [uriprefix])
.option("uritemplate", [uritemplate])
.option("sourcename", [sourcename])
.option("sourcetype", [sourcetype])
.option("writerecordsendpointparams", [writerecordsendpointparams])
.option("initializewriteapipath", [initializewriteapipath])
.option("finalizewriteapipath", [finalizewriteapipath])

.save();
Example: Using Export Options
 SparkSession session = SparkSession.builder().getOrCreate();
SQLContext sqlContext = new SQLContext(session);

Dataset<Row> rows = sqlContext.read()

.format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")

// Required options
.option("mlHost", [Host])
.option("mlUsername", [Username])
.option("mlPassword", [Password])
.option("view", [View])

// Set the following option to false if you are trying on localhost
// .option("hubDhs", "false")

// Optional options
.option("schema", [schema])
.option("sqlcondition", [sqlcondition])
.option("selectedcolumns", [selectedcolumns])
.option("serializedplan", [serializedplan])
.option("sparkschema", [sparkschema])
.option("numpartitions", [numpartitions])
.option("optimizationlevel", [optimizationlevel])
.option("readrowsendpointparams", [readrowsendpointparams])
.option("initializereadapipath", [initializereadapipath])

.load();
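Since the connector can also be used from Python scripts, the Java examples above can be approximated in PySpark. This is a sketch: the host, credentials, and collection names are placeholders, and the Spark calls themselves are commented out because they require a live Spark session with the connector JAR on the classpath. Collecting the options into a dict uses the standard DataFrameReader/DataFrameWriter `.options(**dict)` API and makes it easy to move values into AWS Secrets Manager later.

```python
# Placeholder connection options for an ingest job (not real credentials).
ingest_options = {
    "mlHost": "example.host.a.marklogicsvc.com",  # placeholder hostname
    "mlUsername": "my-service-user",              # placeholder
    "mlPassword": "my-service-password",          # placeholder
    # "hubDhs": "false",  # uncomment for localhost / on-premises environments
    "collections": "customers,2021-import",
    "uriprefix": "/customer/",
}

# With a SparkSession available and the connector JAR on the classpath:
# df.write \
#     .format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource") \
#     .options(**ingest_options) \
#     .save()
```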

Connector Options: Ingest and Export

Ingest Options

The following options are used to ingest data into a data target.

Note: If the job fails, the job record status is finished_with_errors. If the job is aborted, the job record status is canceled.

Ingest options are available for use with all MarkLogic connectors unless stated otherwise.

REQUIRED Ingest Options
Option Description
mlHost The hostname of your service. Example: [*********].[***********].a.marklogicsvc.com
Tip: To locate the hostname of your service, you can click one of the endpoints and copy the hostname from your web browser. For details, see Access Private Endpoints or Access Public Endpoints.

If connecting to an on-premises environment, enter the MarkLogic Server host where Data Hub is installed (or will be installed).

  • mlUsername
  • mlPassword
The username and password associated with a service user account assigned data-hub-operator or any service role that inherits it. To learn more, see MarkLogic Data Hub Users and Roles.
className The name of the class in the MarkLogic Connector to which AWS Glue passes the dataframe.

The value must be:

 com.marklogic.hub.spark.sql.sources.v2.DefaultSource

For AWS Glue environment only.

Important: The className option is required for any connector used in the AWS Glue environment.
format The name of the Java class in the MarkLogic Connector from which Spark reads the data.

The value must be:

 com.marklogic.hub.spark.sql.sources.v2.DefaultSource

For Apache Spark connector only.

secretId The secret name of the key-value pairs stored in AWS Secrets Manager. The secretId option is only required if you stored options in AWS Secrets Manager.
Note: If you stored options in AWS Secrets Manager, do not also set those options in your AWS Glue scripts.

For AWS Glue connector only.

OPTIONAL Ingest Options
Option Description
hubDhs The default value is true.

If connecting to an on-premises environment, must be set to false.

Important: The value of the hubDhs option also sets the value of the hubSsl option. For example, if hubDhs is false, then hubSsl is also false. If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
hubSsl The default value is true.

If your Data Hub project does not require SSL/TLS, must be set to false.

Important: If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
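The hubDhs/hubSsl interplay described above (hubSsl follows hubDhs unless hubSsl is set explicitly) can be sketched as a small helper. This function is illustrative only and is not part of the connector API.

```python
def effective_ssl(options: dict) -> bool:
    """Return the effective hubSsl value for a set of connector options."""
    hub_dhs = str(options.get("hubDhs", "true")).lower() == "true"
    if "hubSsl" in options:
        # An explicit hubSsl overrides the value implied by hubDhs.
        return str(options["hubSsl"]).lower() == "true"
    # Otherwise hubSsl takes the same value as hubDhs.
    return hub_dhs

print(effective_ssl({}))                                     # defaults: True
print(effective_ssl({"hubDhs": "false"}))                    # follows hubDhs: False
print(effective_ssl({"hubDhs": "false", "hubSsl": "true"}))  # explicit override: True
```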
mlStagingPort The connector ingests data using the app server associated with the value of the mlStagingPort option. The default value is 8010, which is the port used by the STAGING app server. To ingest data into a different database, set mlStagingPort to the port of the app server associated with the desired target database. For the full list of app servers and port numbers, see Provisioned App Servers.
permissions The roles and capabilities used to set document permissions for each ingested document. Example: [Role1],[Capability1],[Role2],[Capability2]

The default document permissions are data-hub-operator,read,data-hub-operator,update. For a list of acceptable roles, see MarkLogic Data Hub Users and Roles.

The capabilities you can set for the permissions option are read, update, node-update, insert, and execute. To learn more about capabilities, see Document Permissions in the Security Guide.
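A permissions string of the [Role1],[Capability1],[Role2],[Capability2] form can be assembled and sanity-checked as sketched below. The helper is hypothetical (not part of the connector); the capability set comes from the list above.

```python
# Capabilities the permissions option accepts, per the documentation.
ALLOWED_CAPABILITIES = {"read", "update", "node-update", "insert", "execute"}

def permissions_option(pairs):
    """Build a role,capability,role,capability,... string from (role, capability) pairs."""
    for role, capability in pairs:
        if capability not in ALLOWED_CAPABILITIES:
            raise ValueError(f"unsupported capability: {capability}")
    return ",".join(f"{role},{capability}" for role, capability in pairs)

# The documented default permissions, built from pairs:
default_perms = permissions_option([
    ("data-hub-operator", "read"),
    ("data-hub-operator", "update"),
])
print(default_perms)  # data-hub-operator,read,data-hub-operator,update
```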

additionalexternalmetadata The metadata of the ingested documents. The value must be a stringified JSON object or array. The values of this option will be added as child elements to the externalMetadata property in the job record.
collections The collections to which the ingested documents will be attached.
Important: If you enter more than one collection, you must separate each collection using a comma. Example: [Collection1],[Collection2]
uriprefix The URI prefix of the documents ingested into MarkLogic. Use this option if you want random URIs with a prefix that is easy to find. The URIs are in the following format: [uriprefix][UUID].json.
Note: If you do not set the uriprefix or uritemplate options, the URIs will be in the following format: [UUID].json.
uritemplate The custom URI of the documents ingested into MarkLogic. Use this option if you want predictable URIs for your ingested documents based on column values defined in your schema. Example: /Customer/{DepartmentName}/{CustomerId}.json
Important: If you set both the uriprefix and uritemplate options, the uritemplate option will have priority. The URI of the documents will follow the format defined using the uritemplate option.
Note: If you do not set the uriprefix or uritemplate options, the URIs will be in the following format: [UUID].json.
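The three URI behaviors described above (uritemplate, uriprefix, and the default) can be sketched in Python. The column names, template, and prefix are illustrative placeholders, not connector internals.

```python
import uuid

# A row of ingested data with the columns the template references.
row = {"DepartmentName": "Sales", "CustomerId": 42}

# uritemplate: predictable URIs built from column values.
uritemplate = "/Customer/{DepartmentName}/{CustomerId}.json"
templated_uri = uritemplate.format(**row)
print(templated_uri)  # /Customer/Sales/42.json

# uriprefix: an easy-to-find prefix followed by a random UUID.
uriprefix = "/customer/"
prefixed_uri = f"{uriprefix}{uuid.uuid4()}.json"

# Neither option set: [UUID].json
default_uri = f"{uuid.uuid4()}.json"
```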
sourcename The custom source name for each ingested document. Use this option to describe the source of your ingested documents. The value of the sourcename option will be added to documents as the key datahubSourceName. Example: CRM System
sourcetype The custom source type for each ingested document. Use this option to describe the data in your ingested documents. The value of the sourcetype option will be added to documents as the key datahubSourceType. Example: Customer Record
writerecordsendpointparams The JSON object that calls the custom endpoint module. This option enables you to use a custom endpoint module that writes one or more JSON documents to MarkLogic. The connector will call the endpoint module associated with the custom API definition, instead of calling the default writeRecords endpoint module. The custom endpoint module must be installed in the data-hub-MODULES database. The value must be a JSON object.

The JSON object can have the following keys:

  • apiPath: This key is required. The URI of the custom API definition installed in the data-hub-MODULES database. Example: /marklogic-glue-connector/customBulkIngester.api
  • endpointState: This key is optional. The properties that change with each call to the endpoint module. Content added to this key will be passed to the endpoint module associated with the apiPath key.
  • endpointConstants: This key is optional. The properties that remain the same over the series of calls to the endpoint module. Content added to this key will be passed to the endpoint module associated with the apiPath key.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Important: If you set the writerecordsendpointparams option, the permissions, collections, uriprefix, uritemplate, sourcename, and sourcetype options will not work as expected. Do not set these options if you set the writerecordsendpointparams option.
Important: If the endpointState and/or endpointConstants keys are set but the apiPath key is not set, an exception will be thrown.
Note: To learn more about using this option, see Data Services for IO.
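A writerecordsendpointparams value might be built as sketched below. The apiPath and endpointConstants content are hypothetical; the key names and the stringified-JSON requirement come from the description above, and the apiPath check mirrors the documented exception.

```python
import json

write_params = {
    "apiPath": "/marklogic-glue-connector/customBulkIngester.api",  # required key
    "endpointConstants": {"batchLabel": "2021-06-import"},          # optional key
    # "endpointState": {...},                                       # optional key
}

if "apiPath" not in write_params:
    # Mirrors the documented behavior: endpointState/endpointConstants
    # without apiPath causes an exception.
    raise ValueError("apiPath is required")

# The option value must be a stringified JSON object:
option_value = json.dumps(write_params)
```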
initializewriteapipath The URI of the custom initializeWrite.api endpoint module. The option enables you to use a custom endpoint module (initializeWrite.api and initializeWrite.sjs) for creating a job record in the data-hub-JOBS database before writing documents to MarkLogic. The connector will call the endpoint module associated with the custom API definition (initializeWrite.api), instead of calling the default initializeWrite endpoint module. The value must be the URI of initializeWrite.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/initializeWrite.api

The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be:

 initializeWrite.api
 initializeWrite.sjs
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.
finalizewriteapipath The URI of the custom finalizeWrite.api endpoint module. The option enables you to use a custom endpoint module (finalizeWrite.api and finalizeWrite.sjs) for updating a job record created in the data-hub-JOBS database. The connector will call the endpoint module associated with the custom API definition (finalizeWrite.api), instead of calling the default finalizeWrite endpoint module. The value must be the URI of finalizeWrite.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/finalizeWrite.api

The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be:

 finalizeWrite.api
 finalizeWrite.sjs
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.

Export Options

The following options are used to export data from a data source.

Export options are available for use with all MarkLogic connectors unless stated otherwise.

REQUIRED Export Options
Option Description
mlHost The hostname of your service. Example: [*********].[***********].a.marklogicsvc.com
Tip: To locate the hostname of your service, you can click one of the endpoints and copy the hostname from your web browser. For details, see Access Private Endpoints or Access Public Endpoints.

If connecting to an on-premises environment, enter the MarkLogic Server host where Data Hub is installed (or will be installed).

  • mlUsername
  • mlPassword
The username and password associated with a service user account assigned data-hub-operator or any service role that inherits it. To learn more, see MarkLogic Data Hub Users and Roles.
className The name of the class in the MarkLogic Connector to which AWS Glue passes the dataframe.

The value must be:

 com.marklogic.hub.spark.sql.sources.v2.DefaultSource

For AWS Glue environment only.

Important: The className option is required for any connector used in the AWS Glue environment.
format The name of the Java class in the MarkLogic Connector from which Spark reads the data.

The value must be:

 com.marklogic.hub.spark.sql.sources.v2.DefaultSource

For Apache Spark connector only.

secretId The secret name of the key-value pairs stored in AWS Secrets Manager. The secretId option is only required if you stored options in AWS Secrets Manager.
Note: If you stored options in AWS Secrets Manager, do not also set those options in your AWS Glue scripts.

For AWS Glue connector only.

view The Template Driven Extraction (TDE) view name from where the rows are retrieved.

The view document must be installed in the schemas database.

To learn more about TDE, see Template Driven Extraction (TDE) in the Application Developer's Guide.

Important: This option is not required if you set the serializedplan or readrowsendpointparams option.
OPTIONAL Export Options
Option Description
hubDhs The default value is true.

If connecting to an on-premises environment, must be set to false.

Important: The value of the hubDhs option also sets the value of the hubSsl option. For example, if hubDhs is false, then hubSsl is also false. If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
hubSsl The default value is true.

If your Data Hub project does not require SSL/TLS, must be set to false.

Important: If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
mlFinalPort The connector exports data using the app server associated with the value of the mlFinalPort option. The default value is 8011, which is the port used by the FINAL app server. To export data from a different database, set mlFinalPort to the port of the app server associated with the desired database. For the full list of app servers and port numbers, see Provisioned App Servers.
schema The schema of the data you want to read from MarkLogic. The value must be a string. Example: Customer

The schema must be installed in the schemas database associated with the port number declared using the mlFinalPort option.

For details about which app servers and databases are associated, see Servers and Databases.

sqlcondition SQL conditions for filtering data by column names. The value must be a string. Example: customerId > 3 and customerId < 7
selectedcolumns The columns that are shown in the output. The value must be a string. Example: customerId
Important: If the value is not a column, an exception will be thrown.
serializedplan The MarkLogic Optic API plan. You can set this option instead of setting the view, schema, sqlcondition, and selectedcolumns options. The value must be a valid JSON file. Example: CustomerIdLessThanFive.json

To learn more about MarkLogic Optic API, see Optic API for Multi-Model Data Access in the Application Developer's Guide.

Note: If you set the serializedplan option, do not also set any option whose value is already declared in the serialized plan; otherwise, an exception will be thrown. For example, if the serialized plan declares a view, do not also set the view option.
Note: If you do not set the serializedplan option, the endpoint module will dynamically build a MarkLogic Optic API plan.
sparkschema The Spark schema of the data you want to read from MarkLogic. The value must be a stringified JSON object.
Note: If you do not set the sparkschema option, the endpoint module will dynamically build a Spark schema based on the MarkLogic Optic API plan.
numpartitions The number of partitions created in your Apache Spark cluster. This option determines the number of partition readers that will query MarkLogic. The more partitions reading from MarkLogic, the less time is needed to read all of the data. The value must be an integer that is greater than or equal to 1.
optimizationlevel The optimization level you want to use for the query. The value must be 0, 1, or 2.
Important: For the MarkLogic Connector for AWS Glue only, the value of the option:
  • must be a string if using Python.
  • can be a string or integer if using Scala.
readrowsendpointparams The JSON object that calls the custom read endpoint module. This option enables you to use a custom endpoint module (API and SJS files) that reads one or more JSON documents from MarkLogic. The connector will call the endpoint module associated with the custom API definition, instead of calling the default readRecords endpoint module. The custom endpoint module must be installed in the data-hub-MODULES database. The value must be a JSON object.

The JSON object can have the following keys:

  • apiPath: This key is required. The URI of the custom API definition installed in the data-hub-MODULES database. Example: /marklogic-spark-connector/customExporter.api
  • endpointState: This key is optional. The properties that change with each call to the endpoint module. Content added to this key will be passed to the endpoint module associated with the apiPath key.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Important: If the endpointState key is set but the apiPath key is not set, an exception will be thrown.
Note: To learn more about using this option, see Data Services for IO.
initializereadapipath The URI of the custom initializeRead.api endpoint module. The option enables you to use a custom endpoint module (initializeRead.api and initializeRead.sjs) for determining the schema. The connector will call the endpoint module associated with the custom API definition (initializeRead.api), instead of calling the default initializeRead endpoint module. The value must be the URI of initializeRead.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/initializeRead.api

The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be:

 initializeRead.api
 initializeRead.sjs
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.