MarkLogic Data Hub Service Tools
MarkLogic Data Hub Service tools that you can use with your service.
MarkLogic Connectors
MarkLogic Connector for AWS Glue
Version: 1.0.1
Version Compatibility | | | |
---|---|---|---|
MarkLogic Connector for AWS Glue version | Environment | Supported Data Hub version(s) | Supported MarkLogic Server version(s) |
1.0.1 | DHS | | Note: 10.0-7 up to the latest 10.x release is required to use export functionality. |
1.0.1 | On-Premises | | Note: 10.0-7 up to the latest 10.x release is required to use export functionality. |
1.0.0 | DHS | | Note: 10.0-5 up to the latest 10.x release is required to use export functionality. |
1.0.0 | On-Premises | | Note: 10.0-5 up to the latest 10.x release is required to use export functionality. |
The MarkLogic Connector for AWS Glue can be used within AWS Glue to ingest data into and export data from MarkLogic Server databases. The connector assumes you are hosting your data in MarkLogic Data Hub Service and is configured using AWS Glue Studio. It can be used in AWS Glue scripts written in either Python or Scala, and connection options can be stored in AWS Secrets Manager instead of in your AWS Glue scripts. The connector can read from any data source that AWS Glue supports and write to MarkLogic (the data target); documents are always written in JSON format. The MarkLogic Connector for AWS Glue is available on AWS Marketplace.
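For example, the following sketch (using the boto3 AWS SDK for Python) stores the required connection options as a secret that the connector can later reference through the secretId option described below; the secret name, region, and credential values are placeholders:
import json
import boto3

# A minimal sketch of storing connector options in AWS Secrets Manager.
# The secret name, region, and option values shown here are placeholders.
client = boto3.client("secretsmanager", region_name="us-east-1")
client.create_secret(
    Name="marklogic/glue-connector-options",  # pass this name to the connector as secretId
    SecretString=json.dumps({
        "mlHost": "example.a.marklogicsvc.com",
        "mlUsername": "dh-operator",
        "mlPassword": "********",
    }),
)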
You are only required to subscribe to the MarkLogic Connector for AWS Glue once per AWS account. If you change regions, you must create a new connector and connection, but you are not required to resubscribe to the MarkLogic Connector for AWS Glue.
If connecting to a service on private endpoints (VPC peered network):
- An Amazon VPC endpoint is required.
- The connector, connection, and AWS Glue job must be in the same region as your client-side VPC.
- Provide additional information about your client-side VPC when you create the connection. You must provide the VPC ID (vpc-*), one private subnet CIDR (subnet-*), and one security group (sg-*).
- Create an inbound rule for the security group you provided (a scripted sketch of this rule follows this list). The inbound rule must have the following options:
  - Type: All Traffic
  - Protocol: All
  - Port Range: All
  - Source: The ID of the security group (sg-*)
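The inbound rule described in the last item can also be created with the AWS SDK. The following boto3 sketch assumes a placeholder security group ID and region; it allows all traffic, on all protocols and ports, from the same security group:
import boto3

# A minimal sketch of the inbound rule described above; the security group ID
# and region are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "-1",  # -1 means all protocols and all ports
            "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],
        }
    ],
)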
The Data Hub properties can also be set when using the connector. For the full list of Data Hub properties, see Data Hub Properties. By default, basic authentication and Secure Socket Layer (SSL)/Transport Layer Security (TLS) are used, mlIsProvisionedEnvironment is true, mlStagingPort is 8010, and hubDhs is true.
Ingest and export options can be set when using the connector. For the full list of ingest and export options, see Connector Options: Ingest and Export.
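As an illustration of these defaults, the following is a minimal sketch of an AWS Glue (PySpark) job that ingests data through the connector. The S3 path, host, and credentials are placeholders, and the sketch assumes the connector is available to the job; the default properties (basic authentication, SSL/TLS, hubDhs true, mlStagingPort 8010) apply because no overriding options are set:
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve standard Glue job arguments and build the Spark session.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read from any data source that AWS Glue supports; an S3 JSON path is assumed here.
df = spark.read.json("s3://example-bucket/customers/")

# Write to MarkLogic through the connector using only the required options.
(df.write
    .format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")
    .option("mlHost", "example.a.marklogicsvc.com")
    .option("mlUsername", "dh-operator")
    .option("mlPassword", "********")
    .save())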
MarkLogic Connector for Apache Spark
Version: 1.0.1
Version Compatibility | | | |
---|---|---|---|
MarkLogic Connector for Apache Spark version | Environment | Supported Data Hub version(s) | Supported MarkLogic Server version(s) |
1.0.1 | DHS | | Note: 10.0-7 up to the latest 10.x release is required to use export functionality. |
1.0.1 | On-Premises | | Note: 10.0-7 up to the latest 10.x release is required to use export functionality. |
1.0.0 | DHS | | Note: 10.0-5 up to the latest 10.x release is required to use export functionality. |
1.0.0 | On-Premises | | Note: 10.0-5 up to the latest 10.x release is required to use export functionality. |
The MarkLogic Connector for Apache Spark can be used to ingest data into and export data from MarkLogic Server databases. The connector assumes you are hosting your data in MarkLogic Data Hub Service and can be used in AWS Glue scripts written in either Python or Scala. It can read from any data source that Apache Spark supports and write to MarkLogic (the data target); documents are always written in JSON format. To begin using the MarkLogic Connector for Apache Spark, download the JAR file and copy it to your Spark environment. For instructions on configuring and using the MarkLogic Connector for Apache Spark in AWS Glue, see Getting Started with the MarkLogic Connector for Apache Spark.
The Data Hub properties can also be set when using the connector. For the full list of Data Hub properties, see Data Hub Properties. By default, basic authentication and Secure Socket Layer (SSL)/Transport Layer Security (TLS) are used, mlIsProvisionedEnvironment is true, mlStagingPort is 8010, and hubDhs is true.
Ingest and export options can be set when using the connector. For the full list of ingest and export options, see Connector Options: Ingest and Export.
Example: Using Ingest Options
writer=[DataFrame].write.format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")
// Required options
.option("mlHost", [Host])
.option("mlUsername", [Username])
.option("mlPassword", [Password])
// Set the following option to false if you are testing on localhost
// .option("hubDhs", "false")
// Optional options
.option("mlStagingPort", [mlstagingport])
.option("permissions", [permissions])
.option("additionalexternalmetadata", [additionalexternalmetadata])
.option("collections", [collections])
.option("uriprefix", [uriprefix])
.option("uritemplate", [uritemplate])
.option("sourcename", [sourcename])
.option("sourcetype", [sourcetype])
.option("writerecordsendpointparams", [writerecordsendpointparams])
.option("initializewriteapipath", [initializewriteapipath])
.option("finalizewriteapipath", [finalizewriteapipath])
.save();
Example: Using Export Options
SparkSession session = SparkSession.builder().getOrCreate();
Dataset<Row> rows = session.read()
.format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")
// Required options
.option("mlHost", [Host])
.option("mlUsername", [Username])
.option("mlPassword", [Password])
.option("view", [View])
// Set the following option to false if you are testing on localhost
// .option("hubDhs", "false")
// Optional options
.option("schema", [schema])
.option("sqlcondition", [sqlcondition])
.option("selectedcolumns", [selectedcolumns])
.option("serializedplan", [serializedplan])
.option("sparkschema", [sparkschema])
.option("numpartitions", [numpartitions])
.option("optimizationlevel", [optimizationlevel])
.option("readrowsendpointparams", [readrowsendpointparams])
.option("initializereadapipath", [initializereadapipath])
.load();
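The same export can be expressed in Python. The following PySpark sketch uses placeholder host, credential, view, and filter values, and assumes the connector JAR is on the Spark classpath:
from pyspark.sql import SparkSession

# A minimal PySpark sketch of the export (read) path; host, credentials, view
# name, and filter values are placeholders.
spark = SparkSession.builder.getOrCreate()

rows = (spark.read
    .format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")
    .option("mlHost", "example.a.marklogicsvc.com")
    .option("mlUsername", "dh-operator")
    .option("mlPassword", "********")
    .option("view", "Customer")
    # Optional: filter and project rows on the MarkLogic side.
    .option("sqlcondition", "customerId > 3 and customerId < 7")
    .option("selectedcolumns", "customerId")
    .load())

rows.show()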
Connector Options: Ingest and Export
Ingest Options
The following options are used to ingest data into a data target.
If the ingest job finishes with errors, the job record status is finished_with_errors. If the job is aborted, the job record status is canceled.
Ingest options are available for use with all MarkLogic connectors unless stated otherwise.
REQUIRED Ingest Options | |
---|---|
Option | Description |
mlHost | The hostname of your service. Example: [*********].[***********].a.marklogicsvc.com
Tip: To locate the hostname of your service, you can click one of the endpoints and copy the hostname from your web browser. For details, see Access Private Endpoints or Access Public Endpoints.
If connecting to an on-premises environment, enter the MarkLogic Server host where Data Hub should be installed or is installed. |
mlUsername, mlPassword | The username and password associated with a service user account assigned data-hub-operator or any service role that inherits it. To learn more, see MarkLogic Data Hub Users and Roles. |
className | The name of the class in the MarkLogic Connector to which AWS Glue passes the dataframe.
The value must be:
For AWS Glue environment only. Important: The className option is required for any connector used in the AWS Glue environment.
|
format | The name of the Java class in the MarkLogic Connector from which Spark reads the data.
The value must be: com.marklogic.hub.spark.sql.sources.v2.DefaultSource
For Apache Spark connector only. |
secretId | The secret name of the key-value pairs stored in AWS Secrets Manager. The secretId option is only required if you stored options in AWS Secrets Manager.
Note: If you stored options in AWS Secrets Manager, do not also store these options in your AWS Glue scripts.
For AWS Glue connector only. |
OPTIONAL Ingest Options | |
Option | Description |
hubDhs | The default value is true.
If connecting to an on-premises environment, hubDhs must be set to false. Important: The value of the hubDhs option also sets the value of the hubSsl option. For example, if hubDhs is false, then hubSsl is also false. If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option. |
hubSsl | The default value is true.
If your Data Hub project does not require SSL/TLS, hubSsl must be set to false. Important: If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
|
mlStagingPort | The connector ingests data using the app server associated with the value of the mlStagingPort option. The default value is 8010, which is the port used by the STAGING app server. To ingest data into a different database, set mlStagingPort to the port of the app server associated with the desired target database. For the full list of app servers and port numbers, see Provisioned App Servers. |
permissions | The roles and capabilities used to set document permissions for each ingested document. Example: [Role1],[Capability1],[Role2],[Capability2]
If this option is not set, the default document permissions are used. The capabilities you can set for the permissions option are standard MarkLogic document capabilities. For an illustrative use of this option together with collections and uritemplate, see the sketch after this table. |
additionalexternalmetadata | The metadata of the ingested documents. The value must be a stringified JSON object or array. The values of this option will be added as child elements to the externalMetadata property in the job record. |
collections | The collections to which the ingested documents will be attached.
Important: If you enter more than one collection, you must separate each collection using a comma. Example:
[Collection1],[Collection2] |
uriprefix | The URI prefix of the documents ingested into MarkLogic. Use this option if you want random URIs with a prefix that is easy to find. The URIs are in the following format: [uriprefix][UUID].json .
Note: If you do not set the uriprefix or uritemplate options, the URIs will be in the following format:
[UUID].json . |
uritemplate | The custom URI of the documents ingested into MarkLogic. Use this option if you want predictable URIs for your ingested documents based on column values defined in your schema. Example: /Customer/{DepartmentName}/{CustomerId}.json
Important: If you set both the uriprefix and uritemplate options, the uritemplate option will have priority. The URI of the documents will follow the format defined using the uritemplate option.
Note: If you do not set the uriprefix or uritemplate options, the URIs will be in the following format:
[UUID].json . |
sourcename | The custom source name for each ingested document. Use this option to describe the source of your ingested documents. The value of the sourcename option will be added to documents as the key datahubSourceName. Example: CRM System |
sourcetype | The custom source type for each ingested document. Use this option to describe the data in your ingested documents. The value of the sourcetype option will be added to documents as the key datahubSourceType. Example: Customer Record |
writerecordsendpointparams | The JSON object that calls the custom endpoint module. This option enables you to use a custom endpoint module that writes one or more JSON documents to MarkLogic. The connector will call the endpoint module associated with the custom API definition, instead of calling the default writeRecords endpoint module. The custom endpoint module must be installed in the data-hub-MODULES database. The value must be a JSON object.
The JSON object can have the following keys: apiPath, endpointConstants, and endpointState.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Important: If you set the writerecordsendpointparams option, the permissions, collections, uriprefix, uritemplate, sourcename, and sourcetype options will not work as expected. Do not set these options if you set the writerecordsendpointparams option.
Important: If the endpointState and/or endpointConstants keys are set but the apiPath key is not set, an exception will be thrown.
Note: To learn more about using this option, see Data Services for IO.
|
initializewriteapipath | The URI of the custom initializeWrite.api endpoint module. The option enables you to use a custom endpoint module (initializeWrite.api and initializeWrite.sjs ) for creating a job record in the data-hub-JOBS database before writing documents to MarkLogic. The connector will call the endpoint module associated with the custom API definition (initializeWrite.api ), instead of calling the default initializeWrite endpoint module. The value must be the URI of initializeWrite.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/initializeWrite.api
The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be initializeWrite.api and initializeWrite.sjs.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.
|
finalizewriteapipath | The URI of the custom finalizeWrite.api endpoint module. The option enables you to use a custom endpoint module (finalizeWrite.api and finalizeWrite.sjs ) for updating a job record created in the data-hub-JOBS database. The connector will call the endpoint module associated with the custom API definition (finalizeWrite.api ), instead of calling the default finalizeWrite endpoint module. The value must be the URI of finalizeWrite.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/finalizeWrite.api
The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be finalizeWrite.api and finalizeWrite.sjs.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.
|
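As an illustration of the document-level ingest options, the following PySpark sketch combines permissions, collections, uritemplate, sourcename, and sourcetype; the host, credentials, role names, collection names, and S3 path are placeholders:
from pyspark.sql import SparkSession

# A minimal sketch combining several optional ingest options; all values are placeholders.
spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://example-bucket/customers/")

(df.write
    .format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")
    .option("mlHost", "example.a.marklogicsvc.com")
    .option("mlUsername", "dh-operator")
    .option("mlPassword", "********")
    # Role/capability pairs applied to every ingested document.
    .option("permissions", "example-reader-role,read,example-writer-role,update")
    # Attach each document to two collections.
    .option("collections", "customer-ingest,crm-load")
    # Build predictable URIs from column values; if unset, URIs default to [UUID].json.
    .option("uritemplate", "/Customer/{DepartmentName}/{CustomerId}.json")
    .option("sourcename", "CRM System")
    .option("sourcetype", "Customer Record")
    .save())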
Export Options
The following options are used to export data from a data source.
Export options are available for use with all MarkLogic connectors unless stated otherwise.
REQUIRED Export Options | |
---|---|
Option | Description |
mlHost | The hostname of your service. Example: [*********].[***********].a.marklogicsvc.com
Tip: To locate the hostname of your service, you can click one of the endpoints and copy the hostname from your web browser. For details, see Access Private Endpoints or Access Public Endpoints.
If connecting to an on-premises environment, enter the MarkLogic Server host where Data Hub should be installed or is installed. |
mlUsername, mlPassword | The username and password associated with a service user account assigned data-hub-operator or any service role that inherits it. To learn more, see MarkLogic Data Hub Users and Roles. |
className | The name of the class in the MarkLogic Connector to which AWS Glue passes the dataframe.
The value must be:
For AWS Glue environment only. Important: The className option is required for any connector used in the AWS Glue environment.
|
format | The name of the Java class in the MarkLogic Connector from which Spark reads the data.
The value must be: com.marklogic.hub.spark.sql.sources.v2.DefaultSource
For Apache Spark connector only. |
secretId | The secret name of the key-value pairs stored in AWS Secrets Manager. The secretId option is only required if you stored options in AWS Secrets Manager.
Note: If you stored options in AWS Secrets Manager, do not also store these options in your AWS Glue scripts.
For AWS Glue connector only. |
view | The Template Driven Extraction (TDE) view name from which the rows are retrieved.
The view document must be installed in the schemas database. To learn more about TDE, see Template Driven Extraction (TDE) in the Application Developer's Guide. Important: This option is not required if you set the serializedplan or readrowsendpointparams option.
|
OPTIONAL Export Options | |
Option | Description |
hubDhs | The default value is true.
If connecting to an on-premises environment, hubDhs must be set to false. Important: The value of the hubDhs option also sets the value of the hubSsl option. For example, if hubDhs is false, then hubSsl is also false. If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option. |
hubSsl | The default value is true.
If your Data Hub project does not require SSL/TLS, hubSsl must be set to false. Important: If you would like hubDhs and hubSsl to have different values, you must set the hubSsl option.
|
mlFinalPort | The connector exports data using the app server associated with the value of the mlFinalPort option. The default value is 8011, which is the port used by the FINAL app server. To export data from a different database, set mlFinalPort to the port of the app server associated with the desired target database. For the full list of app servers and port numbers, see Provisioned App Servers. |
schema | The schema of the data you want to read from MarkLogic. The value must be a string. Example: Customer
The schema must be installed in the schemas database associated with the port number declared using the mlFinalPort option. For details about which app servers and databases are associated, see Servers and Databases. |
sqlcondition | SQL conditions for filtering data by column names. The value must be a string. Example: customerId > 3 and customerId < 7 |
selectedcolumns | The columns that are shown in the output. The value must be a string. Example: customerId
Important: If the value is not a column, an exception will be thrown.
|
serializedplan | The MarkLogic Optic API plan. You can set this option instead of setting the view, schema, sqlcondition, and selectedcolumns options. The value must be a valid JSON file. Example: CustomerIdLessThanFive.json
To learn more about the MarkLogic Optic API, see Optic API for Multi-Model Data Access in the Application Developer's Guide. Note: If you set the serializedplan option, do not also set any option whose value is already declared by the serialized plan; otherwise, an exception will be thrown. For example, if the serialized plan declares a value for the view, do not also set the view option.
Note: If you do not set the serializedplan option, the endpoint module will dynamically build a MarkLogic Optic API plan.
|
sparkschema | The Spark schema of the data you want to read from MarkLogic. The value must be a stringified JSON object.
Note: If you do not set the sparkschema option, the endpoint module will dynamically build a Spark schema based on the MarkLogic Optic API plan.
|
numpartitions | The number of partitions created in your Apache Spark cluster. This option determines the number of partition readers that will query MarkLogic. The more partitions reading from MarkLogic, the less time is needed to read all of the data. The value must be an integer that is greater than or equal to 1. |
optimizationlevel | The optimization level you want to use for the query. The value must be 0 , 1 , or 2 .
Important: For the MarkLogic Connector for AWS Glue only, the value of the option:
|
readrowsendpointparams | The JSON object that calls the custom read endpoint module. This option enables you to use a custom endpoint module (API and SJS files) that reads one or more JSON documents from MarkLogic. The connector will call the endpoint module associated with the custom API definition, instead of calling the default readRecords endpoint module. The custom endpoint module must be installed in the data-hub-MODULES database. The value must be a JSON object.
The JSON object can have the following keys: apiPath and endpointState. An illustrative sketch of this option appears after this table.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Important: If you set the readrowsendpointparams option, the view, schema, selectedcolumns, sqlcondition, serializedplan, and sparkschema options will not work as expected. Do not set these options if you set the readrowsendpointparams option.
Important: If the endpointState key is set but the apiPath key is not set, an exception will be thrown.
Note: To learn more about using this option, see Data Services for IO.
|
initializereadapipath | The URI of the custom initializeRead.api endpoint module. The option enables you to use a custom endpoint module (initializeRead.api and initializeRead.sjs ) for determining the schema. The connector will call the endpoint module associated with the custom API definition (initializeRead.api ), instead of calling the default initializeRead endpoint module. The value must be the URI of initializeRead.api installed in the data-hub-MODULES database. Example: /custom-job-endpoints/initializeRead.api
The custom endpoint module must be installed in the data-hub-MODULES database. The custom endpoint module names must be initializeRead.api and initializeRead.sjs.
Important: Before running the AWS Glue job, the custom endpoint module must be installed in the data-hub-MODULES database.
Note: To learn more about using this option, see Data Services for IO.
|
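As an illustration of the custom read endpoint options, the following PySpark sketch passes a readrowsendpointparams value. The API path and endpoint state shown are hypothetical, and the corresponding .api and .sjs modules must already be installed in the data-hub-MODULES database; note that the view option is not required when readrowsendpointparams is set:
import json

from pyspark.sql import SparkSession

# A minimal sketch of calling a custom read endpoint; the module URI and the
# endpoint state shape are hypothetical placeholders.
spark = SparkSession.builder.getOrCreate()

read_rows_params = {
    "apiPath": "/custom-read-endpoints/readRows.api",  # hypothetical custom API definition URI
    "endpointState": {"batchSize": 100},                # hypothetical initial endpoint state
}

rows = (spark.read
    .format("com.marklogic.hub.spark.sql.sources.v2.DefaultSource")
    .option("mlHost", "example.a.marklogicsvc.com")
    .option("mlUsername", "dh-operator")
    .option("mlPassword", "********")
    .option("readrowsendpointparams", json.dumps(read_rows_params))
    .load())

rows.show()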