public class IncrementalLoadFromJdbc
extends BulkLoadFromJdbcWithSimpleJoins
Demonstrates a way to load a massive volume of data updates from JDBC into MarkLogic, assuming the source cannot identify changes and deletes. In this example the source data (accessed via JDBC) continues to grow and evolve, so updates from the source must be regularly incorporated into the target system (MarkLogic Server). These updates include new documents, updated documents, and deleted documents. The source data is too large to ingest completely every time, so incremental loads are required that include only the updates. Many source systems offer a document version or last-updated timestamp; this pattern addresses the more difficult scenario where the source system offers no such option. Additionally, it addresses the scenario where the source system can provide a list of all current document uris, but cannot provide any information about modified or deleted documents.

# Solution

## Step 1

The process begins by reading
documents directly from JDBC, adding each to uriQueue and
sourceEmployees. One batch at a time, the uris from uriQueue are used to retrieve hashcodes from the target. For each source document, the hashcode is generated in memory and compared to the
hashcode from the target (if available). Documents with no hashcode
in the target are considered new and written to the target.
Documents with a hashcode different from the target are considered
updated and written to the target. In all cases a sidecar document
including the current source hashcode is written to the target.
This is all done in batches to reduce overhead on the application, source, and target systems. In addition, DMSDK processes batches in multiple threads and against multiple MarkLogic hosts to fully utilize the MarkLogic cluster.
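Below is a minimal sketch of this per-batch compare step, using the MarkLogic Java Client API. The SourceRecord interface, the sidecar uri convention (document uri plus ".sidecar.json"), and the JSON property names are illustrative assumptions rather than the example's actual code, and it assumes a multi-document read simply omits uris that have no document in the target.

```java
// Sketch only: per-batch compare of in-memory source hashcodes to target hashcodes.
import com.fasterxml.jackson.databind.JsonNode;
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.document.DocumentPage;
import com.marklogic.client.document.DocumentRecord;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.JacksonHandle;
import com.marklogic.client.io.StringHandle;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashcodeCompareStep {

  /** Compare one batch of source records to the target and queue any needed writes. */
  public static void processBatch(DatabaseClient client, WriteBatcher writeBatcher,
                                  List<SourceRecord> batch) {
    JSONDocumentManager docMgr = client.newJSONDocumentManager();

    // Read the sidecar documents for this batch of uris in one round trip;
    // uris with no sidecar in the target are simply not returned.
    String[] sidecarUris = batch.stream()
        .map(record -> record.getUri() + ".sidecar.json")
        .toArray(String[]::new);
    Map<String, String> targetHashcodes = new HashMap<>();
    DocumentPage sidecars = docMgr.read(sidecarUris);
    while (sidecars.hasNext()) {
      DocumentRecord sidecar = sidecars.next();
      JsonNode json = sidecar.getContent(new JacksonHandle()).get();
      targetHashcodes.put(json.get("uri").asText(), json.get("hashcode").asText());
    }

    // Compare each in-memory source hashcode to the target hashcode (if any).
    for (SourceRecord record : batch) {
      String sourceHash = record.calculateHashcode();
      String targetHash = targetHashcodes.get(record.getUri());
      boolean isNew = (targetHash == null);
      boolean isUpdated = (targetHash != null && !targetHash.equals(sourceHash));
      if (isNew || isUpdated) {
        writeBatcher.add(record.getUri(),
            new StringHandle(record.toJson()).withFormat(Format.JSON));
      }
      // In all cases a sidecar with the current hashcode and jobName is also
      // written -- see the Step 2 sketch below.
    }
  }

  /** Minimal stand-in for the example's in-memory Employee objects. */
  public interface SourceRecord {
    String getUri();
    String toJson();
    String calculateHashcode();
  }
}
```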
## Step 2

Any document written to
MarkLogic Server has a "sidecar" document written alongside it, containing
metadata including the document uri, a hashcode and a jobName. The
sidecar document has a collection representing the data source. The
hashcode is generated based on selected portions of the source document contents. The hashcode algorithm produces the same value when the source document hasn't changed and a different value any time the source document has changed. The jobName is any id or timestamp
representing the last job which checked the hashcode of the
document, and should differ from previous job runs. This sidecar document is updated with each job run to reflect the latest jobName.
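As a rough illustration of the sidecar write, the sketch below adds a sidecar document to a WriteBatcher and tags it with the datasource collection. The JSON property names, the ".sidecar.json" uri convention, and the collection parameter are assumptions for illustration, not the example's actual code.

```java
// Sketch only: writing a sidecar document carrying uri, hashcode, and jobName.
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

public class SidecarWriter {

  public static void writeSidecar(WriteBatcher writeBatcher, String docUri,
                                  String hashcode, String jobName,
                                  String datasourceCollection) {
    // The sidecar carries just enough metadata to detect updates and deletes later.
    String sidecarJson =
        "{ \"uri\": \"" + docUri + "\"," +
        "  \"hashcode\": \"" + hashcode + "\"," +
        "  \"jobName\": \"" + jobName + "\" }";

    // The collection marks which datasource this sidecar belongs to, so the
    // end-of-job pass (Step 3) can find sidecars with stale jobNames.
    DocumentMetadataHandle metadata = new DocumentMetadataHandle();
    metadata.getCollections().add(datasourceCollection);

    writeBatcher.add(docUri + ".sidecar.json", metadata,
        new StringHandle(sidecarJson).withFormat(Format.JSON));
  }
}
```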
## Step 3

As the last step of a job run, a query returns
all sidecar files with the collection for this datasource but a jobName different from the current jobName, which indicates these
documents are in MarkLogic but were missing from this job run and
are therefore not in the datasource. After confirming that these
documents are legitimately not in the datasource, they are archived
in MarkLogic Server. To archive documents we remove the collection
for this datasource and add an "archived" collection. This
effectively removes the documents from queries that are looking for
documents in the collection for this datasource. This is how we stay up-to-date with deletes when the source system offers no way to track deleted documents.
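Here is a hedged sketch of this end-of-job pass: a QueryBatcher runs a structured query for sidecars in the datasource collection whose jobName is not the current jobName, and each match (plus, under the assumed uri convention, its main document) has its collections swapped. The property and collection names and the simplified confirmation step are assumptions, not the example's actual code.

```java
// Sketch only: find sidecars from earlier jobs and archive them and their documents.
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.document.GenericDocumentManager;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.query.StructuredQueryBuilder;
import com.marklogic.client.query.StructuredQueryDefinition;

public class ArchiveMissingDocuments {

  public static void archiveStale(DatabaseClient client, String datasourceCollection,
                                  String currentJobName) {
    DataMovementManager dmm = client.newDataMovementManager();
    StructuredQueryBuilder sqb = client.newQueryManager().newStructuredQueryBuilder();

    // Sidecars in this datasource's collection whose jobName is NOT the current job:
    // their documents were not seen in this run, so they are gone from the source.
    StructuredQueryDefinition staleSidecars = sqb.and(
        sqb.collection(datasourceCollection),
        sqb.not(sqb.value(sqb.jsonProperty("jobName"), currentJobName)));

    QueryBatcher queryBatcher = dmm.newQueryBatcher(staleSidecars)
        .withBatchSize(100)
        .withThreadCount(4)
        .onUrisReady(batch -> {
          // A real job would confirm these uris are truly absent from the source
          // before archiving them.
          GenericDocumentManager docMgr = client.newDocumentManager();
          for (String sidecarUri : batch.getItems()) {
            // Assumed convention: main document uri = sidecar uri minus ".sidecar.json".
            String docUri = sidecarUri.replace(".sidecar.json", "");
            archive(docMgr, docUri, datasourceCollection);
            archive(docMgr, sidecarUri, datasourceCollection);
          }
        })
        .onQueryFailure(Throwable::printStackTrace);

    dmm.startJob(queryBatcher);
    queryBatcher.awaitCompletion();
    dmm.stopJob(queryBatcher);
  }

  /** Swap the datasource collection for an "archived" collection. */
  private static void archive(GenericDocumentManager docMgr, String uri,
                              String datasourceCollection) {
    DocumentMetadataHandle metadata = docMgr.readMetadata(uri, new DocumentMetadataHandle());
    metadata.getCollections().remove(datasourceCollection);
    metadata.getCollections().add("archived");
    docMgr.writeMetadata(uri, metadata);
  }
}
```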
# Alternative Solutions

## Alternative Solution 1

If your scenario allows you to load all the documents
each time, do that because it's simpler. Simply delete all data from that one source in the target, then reload the latest data from that source. This addresses new documents, updated documents, and deleted documents.
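As a rough sketch of that delete step, assuming the documents from this source are tagged with a datasource collection, DMSDK's DeleteListener can clear the collection before the reload:

```java
// Sketch only: delete everything previously loaded from one source, then reload.
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.DeleteListener;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.query.StructuredQueryBuilder;

public class FullReload {

  public static void deleteSourceCollection(DatabaseClient client, String datasourceCollection) {
    DataMovementManager dmm = client.newDataMovementManager();
    StructuredQueryBuilder sqb = client.newQueryManager().newStructuredQueryBuilder();

    QueryBatcher deleteBatcher = dmm.newQueryBatcher(sqb.collection(datasourceCollection))
        .withConsistentSnapshot()               // query a stable point-in-time view
        .onUrisReady(new DeleteListener())      // deletes each batch of matching uris
        .onQueryFailure(Throwable::printStackTrace);

    dmm.startJob(deleteBatcher);
    deleteBatcher.awaitCompletion();
    dmm.stopJob(deleteBatcher);
    // ...then reload the latest data from the source, e.g. with a WriteBatcher.
  }
}
```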
## Alternative Solution 2

Your scenario may
be different if it requires a one-time data migration rather than
an ongoing load of updates from the source. For example, a one-time
load for a production cut-over may have significant performance requirements that this solution cannot address. Also, some one-time migrations will not require comparison of hashcodes or tracking of deletes.

# Adjustments

## Solution Adjustment 1

If the source can
provide you with last updated timestamps, compare those instead of
hashcodes. This reduces the effort to select which portions of the
document to include in the hashcode. This also reduces the
processing required to calculate hashcodes each time.

## Solution Adjustment 2

The sidecar document can be written to a different
MarkLogic database, cluster, or non-MarkLogic system (including the
file system). This will reduce the read load on the database with
the actual document contents. This also opens up more options, such as writing the sidecar to a database with a different configuration, including forests on less expensive storage.
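A small sketch of what this could look like; the host, port, credentials, and the "sidecar-db" database name are placeholders, and the sidecar database itself is assumed to already exist:

```java
// Sketch only: send sidecar documents to a separate database via a second client.
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.WriteBatcher;

public class SidecarDatabase {

  /** A WriteBatcher that sends sidecar documents to a separate database. */
  public static WriteBatcher newSidecarWriteBatcher() {
    // Point a second client at the sidecar database; it could live on another
    // cluster or use forests on less expensive storage.
    DatabaseClient sidecarClient = DatabaseClientFactory.newClient(
        "localhost", 8000, "sidecar-db", new DigestAuthContext("myUser", "myPassword"));

    DataMovementManager dmm = sidecarClient.newDataMovementManager();
    WriteBatcher sidecarBatcher = dmm.newWriteBatcher()
        .withBatchSize(100)
        .withThreadCount(4);
    dmm.startJob(sidecarBatcher);
    return sidecarBatcher;
  }
}
```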
## Solution Adjustment 3

For
systems that offer a way to track deleted documents, use that
instead of step 3. Get the list of uris of source documents deleted
since the last job run. Archive or delete those documents (and
associated sidecar files) from MarkLogic Server.

## Solution Adjustment 4

The source documents can be read from a staging area
containing at least the uri and the up-to-date hashcode for each
document. This will reduce the read load on the source system to
only documents found to be missing from MarkLogic or updated from
what is in MarkLogic.

# Gotchas

## No Staging of Source Documents in Target

We recommend loading documents to a staging area in
MarkLogic without transformations so we can see the documents in
MarkLogic as they look in the source system. If we don't do that,
and instead transform the documents as we load them into MarkLogic, it may be unclear how to calculate hashcodes. Nevertheless, this pattern can still be applied; it just requires more careful design and documentation so
it can reasonably be maintained.

## Documents are not 1-1 from Source to Target

Not all documents (or records, or rows) from a
source system map 1-1 to final documents in a target system. This
may make it less obvious how to apply this pattern. Sometimes
mapping source documents to target documents occurs client-side.
Sometimes mapping source documents to target documents happens
server-side, as in the Data Hub Framework. One key to resolving this
is to generate hashcodes that help determine whether relevant
source data changed, so hashcodes should incorporate all relevant
source data but not data generated solely by transformations (or
harmonization). When all relevant source data comes from multiple
records, and no staging documents match source documents, the
source records must of course be combined prior to calculating
hashcodes, as we do in this example. Here we perform a join in the
source relational database to combine all relevant data into
multiple rows. Additionally, we combine multiple rows into a single
Employee object before we calculate the hashcodes.
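To illustrate that last point, here is a hedged sketch of an Employee-like class with hypothetical fields that hashes only the relevant source data after the rows have been combined; the field names and the SHA-256 digest are illustrative choices, not necessarily what this example's Employee class does.

```java
// Sketch only: hash the relevant source fields of a combined Employee object.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class Employee {
  private int employeeId;
  private String firstName;
  private String lastName;
  private List<String> salaryHistory;   // combined from multiple source rows
  private List<String> titleHistory;    // combined from multiple source rows

  public Employee(int employeeId, String firstName, String lastName,
                  List<String> salaryHistory, List<String> titleHistory) {
    this.employeeId = employeeId;
    this.firstName = firstName;
    this.lastName = lastName;
    this.salaryHistory = salaryHistory;
    this.titleHistory = titleHistory;
  }

  /**
   * A stable hash over the relevant source data only. Values produced by
   * transformation or harmonization in the target are deliberately excluded,
   * so the hash changes exactly when the source data changes. Assumes the
   * combined rows are kept in a consistent order so the hash is stable.
   */
  public String sourceHashcode() {
    String relevantSourceData = employeeId + "|" + firstName + "|" + lastName
        + "|" + String.join(",", salaryHistory)
        + "|" + String.join(",", titleHistory);
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(relevantSourceData.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }
}
```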