public class IncrementalLoadFromJdbc
extends BulkLoadFromJdbcWithSimpleJoins
Demonstrates a way to load a massive volume of data updates from JDBC into MarkLogic, assuming the source cannot identify changes and deletes. In this example the source data (accessed via JDBC) continues to grow and evolve, so updates from the source must be regularly incorporated into the target system (MarkLogic Server). These updates include new documents, updated documents, and deleted documents. The source data is too large to ingest completely every time, so incremental loads are required that include only the updates. Many source systems offer a document version or last-updated timestamp; this pattern addresses the more difficult scenario where the source system offers no such option. Additionally, it addresses the scenario where the source system can provide a list of all current document uris, but cannot provide any information about modified or deleted documents.

# Solution

## Step 1

The process begins by reading
documents directly from JDBC, adding each to uriQueue and
sourceEmployees. One batch at a time, the uris from uriQueue are used to retrieve hashcodes from the target. For each source document, the hashcode is generated in memory and compared to the
hashcode from the target (if available). Documents with no hashcode
in the target are considered new and written to the target.
Documents with a hashcode different from the target are considered
updated and written to the target. In all cases a sidecar document
including the current source hashcode is written to the target.
This is all done in batches to reduce overhead on the application, source, and target systems. In addition, DMSDK processes batches in multiple threads and against multiple MarkLogic hosts to fully utilize the MarkLogic cluster.
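Below is a minimal sketch of this per-batch compare step, using the MarkLogic Java Client API. The SourceRecord interface, the sidecar uri convention (document uri plus ".sidecar.json"), and the JSON property names are illustrative assumptions rather than the example's actual code, and it assumes a multi-document read simply omits uris that have no document in the target.

```java
// Sketch only: per-batch compare of in-memory source hashcodes to target hashcodes.
import com.fasterxml.jackson.databind.JsonNode;
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.document.DocumentPage;
import com.marklogic.client.document.DocumentRecord;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.JacksonHandle;
import com.marklogic.client.io.StringHandle;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashcodeCompareStep {

  /** Compare one batch of source records to the target and queue any needed writes. */
  public static void processBatch(DatabaseClient client, WriteBatcher writeBatcher,
                                  List<SourceRecord> batch) {
    JSONDocumentManager docMgr = client.newJSONDocumentManager();

    // Read the sidecar documents for this batch of uris in one round trip;
    // uris with no sidecar in the target are simply not returned.
    String[] sidecarUris = batch.stream()
        .map(record -> record.getUri() + ".sidecar.json")
        .toArray(String[]::new);
    Map<String, String> targetHashcodes = new HashMap<>();
    DocumentPage sidecars = docMgr.read(sidecarUris);
    while (sidecars.hasNext()) {
      DocumentRecord sidecar = sidecars.next();
      JsonNode json = sidecar.getContent(new JacksonHandle()).get();
      targetHashcodes.put(json.get("uri").asText(), json.get("hashcode").asText());
    }

    // Compare each in-memory source hashcode to the target hashcode (if any).
    for (SourceRecord record : batch) {
      String sourceHash = record.calculateHashcode();
      String targetHash = targetHashcodes.get(record.getUri());
      boolean isNew = (targetHash == null);
      boolean isUpdated = (targetHash != null && !targetHash.equals(sourceHash));
      if (isNew || isUpdated) {
        writeBatcher.add(record.getUri(),
            new StringHandle(record.toJson()).withFormat(Format.JSON));
      }
      // In all cases a sidecar with the current hashcode and jobName is also
      // written -- see the Step 2 sketch below.
    }
  }

  /** Minimal stand-in for the example's in-memory Employee objects. */
  public interface SourceRecord {
    String getUri();
    String toJson();
    String calculateHashcode();
  }
}
```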
## Step 2

Any document written to
MarkLogic Server has a "sidecar" document written alongside it, containing
metadata including the document uri, a hashcode and a jobName. The
sidecar document has a collection representing the data source. The
hashcode is generated based on selected portions of the source document contents. The hashcode algorithm produces the same value when the source document hasn't changed and a different value any time the source document has changed. The jobName is any id or timestamp
representing the last job which checked the hashcode of the
document, and should differ from previous job runs. This sidecar document is updated with each job run to reflect the latest jobName.
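As a rough illustration of the sidecar write, the sketch below adds a sidecar document to a WriteBatcher and tags it with the datasource collection. The JSON property names, the ".sidecar.json" uri convention, and the collection parameter are assumptions for illustration, not the example's actual code.

```java
// Sketch only: writing a sidecar document carrying uri, hashcode, and jobName.
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

public class SidecarWriter {

  public static void writeSidecar(WriteBatcher writeBatcher, String docUri,
                                  String hashcode, String jobName,
                                  String datasourceCollection) {
    // The sidecar carries just enough metadata to detect updates and deletes later.
    String sidecarJson =
        "{ \"uri\": \"" + docUri + "\"," +
        "  \"hashcode\": \"" + hashcode + "\"," +
        "  \"jobName\": \"" + jobName + "\" }";

    // The collection marks which datasource this sidecar belongs to, so the
    // end-of-job pass (Step 3) can find sidecars with stale jobNames.
    DocumentMetadataHandle metadata = new DocumentMetadataHandle();
    metadata.getCollections().add(datasourceCollection);

    writeBatcher.add(docUri + ".sidecar.json", metadata,
        new StringHandle(sidecarJson).withFormat(Format.JSON));
  }
}
```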
## Step 3

As the last step of a job run, a query returns
all sidecar files with the collection for this datasource but a jobName different from the current jobName, which indicates these
documents are in MarkLogic but were missing from this job run and
are therefore not in the datasource. After confirming that these
documents are legitimately not in the datasource, they are archived
in MarkLogic Server. To archive documents we remove the collection
for this datasource and add an "archived" collection. This
effectively removes the documents from queries that are looking for
documents in the collection for this datasource. This is how we stay up-to-date with deletes when the source system offers no way to track deleted documents.
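Here is a hedged sketch of this end-of-job pass: a QueryBatcher runs a structured query for sidecars in the datasource collection whose jobName is not the current jobName, and each match (plus, under the assumed uri convention, its main document) has its collections swapped. The property and collection names and the simplified confirmation step are assumptions, not the example's actual code.

```java
// Sketch only: find sidecars from earlier jobs and archive them and their documents.
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.document.GenericDocumentManager;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.query.StructuredQueryBuilder;
import com.marklogic.client.query.StructuredQueryDefinition;

public class ArchiveMissingDocuments {

  public static void archiveStale(DatabaseClient client, String datasourceCollection,
                                  String currentJobName) {
    DataMovementManager dmm = client.newDataMovementManager();
    StructuredQueryBuilder sqb = client.newQueryManager().newStructuredQueryBuilder();

    // Sidecars in this datasource's collection whose jobName is NOT the current job:
    // their documents were not seen in this run, so they are gone from the source.
    StructuredQueryDefinition staleSidecars = sqb.and(
        sqb.collection(datasourceCollection),
        sqb.not(sqb.value(sqb.jsonProperty("jobName"), currentJobName)));

    QueryBatcher queryBatcher = dmm.newQueryBatcher(staleSidecars)
        .withBatchSize(100)
        .withThreadCount(4)
        .onUrisReady(batch -> {
          // A real job would confirm these uris are truly absent from the source
          // before archiving them.
          GenericDocumentManager docMgr = client.newDocumentManager();
          for (String sidecarUri : batch.getItems()) {
            // Assumed convention: main document uri = sidecar uri minus ".sidecar.json".
            String docUri = sidecarUri.replace(".sidecar.json", "");
            archive(docMgr, docUri, datasourceCollection);
            archive(docMgr, sidecarUri, datasourceCollection);
          }
        })
        .onQueryFailure(Throwable::printStackTrace);

    dmm.startJob(queryBatcher);
    queryBatcher.awaitCompletion();
    dmm.stopJob(queryBatcher);
  }

  /** Swap the datasource collection for an "archived" collection. */
  private static void archive(GenericDocumentManager docMgr, String uri,
                              String datasourceCollection) {
    DocumentMetadataHandle metadata = docMgr.readMetadata(uri, new DocumentMetadataHandle());
    metadata.getCollections().remove(datasourceCollection);
    metadata.getCollections().add("archived");
    docMgr.writeMetadata(uri, metadata);
  }
}
```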
# Alternative Solutions

## Alternative Solution 1

If your scenario allows you to load all the documents
each time, do that because it's simpler. Simply delete all data from that one source in the target, then reload the latest data from that source. This addresses new documents, updated documents, and deleted documents.
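As a rough sketch of that delete step, assuming the documents from this source are tagged with a datasource collection, DMSDK's DeleteListener can clear the collection before the reload:

```java
// Sketch only: delete everything previously loaded from one source, then reload.
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.DeleteListener;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.query.StructuredQueryBuilder;

public class FullReload {

  public static void deleteSourceCollection(DatabaseClient client, String datasourceCollection) {
    DataMovementManager dmm = client.newDataMovementManager();
    StructuredQueryBuilder sqb = client.newQueryManager().newStructuredQueryBuilder();

    QueryBatcher deleteBatcher = dmm.newQueryBatcher(sqb.collection(datasourceCollection))
        .withConsistentSnapshot()               // query a stable point-in-time view
        .onUrisReady(new DeleteListener())      // deletes each batch of matching uris
        .onQueryFailure(Throwable::printStackTrace);

    dmm.startJob(deleteBatcher);
    deleteBatcher.awaitCompletion();
    dmm.stopJob(deleteBatcher);
    // ...then reload the latest data from the source, e.g. with a WriteBatcher.
  }
}
```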
## Alternative Solution 2

Your scenario may
be different if it requires a one-time data migration rather than
an ongoing load of updates from the source. For example, a one-time
load for a production cut-over may have significant performance requirements that this solution cannot address. Also, some one-time migrations will not require comparison of hashcodes or tracking of deletes.

# Adjustments

## Solution Adjustment 1

If the source can
provide you with last updated timestamps, compare those instead of
hashcodes. This reduces the effort to select which portions of the
document to include in the hashcode. This also reduces the
processing required to calculate hashcodes each time.

## Solution Adjustment 2

The sidecar document can be written to a different
MarkLogic database, cluster, or non-MarkLogic system (including the
file system). This will reduce the read load on the database with
the actual document contents. This also opens up more options, such as writing the sidecar to a database with a different configuration, including forests on less expensive storage.
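A small sketch of what this could look like; the host, port, credentials, and the "sidecar-db" database name are placeholders, and the sidecar database itself is assumed to already exist:

```java
// Sketch only: send sidecar documents to a separate database via a second client.
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.WriteBatcher;

public class SidecarDatabase {

  /** A WriteBatcher that sends sidecar documents to a separate database. */
  public static WriteBatcher newSidecarWriteBatcher() {
    // Point a second client at the sidecar database; it could live on another
    // cluster or use forests on less expensive storage.
    DatabaseClient sidecarClient = DatabaseClientFactory.newClient(
        "localhost", 8000, "sidecar-db", new DigestAuthContext("myUser", "myPassword"));

    DataMovementManager dmm = sidecarClient.newDataMovementManager();
    WriteBatcher sidecarBatcher = dmm.newWriteBatcher()
        .withBatchSize(100)
        .withThreadCount(4);
    dmm.startJob(sidecarBatcher);
    return sidecarBatcher;
  }
}
```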
## Solution Adjustment 3

For
systems that offer a way to track deleted documents, use that
instead of step 3. Get the list of uris of source documents deleted
since the last job run. Archive or delete those documents (and
associated sidecar files) from MarkLogic Server.

## Solution Adjustment 4

The source documents can be read from a staging area
containing at least the uri and the up-to-date hashcode for each
document. This will reduce the read load on the source system to
only documents found to be missing from MarkLogic or updated from
what is in MarkLogic.

# Gotchas

## No Staging of Source Documents in Target

We recommend loading documents to a staging area in
MarkLogic without transformations so we can see the documents in
MarkLogic as they look in the source system. If we don't do that,
and instead transform the documents as we load them into MarkLogic, it may be unclear how to calculate hashcodes. Nevertheless, this pattern can still be applied; it just requires more careful design and documentation so
it can reasonably be maintained.

## Documents are not 1-1 from Source to Target

Not all documents (or records, or rows) from a
source system map 1-1 to final documents in a target system. This
may make it less obvious how to apply this pattern. Sometimes
mapping source documents to target documents occurs client-side.
Sometimes mapping source documents to target documents happens
server-side, as in the Data Hub Framework. One key to resolving this
is to generate hashcodes that help determine whether relevant
source data changed, so hashcodes should incorporate all relevant
source data but not data generated solely by transformations (or
harmonization). When all relevant source data comes from multiple
records, and no staging documents match source documents, the
source records must of course be combined prior to calculating
hashcodes, as we do in this example. Here we perform a join in the
source relational database to combine all relevant data into
multiple rows. Additionally, we combine multiple rows into a single
Employee object before we calculate the hashcodes.
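To illustrate that last point, here is a hedged sketch of an Employee-like class with hypothetical fields that hashes only the relevant source data after the rows have been combined; the field names and the SHA-256 digest are illustrative choices, not necessarily what this example's Employee class does.

```java
// Sketch only: hash the relevant source fields of a combined Employee object.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class Employee {
  private int employeeId;
  private String firstName;
  private String lastName;
  private List<String> salaryHistory;   // combined from multiple source rows
  private List<String> titleHistory;    // combined from multiple source rows

  public Employee(int employeeId, String firstName, String lastName,
                  List<String> salaryHistory, List<String> titleHistory) {
    this.employeeId = employeeId;
    this.firstName = firstName;
    this.lastName = lastName;
    this.salaryHistory = salaryHistory;
    this.titleHistory = titleHistory;
  }

  /**
   * A stable hash over the relevant source data only. Values produced by
   * transformation or harmonization in the target are deliberately excluded,
   * so the hash changes exactly when the source data changes. Assumes the
   * combined rows are kept in a consistent order so the hash is stable.
   */
  public String sourceHashcode() {
    String relevantSourceData = employeeId + "|" + firstName + "|" + lastName
        + "|" + String.join(",", salaryHistory)
        + "|" + String.join(",", titleHistory);
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(relevantSourceData.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }
}
```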