MarkLogic is a database designed from the ground up to make massive quantities of heterogeneous data easily accessible through search. The design philosophy behind the evolution of MarkLogic is that storing data is only part of the solution. The data must also be quickly and easily retrieved and presented in a way that makes sense to different types of users. Additionally, the data must be reliably maintained by an enterprise-grade, scalable software solution that runs on commodity hardware. The purpose of this guide is to describe the mechanisms in MarkLogic that are used to achieve these objectives.
MarkLogic fuses database internals, search-style indexing, and application server behaviors into a unified system. It uses XML and JSON documents as its data model, and stores the documents within a transactional repository. It indexes the words and values from each of the loaded documents, as well as the document structure. Because of its unique Universal Index, MarkLogic does not require advance knowledge of the document structure or adherence to a particular schema. Through its application server capabilities, it is programmable and extensible.
MarkLogic clusters on commodity hardware using a shared-nothing architecture and supports massive scale, high availability, and very high performance. Customer deployments have scaled to hundreds of terabytes of source data while maintaining sub-second query response time.
The main characteristic of the relational data model used in a typical relational database management system (RDBMS) is the need for a schema to define what data is to be stored, how the data is to be categorized into tables, and the relationships between those tables in the database. In MarkLogic, documents are the data and there is no need for a schema. However, in MarkLogic, you can define a document data model (the basic structure of the XML or JSON in your documents) and configure indexes that leverage the data model to improve the speed and precision of your search queries. Your document data model can be as 'loose' or as 'strict' as you like. Unlike an RDBMS schema, a document data model can be created after the documents are loaded into MarkLogic and then refined as your needs evolve, which is one of its major benefits.
Another characteristic of the relational data model is normalization, meaning that related data is distributed across separate, interrelated tables. Normalization has the benefit of minimizing duplicate data, but the downside is that it is necessary for the database to join the data in the related tables together in response to certain queries. Such joins are time-consuming, so most modern databases have 'denormalization' techniques that minimize the need to repeatedly do joins for often-used queries. A materialized view is an example of denormalized data. Though denormalizing the data is preferable to doing repeated joins on the data, there is still a performance cost.
In a document data model, related data is typically all contained in the same document, so the data is already denormalized. Financial contracts, medical records, legal filings, presentations, blogs, tweets, press releases, user manuals, books, articles, web pages, metadata, sparse data, message traffic, sensor data, shipping manifests, itineraries, emails, and so on are all naturally modeled as documents. In some cases the data might start formatted as XML or JSON documents (for example, Microsoft Office 2007 documents or financial products written in FpML), but if not, documents can be transformed to XML or JSON during ingestion. Relational databases, in contrast, with their table-centric data models, cannot represent data like this as naturally and so either have to spread the data out across many tables (adding complexity and hurting performance) or keep this data as unindexed BLOBs or CLOBs.
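The contrast between a normalized layout and a document layout can be sketched with plain Python data. This is a toy illustration only: the customer and order records, field names, and helper functions are all hypothetical, and nothing here is a MarkLogic or SQL API.

```python
# Toy illustration only: plain Python data, not a MarkLogic or SQL API.

# Normalized (relational-style) layout: related data lives in separate tables.
customers = {1: {"name": "Ada Lopez"}}
orders = [
    {"order_id": 100, "customer_id": 1, "item": "keyboard"},
    {"order_id": 101, "customer_id": 1, "item": "monitor"},
]


def orders_with_customer_name(customers, orders):
    """Join each order to its customer row at query time."""
    return [
        {
            "order_id": o["order_id"],
            "item": o["item"],
            "customer": customers[o["customer_id"]]["name"],
        }
        for o in orders
    ]


# Document layout: the same facts are stored together, so no join is needed.
customer_document = {
    "name": "Ada Lopez",
    "orders": [
        {"order_id": 100, "item": "keyboard"},
        {"order_id": 101, "item": "monitor"},
    ],
}


def items_from_document(doc):
    """Read the related data directly from the single document."""
    return [order["item"] for order in doc["orders"]]
```

In the first layout every query that needs the customer name pays for a lookup across tables. In the second, the related data travels with the document.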
For those familiar with RDBMS schemas, the term 'schema' can be confusing when discussing XML content. Though the data in MarkLogic Server does not have to conform to a schema as understood in the RDBMS context, you can define 'schemas' for your XML content to describe a content model for a class of documents. This model defines the expected elements and attributes within XML documents and their permitted structure. Documents that do not conform to the schema are considered invalid and are rejected during ingestion into MarkLogic Server. MarkLogic does not support schemas for JSON content.
Schemas are useful in situations where certain documents must contain specific content and conform to a particular structure. For example, the schema for a company's employee directory might describe a valid <Employee> as consisting of an <EmployeeID> element, followed by a <Position> element. The content of an <EmployeeID> might have a datatype constraint that it consist of a sequence of exactly five digits.
XML schemas are loaded into the Schemas database or placed in the Config directory. You can then configure a group of App Servers or a specific App Server to use these schemas. Each XML schema is identified by a namespace. To use a schema, you declare its associated namespace in your documents and code.
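To make the employee directory example above concrete, the following is a minimal sketch of the same content model written as a hand-coded check in Python. It is not how MarkLogic validates documents (that is done with XML schemas), and the function name `is_valid_employee` is hypothetical. The sketch also assumes the elements are in no namespace.

```python
import re
import xml.etree.ElementTree as ET

# Exactly five ASCII digits, matching the datatype constraint in the example.
FIVE_DIGITS = re.compile(r"[0-9]{5}")


def is_valid_employee(xml_text):
    """Return True if the text is an <Employee> containing an <EmployeeID>
    of exactly five digits followed by a <Position> element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML
    if root.tag != "Employee":
        return False
    children = list(root)
    # The content model is an ordered sequence: EmployeeID, then Position.
    if [child.tag for child in children] != ["EmployeeID", "Position"]:
        return False
    employee_id = children[0].text or ""
    return FIVE_DIGITS.fullmatch(employee_id) is not None
```

A document with a four-digit ID, or with the two child elements in the opposite order, fails the check. This mirrors the way a schema constrains both structure and datatype.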
For more detail on custom XML schemas, see Understanding and Defining Schemas in the Administrator's Guide and Loading Schemas in the Application Developer's Guide. MarkLogic also supports 'virtual' SQL schemas that provide the naming context for SQL views, as described in the SQL Data Modeling Guide.
MarkLogic stores documents within its own transactional repository that was purpose-built with a focus on maximum performance and data integrity. Because the MarkLogic database is transactional, you can insert or update a set of documents as an atomic unit and have the very next query able to see those changes with zero latency. MarkLogic supports the full set of ACID properties: atomicity, consistency, isolation, and durability.
MarkLogic Server search supports a wide range of full-text features. These features include phrase search, boolean search, proximity, stemming, tokenization, decompounding, wildcarded searches, punctuation-sensitive search, diacritic-sensitive/insensitive searches, case-sensitive/insensitive searches, spelling correction functions, thesaurus functions, geospatial searches, advanced language and collation support, document quality settings, numerous relevance algorithms, individual term weighting, topic clustering, faceted navigation, custom-indexed fields, and much more. These features are all designed to build off of each other and work together in an extensible and flexible way.
Search is covered in more detail in Searching in MarkLogic Server.
MarkLogic indexes both text and structure, and the two can be queried together efficiently. For example, consider the challenge of querying and analyzing intercepted message traffic for threat analysis:
Find all messages sent by IP 126.96.36.199 between April 11th and April 13th where the message contains both 'wedding cake' and 'empire state building' (case and punctuation insensitive) where the phrases have to be within 15 words of each other but the message cannot contain another key phrase such as 'presents' (stemmed so 'present' matches also). Exclude any message that has a subject equal to 'Congratulations.' Also exclude any message where the matching phrases were found within a quote block in the email. Then, for matching messages, return the most frequent senders and recipients.
By using XML and/or JSON documents to represent each message and the structure-aware indexing to understand what is an IP, what is a date, what is a subject, and which text is quoted and which is not, a query like this is actually easy to write and highly performant in MarkLogic. Or consider some other examples.
Extract all large-format images from the 10 research articles most relevant to the phrase 'herniated disc.' Relevance should be weighted so that phrase appearance in a title is 5 times more relevant than body text, and appearance in an abstract is 2 times more relevant.
From a large corpus of emails find those sent by a particular user, sort them reverse chronological, and locate the last email they sent which had a footer block containing a phone number. Return the phone number.
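The way structural and textual constraints combine in the message-traffic query above can be sketched in plain Python. This is a simplified illustration, not MarkLogic query syntax. The message fields (`sender_ip`, `sent`, `subject`, `body`) are hypothetical, proximity is measured between phrase start positions, and the stemmed 'presents' exclusion and quote-block exclusion are left out. A real system would answer this from indexes rather than by scanning text.

```python
import re
from datetime import date


def words(text):
    """Lowercase the text and drop punctuation (case and punctuation insensitive)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_positions(tokens, phrase):
    """Return the start index of every occurrence of the phrase in the tokens."""
    target = words(phrase)
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]


def phrases_near(text, first, second, max_distance):
    """True if both phrases occur with start positions at most max_distance words apart."""
    tokens = words(text)
    starts_a = phrase_positions(tokens, first)
    starts_b = phrase_positions(tokens, second)
    return any(abs(a - b) <= max_distance for a in starts_a for b in starts_b)


def matches(message, sender_ip, start, end):
    """Combine structural constraints (sender, date, subject) with text constraints."""
    return (
        message["sender_ip"] == sender_ip
        and start <= message["sent"] <= end
        and message["subject"] != "Congratulations"
        and phrases_near(message["body"], "wedding cake", "empire state building", 15)
    )
```

Each clause in `matches` corresponds to one condition in the prose query. The point of structure-aware indexing is that all of these conditions can be resolved together instead of one after another.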
Indexing is covered in more detail in Indexing in MarkLogic.
Semantic technologies refer to a family of W3C standards that allow the exchange of data (and information about relationships in data) in machine-readable form, whether it resides on the Web or within organizations. Semantics requires a flexible data model (Resource Description Framework, or RDF), a query language (SPARQL), a language to manage RDF data (SPARQL Update), and a common markup language for the data (such as Turtle, RDFa, or N-Triples).
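The RDF data model represents facts as subject-predicate-object triples, and a query matches patterns against them. The following toy sketch shows the idea with an in-memory list of tuples. It is neither SPARQL nor a MarkLogic API, and the sample data and function names are hypothetical.

```python
# Toy in-memory triple store; this is neither SPARQL nor a MarkLogic API.
triples = [
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "berlin"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern. None acts as a wildcard,
    playing the role a variable plays in a SPARQL triple pattern."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def employees_in_city(triples, city):
    """Chain two patterns: ?org locatedIn city, then ?person worksFor ?org."""
    orgs = {s for (s, _, _) in match(triples, predicate="locatedIn", obj=city)}
    return sorted(s for (s, _, o) in match(triples, predicate="worksFor") if o in orgs)
```

Chaining the two patterns answers a question that neither triple states directly, namely who works in a given city. Following relationships in this way is what the flexible data model is meant to support.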
For details about MarkLogic semantics, see Introduction to Semantics in MarkLogic in the Semantics Developer's Guide. For more information on semantic queries, see Semantic Searches. For details on the triple index, see Triple Index. For information on SPARQL Update, see Managing Semantic Triples.
Binary documents require special consideration in MarkLogic Server because binary content is often much larger than text, JSON, or XML content. MarkLogic Server provides a number of mechanisms that enable you to optimize the management of binary data. You can set thresholds that determine where a binary document is stored, depending on its size. For example, smaller binary documents can be efficiently stored in the same manner as non-binary documents. Larger binaries can be stored in a special Large Data Directory, which may use a different class of disk than your regular filesystem. The largest binary documents are typically stored outside MarkLogic Server in external storage. When binary documents are stored externally, MarkLogic maintains pointers to the documents, so it is easy to manage security on them within MarkLogic.
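The size-based placement described above can be sketched as a simple classification. The threshold value, the function name, and the treatment of external storage as an explicit flag are assumptions for illustration, as is the rule that a binary exactly at the threshold counts as small. The actual settings and their semantics are described in the MarkLogic documentation.

```python
# Hypothetical threshold for illustration; the real value is a configuration choice.
LARGE_SIZE_THRESHOLD_BYTES = 1024 * 1024


def storage_class(size_bytes, stored_externally=False,
                  threshold=LARGE_SIZE_THRESHOLD_BYTES):
    """Classify a binary document as 'external', 'large', or 'small'.

    Assumes external storage is chosen explicitly by the caller rather than
    inferred from size, and that binaries above the threshold are 'large'.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if stored_externally:
        return "external"
    return "large" if size_bytes > threshold else "small"
```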
When a small binary is cached by MarkLogic Server, the entire document is cached in memory. When a large or external binary is cached, the content is fetched into a compressed tree cache either fully or in chunks, as needed.
MarkLogic operates as a single multi-threaded process per host. It opens various socket ports for external communication. When configuring new socket ports for your application to use, you can choose between three distinct protocols: HTTP, XDBC, and WebDAV.
XDBC enables programmatic access to MarkLogic from other language contexts, similar to what JDBC and ODBC provide for relational databases. MarkLogic officially supports Java and .NET client libraries, named XCC. There are open source libraries in other languages. XDBC and the XCC client libraries make it easy to integrate MarkLogic into an existing application stack.
WebDAV is a protocol that lets a MarkLogic repository look like a filesystem to WebDAV clients, of which there are many including built-in clients in most operating systems. With a WebDAV mount point you can drag-and-drop files in and out of MarkLogic as if it were a network filesystem. This can be useful for small projects; large projects usually create an ingestion pipeline and send data over XDBC or HTTP.
MarkLogic is designed to maximize speed and scale. Many MarkLogic applications compose advanced queries across terabytes of data that make up many millions of documents and return answers in less than a second. The largest live deployments now exceed 200 terabytes and a billion documents. The largest projects now under development will exceed a petabyte.
For more detail on performance, see the Query Performance and Tuning Guide.
To achieve speed and scale beyond the capabilities of one server, MarkLogic clusters across commodity hardware connected on a LAN. A commodity server might be a box with 4 or 8 cores, 64 or 128 gigabytes of RAM or more, and either a large local disk or access to a SAN. On a box such as this, a rule of thumb is that you can store roughly 1 terabyte of data, sometimes more and sometimes less, depending on your use case.
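The rule of thumb above turns into a rough first estimate of how many data-holding hosts a given volume of data needs. The helper below is a hypothetical back-of-the-envelope calculation, not a sizing tool. The 1 terabyte default is the rule of thumb from the text, and real capacity depends on the use case.

```python
import math


def hosts_needed(data_terabytes, terabytes_per_host=1.0):
    """Estimate the number of hosts required to hold the data, rounding up.

    terabytes_per_host defaults to the rough 1 TB rule of thumb; treat the
    result as a starting point rather than a capacity plan.
    """
    if data_terabytes < 0 or terabytes_per_host <= 0:
        raise ValueError("data size must be >= 0 and per-host capacity > 0")
    return max(1, math.ceil(data_terabytes / terabytes_per_host))
```

For example, `hosts_needed(2.5)` rounds up to 3 hosts, and even a very small data set still needs one.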
MarkLogic hosts are typically configured to specialize in one of two operations. Some hosts (Data Managers, or D-nodes) manage a subset of data. Other hosts (Evaluators, or E-nodes) handle incoming user queries and internally federate across the D-nodes to access the data. A load balancer spreads queries across E-nodes. As you load more data, you add more D-nodes. As your user load increases, you add more E-nodes. Note that in some cluster architecture designs, some hosts may act as both a D-node and an E-node. In a single-host environment that is always the case.
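The division of labor between E-nodes and D-nodes follows a scatter-gather pattern: an evaluator sends the query to every data manager and merges the partial results. The classes below are a toy sketch of that pattern with hypothetical names and a naive term-count score. They do not reflect how MarkLogic implements federation, indexing, or relevance.

```python
import heapq


class DNode:
    """Holds a subset of the documents and answers queries over that subset only."""

    def __init__(self, documents):
        self.documents = documents  # maps URI -> text

    def search(self, term):
        """Return (score, uri) pairs for local documents containing the term."""
        results = []
        for uri, text in self.documents.items():
            score = text.lower().split().count(term.lower())
            if score:
                results.append((score, uri))
        return results


class ENode:
    """Accepts a query, fans it out to every D-node, and merges the results."""

    def __init__(self, d_nodes):
        self.d_nodes = d_nodes

    def search(self, term, limit=10):
        partial = [hit for node in self.d_nodes for hit in node.search(term)]
        # Highest score first; ties are broken by URI for a stable order.
        best = heapq.nsmallest(limit, partial, key=lambda hit: (-hit[0], hit[1]))
        return [uri for _, uri in best]
```

Because the E-node holds no data of its own, adding D-nodes increases data capacity and adding E-nodes increases query capacity, without changing the other tier.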
Clustering enables high availability. In the event an E-node should fail, there is no host-specific state to lose, just the in-process requests (that can be retried), and the load balancer can route traffic to the remaining E-nodes. Should a D-node fail, that subset of the data needs to be brought online by another D-node. You can do this by using either a clustered filesystem (allowing another D-node to directly access the failed D-node's storage and replay its journals) or intra-cluster data replication (replicating updates across multiple D-node disks, providing in essence a live backup).
For more detail on clustering, see Clustering and Caching.
MarkLogic clusters can be deployed on Amazon Elastic Compute Cloud (EC2). An EC2 deployment of MarkLogic enables you to quickly get started with any sized cluster (or clusters) with a minimum upfront investment. You can then scale your clusters up or down, as your needs evolve.
In the EC2 environment, a MarkLogic Server is deployed as an EC2 instance. MarkLogic instances of various sizes and capabilities, as well as their related resources, can be easily created by using the CloudFormation templates available from http://developer.marklogic.com/products/aws. EC2 provides elastic load balancers (ELBs) to automatically distribute and balance application traffic among multiple EC2 instances of MarkLogic, along with a wide range of tools and services to manage MarkLogic clusters in the EC2 environment.
In EC2, data centers are physically distributed into regions and zones. Each region is a separate geographic area, such as US west and US east. Each region contains multiple zones, which are separate data centers in the same geographic area that are connected through low-latency links. To ensure high availability in the EC2 environment, you can place D-nodes in different availability zones in the same region and configure them for local-disk failover to ensure that each transaction is written to one or more replicas. For optimum availability, D-nodes and E-nodes can be split evenly between two availability zones. For disaster recovery, you can place D-nodes in different regions and use database replication between the D-nodes in each region.
For more detail on deploying MarkLogic on EC2, see the MarkLogic Server on Amazon EC2 Guide.