Loading TOC...
Semantic Graph Developer's Guide (PDF)

Semantic Graph Developer's Guide — Chapter 4

Triple Index Overview

This chapter provides an overview of the triple index in MarkLogic Server and includes the following sections:

Understanding the Triple Index and How It's Used

The triple index is used to index schema-valid sem:triple elements found anywhere in a document. The indexing of triples is performed when documents containing triples are ingested into MarkLogic or during a database reindex. The triple index stores each unique value only once, in the dictionary. The dictionary gives each value an ID, and the triple data then uses that ID to reference the value in the dictionary.

The validity of sem:triple elements is determined by checking elements and attributes in the documents against the sem:triple schema (/MarkLogic/Config/semantics.xsd). If the sem:triple element is valid, an entry is created in the triple index, otherwise the element is skipped. Unlike range indexes, triple indexes do not have to fit in memory, so there is little up-front memory allocation.

For all new installations of MarkLogic 9 and later, the triple index and collection lexicon are enabled by default. Any new databases will also have the triple index and collection lexicon enabled.

This section covers the following topics:

Triple Data and Value Caches

Internally, MarkLogic stores triples in two ways: triple values and triple data. The triple values are the individual values from every triple, including all typed literal, IRIs, and blank nodes. The triple data holds the triples in different permutations, along with a document ID and position information. The triple data refer to the triple values by ID, making for very efficient lookup. Triple data is stored compressed on disk, and triple values are stored in a separate compressed value store. Both the triple index and the value store are stored in compressed four-kilobyte (4k) blocks.

When triple data is needed (for example, during a lookup), the relevant block is cached in either the triple cache or the triple value cache. Unlike other MarkLogic caches, the triple cache and triple value cache shrinks and grows, only taking up memory when it needs to add to the caches.

You can configure the size of the triple cache and the triple value cache for the host of your triple store, as described in Sizing Caches.

Triple Cache and Triple Value Cache

The triple cache holds blocks of compressed triples from disk which are flushed using a least recently used (LRU) algorithm. Blocks in the triple cache refer to values from a dictionary. The triple value cache holds uncompressed values from the triple index dictionary. The triple value cache is also an LRU cache.

Triples in the triple index are filtered out depending on the timestamps of the query and of the document from which they came. The triple cache holds information generated before the filtering happens, so deleting a triple has no effect on triple caches. However, after a merge, old stands may be deleted. When a stand is deleted, all its blocks are flushed from the triple caches.

Cache timeout controls how long MarkLogic Server will keep triple index blocks in the cache after the last time it was used (when it has not been flushed to make room for another block). Increasing the cache timeout might be good for keeping the cache hot for queries that are run at infrequent periods. Other more frequent queries may push the information out of the cache before the infrequent query is re-run.

Triple Values and Type Information

Values are stored in a separate value store on disk in value equality sorted order, so in a given stand, the value ID order is equivalent to value equality order.

Strings in the values are stored in the range index string storage. Anything not relevant to value equality is removed from the stored values, such as timezone and derived type information.

Since type information is stored separately, triples can be returned directly from the triple index. This information is also used for RDF-specific sameTerm comparison required by SPARQL simple entailment.

Triple Positions

The triple positions index is used to accurately resolve queries that use cts:triple-range-query and the item-frequency option of cts:triples. The triple positions index is also used to accurately resolve searches that use the cts:near-query and cts:element-query constructors. The triple positions index stores locations within a fragment of the relative positions of triples within that fragment (typically, a fragment is a document). Enabling the triple positions index increases index sizes and somewhat slows document loads, but it increases the accuracy of queries that need those positions.

For example:

xquery version "1.0-ml";

cts:search(doc(),
  cts:near-query((
  cts:triple-range-query(sem:iri("http://www.rdfabout.com/rdf/
    usgov/sec/id/cik0001075285"), (), ()),

  cts:triple-range-query(sem:iri("http://www.rdfabout.com/rdf/
    usgov/sec/id/cik0001317036"), (), ())
    ),11), "unfiltered")

The cts:near-query returns a sequence of queries to match, where the matches occur within the specified distance from each other. The distance specified is in the number of words between any two matching queries.

The unfiltered search selects fragments from the indexes that are candidates to satisfy the specified cts:query and returns the document.

Index Files

To efficiently make use of memory, the index files for triple and value stores are directly mapped into memory. The type store is entirely mapped into memory.

Both the triple and value stores have index files consisting of 64-byte segments. The first segment in each is a header containing checksums, version number, and counts (of triples or values). This is followed by:

  • Triples index: After the header segment, the triples index contains an index of the first two values and permutation of the first triple in each block, arranged into 64-byte segments. This is used to find the blocks needed to answer a given lookup based on values from the triple. Currently triples are not accessed by ordinal, so an ordinal index is not required.
  • Values Index: After the header segment, the values index contains an index of the first value in each block, arranged into 64-byte segments. The values index is used to find the blocks needed to answer a given lookup based on value. This is followed by an index of the starting ordinal for each block, which is used to find the block needed to answer a given lookup based on a value ID.

    The triple index stores positions if the triple positions is enabled. See Enabling the Triple Index.

The type store has an index file that stores the offset into the type data file for each stored type. This is also mapped into memory.

This table describes the memory-mapped index files that store information used by the triple indexes and values stores.

Index File Description
TripleIndex
TripleValueIndex
Block indexes for TripleData and TripleValueData
TripleTypeData
TripleTypeIndex
Type information for the triple values
StringData
StringIndex
AtomData
AtomIndex
Also used by the string-based range indexes
TripleValueFreqs
TripleValueFreqsIndex
Statistical information about the triples. The triple index keeps statistics on the triples for every value kept in the database.

Permutations

The permutation enumeration details the role each value plays in the original triple. Three permutations are stored in order to provide access to different sort orders, and to be able to efficiently look up different parts of the triple. The permutations are acronyms made up from the initials of the three RDF elements (subject, predicate, and object), for example:{ SOP, PSO, OPS }.

Use the cts:triples function to specify one of these sort orders in the options:

  • order-pso - Returns results ordered by predicate, then subject, then object
  • order-sop - Returns results ordered by subject, then object, then predicate
  • order-ops - Returns results ordered by object, then predicate, then subject

Enabling the Triple Index

By default, the triple index is enabled for databases in MarkLogic 9 or later. This section discusses how to enable the triple index or verify that it is enabled. It also discusses related indexes and configuration settings. It includes the following topics:

Using the Database Configuration Pages

The triple index can be enabled or disabled on the Admin Interface (http://hostname:8001) database configuration page. The hostname is the MarkLogic Server host for which the triple index is to be enabled.

For more information about index settings, see Index Settings that Affect Documents of the Administrator's Guide and Configuring the Database to Work with Triples.

For all new installations of MarkLogic 9 and later, the triple index is enabled by default. Any new databases will also have the triple index enabled. You may want to verify that existing databases have the triple index enabled.

Use the following procedures to verify or configure the triple index and related settings. To enable the triple positions index, the in-memory triple index size, and collection lexicon, use the Admin interface (http://hostname:8001) or the Admin API. See Using the Admin API for details.

  • In the Admin Interface, scroll down to the triple index setting and set it to true.

    When you enable the triples index for the first time, or if you are reindexing your database after enabling the triple index, only documents containing valid sem:triple elements are indexed.

  • You can enable the triple positions index for faster near searches using cts:triple-range-query.

    It is not necessary to enable the triple position index for querying with native SPARQL.

  • You can set the size of cache and buffer memory to be allocated for managing triple index data for an in-memory stand.

    When you change any index settings for a database, the new settings will take effect based on whether reindexing is enabled (reindexer enable set to true).

Using the Admin API

Use these Admin API functions to enable the triple index, triple index positions, and configure the in-memory triple index size for your database:

This example sets the triple index of Sample-Database to true using the Admin API:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" at
  "/MarkLogic/admin.xqy";

(: Get the configuration :)
let $config := admin:get-configuration()

(: Obtain the database ID of 'Sample-Database' :)
let $Sample-Database := admin:database-get-id(
  $config, "Sample-Database")
let $c := admin:database-set-triple-index($config, $Sample-Database, fn:true())
return admin:save-configuration($c)

This example uses the Admin API to set the triple positions of the database to true:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" at
"/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $Sample-Database := admin:database-get-id(
  $config, "Sample-Database")
let $c := admin:database-set-triple-positions($config,
  $Sample-Database, fn:true())
return admin:save-configuration($c)

This example sets the in-memory triple index size of the database to 256MB:

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin" at
"/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $Sample-Database := admin:database-get-id(
  $config, "Sample-Database")
let $c := admin:database-set-in-memory-triple-index-size($config,
  $Sample-Database, 256)
return admin:save-configuration($c)

For details about the function signatures and descriptions, see the admin:database functions (database) in the XQuery and XSLT Reference Guide.

Other Considerations

This section includes the following topics:

Sizing Caches

The triple cache and the triple value cache are d-node caches, which are partitioned for lock contention. This partitioning enables parallelism and speeds up processing.

The maximum sizes of the caches and number of partitions are configurable. To change the triple or triple value cache sizes for the host, you can use the Groups configuration page in the Admin Interface or use the Admin API.

In the Admin Interface (http://hostname:8001) on the Groups configuration page, specify values for caches sizes, partitions, and timeouts:

This table describes the Admin API functions for group cache configurations:

Function Description
admin:group-set-triple-cache-size Changes the triple cache size setting of the group with the specified ID to the specified value
admin:group-set-triple-cache-partitions Changes the triple cache partitions setting of the group with the specified ID to the specified value
admin:group-set-triple-cache-timeout Changes the number of seconds a triple block can be unused before being flushed from caches
admin:group-set-triple-value-cache-timeout Changes the number of seconds a triple value block can be unused before being flushed from caches
admin:group-set-triple-value-cache-size Changes the triple value cache size setting of the group with the specified ID to the specified value
admin:group-set-triple-value-cache-partitions Changes the triple value cache partitions setting of the group with the specified ID to the specified value

Unused Values and Types

During a merge, triple values and types may become unused by the triple index. To merge the triple index in a single streaming pass, type and value stores are merged before the triples. Unused values and types are identified during the merge of the triples. During the next merge, the unused types and values identified are be removed, releasing the space they previously used.

For best compaction, two merges are needed. This is not an issue in normal operations because MarkLogic Server is designed to periodically merge.

Since the type store is ordered by frequency, it is merged entirely in memory. The value and triple stores are merged in a streaming fashion, from and to disk directly.

For more information about merging, see Understanding and Controlling Database Merges in the Administrator's Guide.

Scaling and Monitoring

Since SPARQL execution does not fetch fragments, there is the potential to scale back on expanded and compressed tree caches on triple-only deployments. You can configure tree caches from the Group configuration page in the Admin Interface, or by using these functions:

admin:group-set-expanded-tree-cache-size
admin:group-set-compressed-tree-cache-size

You can monitor the status of the database and forest from the database Status page in the Admin Interface:

http://hostname:8001/

You can also use the MarkLogic monitoring tools, Monitoring Dashboard and Monitoring History:

http://hostname:8002/dashboard
http://hostname:8002/history

For more information, see Using the MarkLogic Server Monitoring Dashboard in the Monitoring MarkLogic Guide.

You can also use these functions for query metrics and to monitor the status of forests and caches:

  • xdmp:query-meters - Cache hits or misses for a query
  • xdmp:forest-status - Cache hits or misses, hit rate, and miss rate for each stand
  • xdmp:cache-status - Percentage busy, used, and free by cache partition
« Previous chapter
Next chapter »