This chapter provides an overview of the triple index in MarkLogic Server.
The triple index is used to index schema-valid sem:triple elements found anywhere in a document. The indexing of triples is performed when documents containing triples are ingested into MarkLogic or during a database reindex. The triple index stores each unique value only once, in the dictionary. The dictionary gives each value an ID, and the triple data then uses that ID to reference the value in the dictionary.
The validity of sem:triple elements is determined by checking elements and attributes in the documents against the sem:triple schema (/MarkLogic/Config/semantics.xsd). If the sem:triple element is valid, an entry is created in the triple index; otherwise, the element is skipped. Unlike range indexes, triple indexes do not have to fit in memory, so there is little up-front memory allocation.
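As an illustration of a triple embedded in an ordinary document, here is a minimal sketch. The document URI, the wrapper element names, and the example.org IRIs are hypothetical and chosen only for this example; the sem:triple element itself follows the subject, predicate, object structure that the schema validates.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI and IRIs, for illustration only. :)
(: The sem:triple element sits inside a regular XML document; if it is
   schema-valid and the triple index is enabled, it is indexed on ingest. :)
xdmp:document-insert(
  "/example/triple-doc.xml",
  <article>
    <title>Example document with an embedded triple</title>
    <sem:triple xmlns:sem="http://marklogic.com/semantics">
      <sem:subject>http://example.org/people/alice</sem:subject>
      <sem:predicate>http://example.org/terms/knows</sem:predicate>
      <sem:object>http://example.org/people/bob</sem:object>
    </sem:triple>
  </article>)
```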
For all new installations of MarkLogic 9 and later, the triple index and collection lexicon are enabled by default. Any new databases will also have the triple index and collection lexicon enabled.
Internally, MarkLogic stores triples in two ways: triple values and triple data. The triple values are the individual values from every triple, including all typed literals, IRIs, and blank nodes. The triple data holds the triples in different permutations, along with a document ID and position information. The triple data refers to the triple values by ID, making for very efficient lookup. Triple data is stored compressed on disk, and triple values are stored in a separate compressed value store. Both the triple index and the value store are stored in compressed four-kilobyte (4k) blocks.
When triple data is needed (for example, during a lookup), the relevant block is cached in either the triple cache or the triple value cache. Unlike other MarkLogic caches, the triple cache and triple value cache shrink and grow, taking up memory only when they need to add to the caches.
You can configure the size of the triple cache and the triple value cache for the host of your triple store, as described in Sizing Caches.
The triple cache holds blocks of compressed triples from disk, which are flushed using a least recently used (LRU) algorithm. Blocks in the triple cache refer to values from a dictionary. The triple value cache holds uncompressed values from the triple index dictionary. The triple value cache is also an LRU cache.
Triples in the triple index are filtered out depending on the timestamps of the query and of the document from which they came. The triple cache holds information generated before the filtering happens, so deleting a triple has no effect on triple caches. However, after a merge, old stands may be deleted. When a stand is deleted, all its blocks are flushed from the triple caches.
Cache timeout controls how long MarkLogic Server keeps a triple index block in the cache after the last time it was used (provided it has not been flushed to make room for another block). Increasing the cache timeout can help keep the cache hot for queries that run infrequently. Other, more frequent queries may still push the information out of the cache before the infrequent query is re-run.
Values are stored in a separate value store on disk in value equality sorted order, so in a given stand, the value ID order is equivalent to value equality order.
Strings in the values are stored in the range index string storage. Anything not relevant to value equality is removed from the stored values, such as timezone and derived type information.
Since type information is stored separately, triples can be returned directly from the triple index. This information is also used for RDF-specific sameTerm comparison required by SPARQL simple entailment.
The triple positions index is used to accurately resolve queries that use cts:triple-range-query and the item-frequency option of cts:triples. The triple positions index is also used to accurately resolve searches that use the cts:near-query and cts:element-query constructors. The triple positions index stores the relative positions of triples within a fragment (typically, a fragment is a document). Enabling the triple positions index increases index sizes and somewhat slows document loads, but it increases the accuracy of queries that need those positions.

    xquery version "1.0-ml";
    cts:search(doc(),
      cts:near-query((
        cts:triple-range-query(
          sem:iri("http://www.rdfabout.com/rdf/usgov/sec/id/cik0001075285"),
          (), ()),
        cts:triple-range-query(
          sem:iri("http://www.rdfabout.com/rdf/usgov/sec/id/cik0001317036"),
          (), ())
      ), 11),
      "unfiltered")

The cts:near-query constructor returns a query that matches all of the specified queries, where the matches occur within the specified distance of each other. The distance is specified as the number of words between any two matching queries.
The unfiltered search selects fragments from the indexes that are candidates to satisfy the specified cts:query and returns the document.
To efficiently make use of memory, the index files for triple and value stores are directly mapped into memory. The type store is entirely mapped into memory.
Both the triple and value stores have index files consisting of 64-byte segments. The first segment in each is a header containing checksums, version number, and counts (of triples or values). This is followed by:
The triple index stores positions if the triple positions index is enabled. See Enabling the Triple Index.
The type store has an index file that stores the offset into the type data file for each stored type. This is also mapped into memory.
This table describes the memory-mapped index files that store information used by the triple indexes and values stores.
The permutation enumeration details the role each value plays in the original triple. Three permutations are stored in order to provide access to different sort orders, and to be able to efficiently look up different parts of the triple. The permutations are acronyms made up from the initials of the three RDF elements (subject, predicate, and object), for example: { SOP, PSO, OPS }.
Use the cts:triples function to specify one of these sort orders in the options:
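A minimal sketch of requesting a particular sort order follows. It assumes the option strings take the form "order-pso", "order-sop", and "order-ops", and it uses a hypothetical example.org predicate IRI; adjust both to your data and verify the option names against the cts:triples reference.

```xquery
xquery version "1.0-ml";

(: Hypothetical predicate IRI, for illustration only. :)
(: Return up to ten triples with the given predicate, asking for
   predicate-subject-object order via the options argument. :)
cts:triples(
  (),                                          (: any subject :)
  sem:iri("http://example.org/terms/knows"),   (: fixed predicate :)
  (),                                          (: any object :)
  "=",                                         (: equality operator :)
  "order-pso"                                  (: requested sort order :)
)[1 to 10]
```

Choosing the permutation whose leading component matches the fixed part of the pattern (here, the predicate) is the usual reason to specify an order explicitly.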
By default, the triple index is enabled for databases in MarkLogic 9 or later. This section discusses how to enable the triple index or verify that it is enabled. It also discusses related indexes and configuration settings.
The triple index can be enabled or disabled on the Admin Interface (http://hostname:8001) database configuration page. The hostname is the MarkLogic Server host for which the triple index is to be enabled.
For more information about index settings, see Index Settings that Affect Documents of the Administrator's Guide and Configuring the Database to Work with Triples.
For all new installations of MarkLogic 9 and later, the triple index is enabled by default. Any new databases will also have the triple index enabled. You may want to verify that existing databases have the triple index enabled.
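One way to check an existing database is to read the current settings through the Admin API. This is a sketch that assumes a database named Sample-Database (the name used in the examples in this chapter); it only reads the configuration and changes nothing.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Read-only check of the triple-related settings for one database. :)
let $config := admin:get-configuration()
let $db := admin:database-get-id($config, "Sample-Database")
return (
  (: true if the triple index is enabled :)
  admin:database-get-triple-index($config, $db),
  (: true if the collection lexicon is enabled :)
  admin:database-get-collection-lexicon($config, $db),
  (: true if triple positions are enabled :)
  admin:database-get-triple-positions($config, $db)
)
```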
Use the following procedures to verify or configure the triple index and related settings. To configure the triple positions index, the in-memory triple index size, and the collection lexicon, use the Admin Interface (http://hostname:8001) or the Admin API. See Using the Admin API for details.
When you enable the triple index for the first time, or if you are reindexing your database after enabling the triple index, only documents containing valid sem:triple elements are indexed.
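The triple positions index is needed only for queries that depend on triple positions, such as those built with cts:triple-range-query. It is not necessary to enable the triple positions index for querying with native SPARQL.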
When you change any index settings for a database, the new settings will take effect based on whether reindexing is enabled (reindexer enable set to true).
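The reindexer setting can also be inspected and changed through the Admin API. The following sketch assumes a database named Sample-Database; it turns the reindexer on only if it is currently off.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $db := admin:database-get-id($config, "Sample-Database")
return
  if (admin:database-get-reindexer-enable($config, $db))
  then "reindexer already enabled"
  else
    (: Enable the reindexer so that index setting changes are applied
       to existing content, then save the updated configuration. :)
    admin:save-configuration(
      admin:database-set-reindexer-enable($config, $db, fn:true()))
```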
Use these Admin API functions to enable the triple index, triple index positions, and configure the in-memory triple index size for your database:
This example sets the triple index of Sample-Database to true using the Admin API:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Get the configuration :)
    let $config := admin:get-configuration()
    (: Obtain the database ID of 'Sample-Database' :)
    let $Sample-Database := admin:database-get-id($config, "Sample-Database")
    let $c := admin:database-set-triple-index($config, $Sample-Database, fn:true())
    return admin:save-configuration($c)

This example uses the Admin API to set the triple positions of the database to true:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    let $config := admin:get-configuration()
    let $Sample-Database := admin:database-get-id($config, "Sample-Database")
    let $c := admin:database-set-triple-positions($config, $Sample-Database, fn:true())
    return admin:save-configuration($c)

This example sets the in-memory triple index size of the database to 256MB:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    let $config := admin:get-configuration()
    let $Sample-Database := admin:database-get-id($config, "Sample-Database")
    let $c := admin:database-set-in-memory-triple-index-size($config, $Sample-Database, 256)
    return admin:save-configuration($c)

For details about the function signatures and descriptions, see the admin:database functions (database) in the XQuery and XSLT Reference Guide.
The triple cache and the triple value cache are d-node caches, which are partitioned to reduce lock contention. This partitioning enables parallelism and speeds up processing.
The maximum sizes of the caches and number of partitions are configurable. To change the triple or triple value cache sizes for the host, you can use the Groups configuration page in the Admin Interface or use the Admin API.
In the Admin Interface (http://hostname:8001) on the Groups configuration page, specify values for cache sizes, partitions, and timeouts:
This table describes the Admin API functions for group cache configurations:
Function | Description |
---|---|
admin:group-set-triple-cache-size | Changes the triple cache size setting of the group with the specified ID to the specified value |
admin:group-set-triple-cache-partitions | Changes the triple cache partitions setting of the group with the specified ID to the specified value |
admin:group-set-triple-cache-timeout | Changes the number of seconds a triple block can be unused before being flushed from caches |
admin:group-set-triple-value-cache-timeout | Changes the number of seconds a triple value block can be unused before being flushed from caches |
admin:group-set-triple-value-cache-size | Changes the triple value cache size setting of the group with the specified ID to the specified value |
admin:group-set-triple-value-cache-partitions | Changes the triple value cache partitions setting of the group with the specified ID to the specified value |
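
The functions in the table can be combined in a single configuration update. The following sketch assumes a group named Default and uses illustrative values (a 2048 MB triple cache and a 600 second timeout); neither the group name nor the values come from this guide, so choose settings appropriate to your hosts.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumed group name and illustrative values; adjust for your cluster. :)
let $config := admin:get-configuration()
let $group := admin:group-get-id($config, "Default")
(: Maximum triple cache size for hosts in the group :)
let $config := admin:group-set-triple-cache-size($config, $group, 2048)
(: Seconds a triple block can be unused before it is flushed :)
let $config := admin:group-set-triple-cache-timeout($config, $group, 600)
return admin:save-configuration($config)
```

Threading the same $config variable through each setter keeps all changes in one saved configuration rather than several separate saves.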
During a merge, triple values and types may become unused by the triple index. To merge the triple index in a single streaming pass, type and value stores are merged before the triples. Unused values and types are identified during the merge of the triples. During the next merge, the unused types and values identified are removed, releasing the space they previously used.
For best compaction, two merges are needed. This is not an issue in normal operations because MarkLogic Server is designed to periodically merge.
Since the type store is ordered by frequency, it is merged entirely in memory. The value and triple stores are merged in a streaming fashion, from and to disk directly.
For more information about merging, see Understanding and Controlling Database Merges in the Administrator's Guide.
Since SPARQL execution does not fetch fragments, there is the potential to scale back on expanded and compressed tree caches on triple-only deployments. You can configure tree caches from the Group configuration page in the Admin Interface, or by using these functions:
- admin:group-set-expanded-tree-cache-size
- admin:group-set-compressed-tree-cache-size
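
Before scaling the tree caches back, it can help to look at their current sizes. This sketch assumes a group named Default and only reads the settings; the commented-out line shows where a setter would go, with a purely hypothetical value.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumed group name; read the current tree cache sizes. :)
let $config := admin:get-configuration()
let $group := admin:group-get-id($config, "Default")
return (
  admin:group-get-expanded-tree-cache-size($config, $group),
  admin:group-get-compressed-tree-cache-size($config, $group)
  (: To change a size, pass the configuration through a setter and save it,
     for example (hypothetical value):
     admin:save-configuration(
       admin:group-set-expanded-tree-cache-size($config, $group, 1024)) :)
)
```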
You can monitor the status of the database and forest from the database Status page in the Admin Interface:
http://hostname:8001/
You can also use the MarkLogic monitoring tools, Monitoring Dashboard and Monitoring History:
- Monitoring Dashboard: http://hostname:8002/dashboard
- Monitoring History: http://hostname:8002/history
For more information, see Using the MarkLogic Server Monitoring Dashboard in the Monitoring MarkLogic Guide.
You can also use these functions for query metrics and to monitor the status of forests and caches:
- xdmp:forest-status: cache hits or misses, hit rate, and miss rate for each stand
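
As a rough sketch of how the forest status might be inspected for triple cache activity, the query below assumes a database named Sample-Database. Rather than relying on specific element names in the status output, it selects any status element whose name contains "triple"; that filter is an assumption made for this example, so check the actual output on your system.

```xquery
xquery version "1.0-ml";

(: Assumed database name. :)
let $forests := xdmp:database-forests(xdmp:database("Sample-Database"))
for $status in xdmp:forest-status($forests)
return
  <forest name="{fn:string($status/*:forest-name)}">{
    (: Collect every status element whose local name mentions "triple",
       which should include the per-stand triple cache counters. :)
    $status//*[fn:contains(fn:local-name(.), "triple")]
  }</forest>
```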