Search Developer's Guide — Chapter 22

Collections

MarkLogic Server includes collections, which are groups of documents that enable queries to efficiently target subsets of content within a MarkLogic database.

Collections are described as part of the W3C XQuery specification, but their implementation is undefined. MarkLogic has chosen to emphasize collections as a powerful and high-performance mechanism for selecting sets of documents against which queries can be processed. This chapter introduces the collection() function, explains how collections are defined and accessed, and describes some of the basic performance characteristics with which developers should be familiar.

The collection() Function

The collection() function can be used anywhere in your XQuery that the doc() or input() functions are used. The collection() function has the following signature:

fn:collection($URI as xs:string*) as node()*

The MarkLogic Server implementation of the collection() function takes a sequence of URIs, so you can call the collection() function on one or more collections. The signature of the function in the W3C XQuery documentation takes only a single string. Also, the fn namespace is built into MarkLogic Server, so it is not necessary to prefix the function with its namespace.

To illustrate what the collection() function is used for, consider the following two XPath expressions:

fn:doc()//sonnet/line[cts:contains(., "flower")]
collection("english-lit/shakespeare")//sonnet/
                                    line[cts:contains(., "flower")]

The first expression returns a sequence of line nodes, each of which must be the child of a sonnet node, and each of which must contain the term flower, matched on a case-insensitive basis.

The second expression returns the same sequence, except that only line nodes contained within documents that are members of the english-lit/shakespeare collection are returned. MarkLogic Server optimizes this expression: the operation that uses the collection() function, along with the rest of the XPath expression, is executed very efficiently through a series of index lookups.

As mentioned previously, the collection() function accepts either a single collection, as illustrated above, or a sequence of collections, as illustrated below:

collection(("english-lit/shakespeare", 
            "american-lit/poetry"))//sonnet/
                             line[cts:contains(., "flower")]

The query above returns a sequence of line nodes that match the stated predicates and that are contained in documents belonging to either the english-lit/shakespeare collection or the american-lit/poetry collection or both. Because the collection() function accepts a sequence of URIs, its format closely matches that of the doc() function, which also takes a sequence of URIs. There is currently no XPath-level support for more complex boolean membership conditions, such as requiring membership in multiple collections (and), excluding documents that belong to certain collections (not), or requiring membership in exactly one of several collections (exclusive or). You can, however, achieve these conditions through the where clause in a surrounding FLWOR expression (see Collection Membership for an example).

Collections Versus Directories

Collections are used to organize documents in a database. You can also use directories to organize documents in a database. The key differences in using collections to organize documents versus using directories are:

  • Collections do not require member documents to conform to any URI patterns. They are not hierarchical; directories are. Any document can belong to any collection, and any document can also belong to multiple collections.
  • You can delete all documents in a collection with the xdmp:collection-delete function. Similarly, you can delete all documents in a directory (as well as all recursive subdirectories and any documents in those directories) with the xdmp:directory-delete function.
  • You cannot set properties on a collection; you can on a directory.

Except for the fact that you can use both collections and directories to organize documents, collections are unrelated to directories. For details on directories, see Properties Documents and Directories in the Application Developer's Guide.

Defining Collections

Collection membership for a document is defined implicitly. Rather than describing collections top-down (that is, specifying the list of documents that belong to a given collection), MarkLogic Server determines membership in a bottom-up fashion, by aggregating the set of documents that describe themselves as being members of the collection. You can use MarkLogic Server's security scheme to manage policies around collection membership.

Collections are named using URIs. Any URI is a legal name for a collection. The URI must be unique within the set of collections (both protected and unprotected) in your database.

The URIs that are used to name collections serve only as identifiers to the server. In particular, collections are not modeled on filesystem directories. Rather, collections are interpreted as sets, not as hierarchies. A document that belongs to collection english-lit/poetry/sonnets need not belong to collection english-lit/poetry. In fact, the existence of a collection with URI english-lit/poetry/sonnets does not imply the existence of collections with URI english-lit/poetry or URI english-lit.

There are two types of collections supported by MarkLogic Server: unprotected collections and protected collections. The two types are identical in terms of the syntactic application of the collection() function. However, differences emerge in the way they are defined, in who can access the collections, and in who can modify, add, or remove documents from them. The following subsections describe these two ways of defining collections.

Implicitly Defining Unprotected Collections

Unprotected collections are created implicitly.

When a document is first loaded into the system, the load directive (whether through XQuery or XDBC) optionally can specify the collections to which that document belongs. In that list of collections, the specification of a collection URI that has not previously been used is the only action that is needed to create that new unprotected collection.
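As a minimal sketch of a load directive that names its collections, the following uses the positional-argument form of xdmp:document-insert. The document URI and content are hypothetical, and newer server releases may prefer passing collections through an options argument instead.

```xquery
xquery version "1.0-ml";

(: Hypothetical document URI and content, for illustration only. :)
xdmp:document-insert(
  "/docs/poetry/sonnet-18.xml",
  <sonnet><line>Rough winds do shake the darling buds of May</line></sonnet>,
  xdmp:default-permissions(),   (: keep the current user's default permissions :)
  ("english-lit/shakespeare",   (: either URI becomes a new unprotected :)
   "english-lit/poetry")        (: collection if it is not already in use :)
)
```

No separate step is needed to create either collection; naming it in the load directive is sufficient.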

If collections are left unspecified in the load directive, the document is added to the database with collection membership determined by the default collections that are defined for the current user through the security model and by inheritance from the current user's roles. The invocation of these default settings can also result in the creation of a new unprotected collection. If collections are left unspecified in the load directive and the current user has no default collections defined, the document will be added to the database without belonging to any collections.

In addition, once a document is loaded into the database, you can adjust its membership in collections with any of the following built-in XQuery functions (assuming you possess the appropriate permissions to modify the document in question):

  • xdmp:document-add-collections
  • xdmp:document-remove-collections
  • xdmp:document-set-collections

If a collection URI that is not otherwise used in the database is passed as a parameter to xdmp:document-add-collections or xdmp:document-set-collections, a new unprotected collection is created.
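The three functions listed above might be used as follows. This is a sketch with a hypothetical document URI; the statements are separated by semicolons so that each runs as its own transaction, which avoids making conflicting updates to the same document in a single statement.

```xquery
xquery version "1.0-ml";
(: Add the document to one more collection, keeping existing memberships. :)
xdmp:document-add-collections("/docs/poetry/sonnet-18.xml",
  "english-lit/sonnets");

xquery version "1.0-ml";
(: Remove the document from a single collection. :)
xdmp:document-remove-collections("/docs/poetry/sonnet-18.xml",
  "english-lit/poetry");

xquery version "1.0-ml";
(: Replace the entire membership list in one step. :)
xdmp:document-set-collections("/docs/poetry/sonnet-18.xml",
  ("english-lit/shakespeare", "english-lit/sonnets"));

xquery version "1.0-ml";
(: Read the resulting memberships back. :)
xdmp:document-get-collections("/docs/poetry/sonnet-18.xml")
```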

Unprotected collections disappear when there are no documents in the database that are members. Consequently, using xdmp:document-remove-collections, xdmp:document-set-collections or xdmp:document-delete may result in unprotected collections disappearing.

The xdmp:collection-delete function, which deletes every document in a database that belongs to a particular collection (assuming that the current user has the required permissions on a per-document basis), always results in the specified unprotected collection disappearing.

The xdmp:collection-delete function will delete all documents in a collection, regardless of their membership in other collections.
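Because the delete ignores other memberships, it can be worth checking first which documents would be removed from other collections as a side effect. The following is one possible pre-check, assuming the target collection is small enough to iterate over; the collection URI is a placeholder.

```xquery
xquery version "1.0-ml";

(: List members of the target collection that also belong elsewhere. :)
let $target := "english-lit/shakespeare"
for $doc in fn:collection($target)
let $uri := xdmp:node-uri($doc)
(: memberships other than the collection about to be deleted :)
let $others := xdmp:document-get-collections($uri)[. ne $target]
where fn:exists($others)
return fn:concat($uri, " also in: ", fn:string-join($others, ", "))
```

After reviewing the result, the delete itself would be issued in a separate request as xdmp:collection-delete("english-lit/shakespeare").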

Explicitly Defining Protected Collections

Protected collections are created explicitly.

Protected collections afford certain security protections not available with unprotected collections (see Collections and Security). Consequently, rather than the implicit model described above, protected collections must be explicitly defined using the Admin Interface before any documents are assigned to that collection.

Once a protected collection and its security policies have been defined, documents can be added to that collection through the same mechanisms as described above for unprotected collections. However, in addition to the appropriate permissions to modify the document, the user also needs the appropriate permissions to modify the protected collection. The permissions on a protected collection do not provide document-level security; they only prevent unprivileged users from adding documents to the collection.

Just as protected collections must be created explicitly, they must also be removed explicitly: a protected collection does not disappear when the state of the database changes so that no documents currently belong to it. To remove a protected collection from the database, the Admin Interface must be used to delete that collection's definition.

Collection Membership

As described above, the collections (unprotected and protected) to which a specific document belongs can be specified at load-time and can be modified once the document has been loaded into the database. Documents can belong to many collections simultaneously.

If specific collections are not defined at load-time, the server will automatically assign collection membership for the document based on both the user's and the user's aggregate roles' default collection membership settings. To load a document that does not belong to any collections, explicitly specify the empty sequence as the collections parameter.
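A sketch of loading a document with no collection membership at all, assuming the positional-argument form of xdmp:document-insert and a hypothetical document URI:

```xquery
xquery version "1.0-ml";

(: Hypothetical URI and content. The fourth argument is an explicit empty
   sequence, so the user's default collections are not applied. :)
xdmp:document-insert(
  "/docs/drafts/untitled.xml",
  <draft/>,
  xdmp:default-permissions(),
  ()
)
```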

Collection membership can be leveraged in any XPath expression in which the collection(), doc(), or input() functions are used. In addition, collection membership for a particular document can be queried by passing the document's URI to the xdmp:document-get-collections built-in. For a node, first obtain the URI of its containing document with xdmp:node-uri.

For example, the following expression returns a sequence of line nodes, each of which must be the child of a sonnet node, and each of which must contain the term flower, matched on a case-insensitive basis, that belong to either the english-lit/shakespeare collection or the american-lit/poetry collection or both:

collection(("english-lit/shakespeare", 
            "american-lit/poetry"))//sonnet/
                                line[cts:contains(., "flower")]

By contrast, the following expression returns a similar sequence of line nodes, except that the resulting nodes must belong to either the english-lit/poetry collection or the american-lit/poetry collection or both, but not to the english-lit/shakespeare collection:

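for $line in collection(("english-lit/poetry", "american-lit/poetry"))
               //sonnet/line[cts:contains(., "flower")]
(: look up collections by the URI of the document that contains the line;
   fn:not(... = ...) is true only when no membership matches :)
where fn:not(xdmp:document-get-collections(xdmp:node-uri($line)) =
             "english-lit/shakespeare")
return $line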
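Requiring membership in more than one collection (and) can be sketched with the same where-clause approach. The example below assumes the same collection names used in this section and uses xdmp:node-uri to find the document that contains each line.

```xquery
xquery version "1.0-ml";

(: Lines from documents that belong to BOTH english-lit/poetry
   and english-lit/shakespeare. :)
for $line in fn:collection("english-lit/poetry")
               //sonnet/line[cts:contains(., "flower")]
let $memberships := xdmp:document-get-collections(xdmp:node-uri($line))
(: a general comparison is true if any membership equals the string :)
where $memberships = "english-lit/shakespeare"
return $line
```

The collection() call narrows the candidates through the index first, and the where clause then filters on the second membership.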

Collections and Security

Collections interact with the MarkLogic Server security model in three basic ways:

  • All users and roles can optionally specify default collections. These are the collections to which newly inserted documents are added if collections are not explicitly specified at load-time.
  • Adding a document to a collection--both at load-time and after the document has been loaded into the database--is contingent on the user possessing permissions to insert or update the document in question.
  • Removing a document from a collection and using xdmp:collection-delete are similarly contingent on the user's having appropriate permissions to update the document(s) in question.

Protected collections interact with the MarkLogic Server security model in four additional ways:

  • Protected collections can be configured using the security module of the Admin Interface or by means of the POST /manage/v2/protected-collections REST endpoint.
  • Protected collections specify the roles that have read, insert and/or update permissions for the protected collection.
  • Collection permissions control who can add documents to a protected collection, but they do not provide document access control. You must use document permissions to control document access. For example, a user with read permissions on a document in a protected collection can read the document whether or not they have any permissions on the collection.
  • You can only add a document to a protected collection if you have insert or update permissions on the collection, as well as appropriate document permissions.

Unprotected Collections

To add to the database a new document that belongs to one or more unprotected collections, the user must have (directly or indirectly) the permissions required to add the document. This means that the user must either possess the admin role or have both of the following:

  • The privilege to execute the xdmp:document-load function, if that is the document insertion directive being used.
  • Either the unprotected-uri privilege, the any-uri privilege, or an appropriate URI privilege on the specific path of the document to be inserted. For example, if the document being inserted has the URI /docs/poetry/love.xml, the appropriate URI privileges are /, /docs, /docs/poetry.

To modify the set of collections to which a document belongs, the user must either possess the admin role or have update permissions on the document.

To access an unprotected collection in an XPath expression, no special permissions are used. Access to each of the individual documents that belong to the specified collection is governed by that individual document's read permissions.

Protected Collections

Protected collections enable you to control additions to a collection. They do not provide document access control.

To add to the database a new document that belongs to one or more protected collections, the user must have (directly or indirectly) the permissions required to add the document as well as the permissions required to add to the protected collection(s). This means that the user must either possess the admin role or have all of the following:

  • The insert permission on the protected collection.
  • The privilege to execute the xdmp:document-load function, if that is the document insertion directive being used.
  • Either the unprotected-uri privilege, the any-uri privilege, or an appropriate URI privilege on the specific path of the document to be inserted. For example, if the document being inserted has the URI /docs/poetry/love.xml, the appropriate URI privileges are /, /docs, /docs/poetry.

To modify the set of protected collections to which a document belongs, the user must either possess the admin role or have:

  • Update permissions on the collection
  • Update permissions on the document

Collection permissions only affect collection membership operations. Access to the documents in a collection, protected or otherwise, is controlled by document permissions. A user with no permissions on a protected collection can still read, search, update, or delete a document in the protected collection if they have sufficient document permissions.

The user can convert an unprotected collection into a protected collection using the sec:protect-collection function in the Security Function Library module. Access to this function depends on the user's having the protect-collection privilege.

The user can convert a protected collection into an unprotected collection using the sec:unprotect-collection function in the same module. Access to this function depends on the user's having the unprotect-collection privilege and update permissions on the protected collection.
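As an illustration, protecting an existing collection might look like the following. This is a sketch that assumes the query is evaluated against the security database by a user with the required privilege; the role name "editor" is hypothetical.

```xquery
xquery version "1.0-ml";

import module namespace sec = "http://marklogic.com/xdmp/security"
  at "/MarkLogic/security.xqy";

(: "editor" is a hypothetical role; substitute a role from your own
   security configuration. :)
sec:protect-collection(
  "english-lit/shakespeare",
  (xdmp:permission("editor", "insert"),
   xdmp:permission("editor", "update"))
)
```

The reverse operation would be a call to sec:unprotect-collection("english-lit/shakespeare") in a separate request.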

Performance Characteristics

MarkLogic's implementation of collections is designed to optimize query performance against large volumes of documents. As with all designs, the implementation involves some trade-offs. This section provides a brief overview of the performance characteristics of collections.

Number of Collections to Which a Document Belongs

At document load time, collection information is embedded into the document and stored in the database.

This design enables a MarkLogic database to handle millions of collections without difficulty. It also enables the collection() function itself to be extremely efficient, able to subset large datasets by collection with a single index operation. If the collection() function specifies more than one collection, an additional index operation is required for each collection specified. Assuming queries target similar collections, these index operations should be resolved within cache at extremely high performance.

One trade-off with this design is a practical constraint on the number of collections to which a single document should belong. While there is no architectural limit, the size of the database will grow as the average number of collections per document increases. This database growth is driven by an increase in the size of individual document fragments. The fragment size increases because each collection to which the document belongs embeds a small amount of information in the fragment. As fragments grow, the corresponding storage I/O time increases, resulting in performance degradation. It is important to note that the average number of collections per document does not impact index resolution time, merely the time to retrieve the content (fragments) from storage.

A practical guideline is that a document with fragments averaging 50K in size should not belong to more than 100 collections. This should keep the average fragment size increase to less than 10%.
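The guideline implies a rough per-collection budget, which can be worked out from its own numbers. The calculation below is only a back-of-the-envelope sketch and assumes that 50K means 50 * 1024 bytes.

```xquery
xquery version "1.0-ml";

let $fragment-bytes := 50 * 1024   (: assumed meaning of "50K" :)
let $max-growth     := 0.10        (: the 10% ceiling from the guideline :)
let $collections    := 100         (: the membership ceiling from the guideline :)
(: bytes of growth available per collection membership :)
return ($fragment-bytes * $max-growth) div $collections
```

In other words, staying within the guideline corresponds to an overhead on the order of 50 bytes per collection membership in each fragment.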

Adding/Removing Existing Documents To/From Collections

A second trade-off with MarkLogic's implementation of collections is that adding or removing documents from collections once those documents are already in the database can be relatively resource-intensive. Changing the collections to which a document belongs requires rewriting every fragment of the document. For large documents, this can be demanding on both CPU and I/O resources. If collection membership is highly dynamic in your application, a better approach may be to use elements within the document itself to characterize membership.
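A sketch of the element-based alternative, assuming each sonnet carries a hypothetical category child element whose value plays the role that a collection URI would otherwise play:

```xquery
xquery version "1.0-ml";

(: <category> is a hypothetical element name, not part of any schema
   described in this guide. :)
fn:doc()//sonnet[category = "english-lit/shakespeare"]
        /line[cts:contains(., "flower")]
```

With this approach, membership is selected by an ordinary predicate on document content instead of by the collection() function.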
