When loading data into a database, you have the option of specifying how XML documents are partitioned for storage into smaller blocks of information called fragments. For large XML documents, size can be an issue, and using fragments may help manage performance of your system. In general, fragments for XML documents should be sized between 10K and 100K. Fragments set too small or too big can slow down performance, so proper fragment sizing is important.
The actual fragmentation of an XML document is completely transparent to an application developer. At the application level, the document appears to be a single integral structure, regardless of how it is stored and managed as fragments on disk. Fragmentation is an application-transparent tuning mechanism.
However, fragmentation does impact relevance ranking. The relevance-ranking algorithm considers both term frequency within a target piece of content and overall term frequency within the database to rank results by relevance. Rather than consider term frequency across the entire XML document for ranking purposes, MarkLogic Server considers term frequency within the individual fragment (and its descendants) being ranked. Consequently, different fragmentation strategies may impact relevance rankings--particularly in situations when a single fragment may straddle multiple XML structures that you are trying to differentiate on a relevance basis.
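The effect can be illustrated with a toy calculation. The sketch below is not MarkLogic Server's scoring algorithm; it only counts how often a term occurs relative to the number of words, first per fragment and then for the whole document. The function `term_frequency`, the fragment names, and the sample text are all hypothetical.

```python
import re

def term_frequency(text, term):
    """Fraction of the words in `text` that equal `term` (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)

# Hypothetical document stored as two fragments.
fragments = {
    "chapter-1": "whale whale whale ship",
    "chapter-2": "ship sea storm sail harbor crew",
}
whole_document = " ".join(fragments.values())

# Frequency of "whale" measured inside each fragment...
per_fragment = {name: term_frequency(text, "whale")
                for name, text in fragments.items()}
# ...versus measured across the unfragmented document.
per_document = term_frequency(whole_document, "whale")
```

In this toy data the term is concentrated in one fragment, so its per-fragment frequency is much higher than its frequency across the whole document. A different fragmentation of the same text would produce different per-fragment numbers.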
With MarkLogic Server, you specify fragmentation rules that are used to partition your XML documents. These rules are applied one document at a time. However, fragmentation rules are specified at the database level--on the assumption that databases contain many documents with similar structures where the same fragmentation rules should be applied.
Fragmentation rules are applied to documents during document loads, updates, and database reindexing. Specifying additional fragmentation rules after documents have been loaded does not immediately change the fragmentation of existing documents; the new rules are used the next time those documents are updated or reindexed. If reindex enable is set to true, however, the documents will eventually be reindexed and take on the new fragmentation policy. As a result, if you want to change the fragmentation rules for already loaded content, you must reload your documents or reindex the database so that the new fragmentation rules can take effect.
Use the following procedures for managing fragmentation rules:
Proper fragmentation is important to performance. Before you specify how to fragment the XML data being loaded, you need to plan your fragmentation strategy. Apply the following guidelines:
Set the in memory tree size parameter to 1 to 2 MB larger than your largest text or small binary file. The largest small binary file size is always constrained by the large size threshold database configuration setting.
After you decide how to fragment your data, you can use either of the following methods:
Both methods turn your fragmentation strategy into concrete rules for the system.
If a document contains many instances of an XML structure that share a common element name, then these structures make sensible fragments. With MarkLogic Server, you can use this common element name as a fragment root.
The following diagram shows an XML document rooted at <CitationSet> that contains many instances of a <Citation> node. Each <Citation> node contains further XML and averages between 15K and 20K in size. Based on this information, <Citation> is a sensible element to use as a fragment root:
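The same reasoning can be sketched in Python. This is only a simulation built on the standard library's ElementTree, not how MarkLogic Server fragments documents internally. The helper `split_on_fragment_root`, the sample document, and the reading of "K" as 1024 bytes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: "K" in the sizing guideline means 1024 bytes.
MIN_FRAGMENT_BYTES = 10 * 1024
MAX_FRAGMENT_BYTES = 100 * 1024

def split_on_fragment_root(xml_text, root_name):
    """Return the serialized subtree of every element named `root_name`."""
    document = ET.fromstring(xml_text)
    return [ET.tostring(element) for element in document.iter(root_name)]

# Hypothetical <CitationSet> with three <Citation> nodes of roughly 16K each.
padding = "x" * (16 * 1024)
sample = "<CitationSet>" + "".join(
    f"<Citation><Title>Paper {i}</Title><Abstract>{padding}</Abstract></Citation>"
    for i in range(3)
) + "</CitationSet>"

fragments = split_on_fragment_root(sample, "Citation")
sizes = [len(fragment) for fragment in fragments]
# True for each fragment whose size falls inside the recommended range.
in_range = [MIN_FRAGMENT_BYTES <= size <= MAX_FRAGMENT_BYTES for size in sizes]
```

Running a check like this against a representative document before loading is one way to confirm that a candidate element name yields fragments of a reasonable size.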
If your document contains many different XML substructures, each of which is a good candidate to be a fragment, then it would be time-consuming to specify each substructure as a fragment root. Instead, you can specify fragments by setting the parent of these substructures to be a fragment parent--so that every substructure under this parent becomes a separate fragment, regardless of its name.
The following diagram shows a document with substructures of different names:
In this case, you can use the <Products> element as a fragment parent, and the <Books>, <Movies>, <Music>, <Games>, and <Toys> children automatically become fragments.
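A minimal Python sketch of the fragment parent idea follows. It simulates the rule with ElementTree and is not MarkLogic Server's implementation. The helper `split_on_fragment_parent` and the contents of the sample document are hypothetical; only the element names come from the example above.

```python
import xml.etree.ElementTree as ET

def split_on_fragment_parent(xml_text, parent_name):
    """Return (tag, serialized subtree) for each direct child of every
    element named `parent_name`, whatever the child is called."""
    document = ET.fromstring(xml_text)
    fragments = []
    for parent in document.iter(parent_name):
        for child in parent:
            fragments.append((child.tag, ET.tostring(child, encoding="unicode")))
    return fragments

# Hypothetical document with differently named substructures.
sample = (
    "<Products>"
    "<Books><Title>A</Title></Books>"
    "<Movies><Title>B</Title></Movies>"
    "<Music><Title>C</Title></Music>"
    "<Games><Title>D</Title></Games>"
    "<Toys><Title>E</Title></Toys>"
    "</Products>"
)

fragments = split_on_fragment_parent(sample, "Products")
fragment_names = [tag for tag, _ in fragments]
```

A single parent rule covers all five children here, and it would also cover any new child element added under <Products> later without a further rule.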
To define a rule for a fragment root, complete the following procedure:
Every XML element is associated with a namespace. For the fragment rule to be precise, you must specify the namespace of the XML element. Leaving the namespace URI field blank specifies the universal unnamed namespace.
Alternatively, you can specify that the rule for the fragment root is namespace independent by putting an asterisk (*) in the namespace URI field.
The local name is the name of the XML element used as the root of a fragment. If you have more than one fragment root rule associated with the specified namespace, you can provide a comma-separated list of element names.
The new fragment root rules are added to the database. These rules are applied to XML documents loaded into the specified database from this point on.
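The namespace and local name matching described above can be sketched as follows. This models only the behavior stated in this section (a blank namespace URI matches elements in no namespace, `*` matches any namespace, and local names may be a comma-separated list). The functions `parse_tag` and `matches_rule` are hypothetical helpers, not part of any MarkLogic API, and the tag format `{uri}local` is the one used by Python's ElementTree.

```python
def parse_tag(tag):
    """Split an ElementTree-style tag '{uri}local' into (uri, local).
    Tags without a namespace return an empty URI."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def matches_rule(tag, namespace_uri, local_names):
    """Return True if `tag` is selected by a rule with the given
    namespace URI field and comma-separated local name field."""
    uri, local = parse_tag(tag)
    names = [name.strip() for name in local_names.split(",")]
    # "*" makes the rule namespace independent; otherwise the URI must match.
    namespace_ok = namespace_uri == "*" or namespace_uri == uri
    return namespace_ok and local in names
```

For example, with the hypothetical namespace `http://example.com/ns`, `matches_rule("{http://example.com/ns}Citation", "", "Citation")` is false because the blank field selects only elements with no namespace, while the same call with `"*"` in the namespace field is true.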
To define a rule for a fragment parent, perform the following steps:
Every XML element is associated with a namespace. For the fragment rule to be precise, you must specify the namespace of the XML element. Leaving the namespace URI field blank specifies the universal unnamed namespace.
Alternatively, you can specify that the rule for the fragment parent is namespace independent by putting an asterisk (*) in the namespace URI field.
The local name is the name of the parent XML element whose children will be fragment roots. If you have more than one fragment parent rule associated with the specified namespace, you can provide a comma-separated list of element names.
The new fragment rules are added to the database. These rules are applied to XML documents loaded into the specified database from this point on.
To view fragment rules that are in effect, perform the following steps:
The following example shows that the Documents database has only one rule defined for a fragment parent. The rule states that any direct child of an <RDF> element, regardless of the namespace for the <RDF> element, should form the root of a fragment:
To delete fragment rules for a specific database, perform the following steps:
The fragment rule is dropped from the database.
Deleting fragment rules has no impact on the fragmentation that has already been applied to documents loaded into the database, unless reindexing is enabled for the database.