Phrasing Control
By default, MarkLogic Server assumes that any XML element constructor acts as a phrase boundary. This means that phrase searches (that is, searches for sequences of terms) will not match a sequence of terms that contains one or more XML element constructors. Phrasing control lets you specify which XML elements should be transparent to phrasing, so that they do not create phrase boundaries (for example, a bold or italic element), and which XML elements should be ignored for phrase purposes (for example, footnotes or graphic captions).
For example, consider this sample XML fragment:
<paragraph> These two words <italic>are italicized</italic>. The italic element <footnote>Elements are defined in the W3C XML standard.</footnote> is a standard part of this document's schema. </paragraph>
By default, MarkLogic Server would extract the following five sequences of text for phrase matching purposes (ignoring punctuation and case for simplicity):
“these two words”
“are italicized”
“the italic element”
“elements are defined in the w3c xml standard”
“is a standard part of this document's schema”
If you then attempted to match the phrases “words are italicized” or “element is a standard part” against this XML fragment, no matches would be found, because of the embedded XML element constructors.
In fact, a human looking at this XML fragment would realize that the <italic> element should be transparent for phrasing purposes, and that the <footnote> element is a completely independent text container. Seen from this viewpoint, the XML fragment shown above contains only two text sequences (again, ignoring punctuation and case for simplicity):
“these two words are italicized the italic element is a standard part of this document's schema”
“elements are defined in the w3c xml standard”
In this case, “words are italicized” and “element is a standard part” would each properly generate a match. But a search for “the w3c xml standard is a standard” would not result in a match.
MarkLogic Server lets you achieve this type of phrasing control by specifying particular XML element names as phrase-through, phrase-around, and element-word-query-through elements:
| Type | Definition |
|---|---|
| phrase-through | Elements that should not create phrase boundaries (as in the example above, <italic>) |
| phrase-around | Elements whose content should be completely ignored in the context of the current phrase (as in the example above, <footnote>) |
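The effect of these two settings on the sample fragment can be illustrated with a small Python sketch. This is only a conceptual model, not MarkLogic Server's implementation or API: the function names `phrase_sequences` and `matches_phrase` and the simple tokenizer are hypothetical, introduced here for illustration. The sketch walks the XML tree and builds the text sequences that a phrase search could match, first with the default behavior and then with `italic` treated as phrase-through and `footnote` as phrase-around.

```python
import re
import xml.etree.ElementTree as ET

# The sample fragment from the text above.
FRAGMENT = (
    "<paragraph> These two words <italic>are italicized</italic>. "
    "The italic element <footnote>Elements are defined in the W3C XML "
    "standard.</footnote> is a standard part of this document's schema. "
    "</paragraph>"
)


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", (text or "").lower())


def phrase_sequences(xml_text, phrase_through=(), phrase_around=()):
    """Return the text sequences available for phrase matching.

    Hypothetical model: element names in phrase_through create no boundary;
    names in phrase_around form their own sequence while the surrounding
    sequence continues past them; every other element is a boundary.
    """
    sequences = []

    def new_sequence():
        seq = []
        sequences.append(seq)
        return seq

    def walk(elem, current):
        # Returns the sequence that is still open after this element.
        current.extend(tokenize(elem.text))
        for child in elem:
            if child.tag in phrase_through:
                current = walk(child, current)   # transparent element
            elif child.tag in phrase_around:
                walk(child, new_sequence())      # independent container
            else:
                walk(child, new_sequence())      # default: phrase boundary
                current = new_sequence()         # text after it starts fresh
            current.extend(tokenize(child.tail))
        return current

    walk(ET.fromstring(xml_text), new_sequence())
    return [" ".join(seq) for seq in sequences if seq]


def matches_phrase(phrase, sequences):
    """True if the phrase occurs as consecutive words in any one sequence."""
    needle = tokenize(phrase)
    if not needle:
        return False
    for seq in sequences:
        words = seq.split()
        for i in range(len(words) - len(needle) + 1):
            if words[i:i + len(needle)] == needle:
                return True
    return False


default_seqs = phrase_sequences(FRAGMENT)
configured_seqs = phrase_sequences(
    FRAGMENT, phrase_through={"italic"}, phrase_around={"footnote"}
)
```

Under these assumptions, `default_seqs` holds the five sequences listed earlier, so `matches_phrase("words are italicized", default_seqs)` is false. With the configured sets, `configured_seqs` holds the two sequences, and the same phrase matches, while "the w3c xml standard is a standard" still does not.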
Phrase controls are configured on a per-database basis. You should complete this configuration before loading any documents into the specified database; otherwise, in order for the changes to take effect with your existing content, you must either reload the content or reindex the database after changing the configuration.