Search Developer's Guide — Chapter 10

Browsing With Lexicons

MarkLogic Server allows you to create lexicons, which are lists of unique words or values, either throughout an entire database (words only) or within named elements or attributes (words or values). You can also define lexicons that allow quick access to the document and collection URIs in the database, and you can create word lexicons on named fields. This chapter describes the lexicons you can create in MarkLogic Server and how to use the API to browse through them. This chapter includes the following sections:

  • About Lexicons
  • Creating Lexicons
  • Word Lexicons
  • Element/Element-Attribute/Path Value Lexicons
  • Field Value Lexicons
  • Value Co-Occurrences Lexicons
  • Geospatial Lexicons
  • Range Lexicons
  • URI and Collection Lexicons
  • Performing Lexicon-Based Queries

About Lexicons

A word lexicon stores all of the unique, case-sensitive, diacritic-sensitive words, either in a database, in an element defined by a QName, or in an attribute defined by a QName. A value lexicon stores all of the unique values for an element or an attribute defined by a QName (that is, the entire and exact contents of the specified element or attribute). A value co-occurrences lexicon stores all of the pairs of values that appear in the same fragment. A geospatial lexicon returns geospatial values from the geospatial index. A range lexicon stores buckets of values that occur within a specified range of values. A URI lexicon stores the URIs of the documents in a database, and a collection lexicon stores the URIs of all collections in a database.

All lexicons determine their order and uniqueness based on the collation specified (for xs:string types), and you can create multiple lexicons on the same object with different collations. For information on collations, see Collations. You can also create value lexicons on non-string values.

All of these types of lexicons have the following characteristics:

  • Lexicon terms and values are case-sensitive.
  • Lexicon terms and values are unstemmed.
  • Lexicon terms and values are diacritic-sensitive.
  • Lexicon terms and values do not have any relevance information associated with them.
  • Uniqueness in lexicons is based on the specified collation of the lexicon.
  • Lexicon terms in word lexicons do not include any punctuation. For example, the term case-sensitive in a database will be two terms in the lexicon: case and sensitive.
  • Lexicon values in value lexicons do include punctuation.
  • In order to perform lexicon-based queries, the appropriate lexicon must be created. If the lexicon has not been created, the lexicon query will throw an exception.
  • Lexicons are used with the Search API to create constraints. Lexicons based on range indexes are used to create value constraints, which are used for facets. For details on the Search API, constraints, and facets, see Search API: Understanding and Using.

Even though the lexicons store terms case-sensitive, unstemmed, and diacritic-sensitive, you can still do case-insensitive and diacritic-insensitive lexicon-based queries by specifying the appropriate option(s). For details on the syntax, see the MarkLogic XQuery and XSLT Function Reference.
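As a hedged illustration of such options, the following sketch assumes that the database word lexicon is enabled and uses resum* purely as a placeholder pattern. Passing the case-insensitive and diacritic-insensitive options to cts:word-match lets lexicon entries that differ only in case or diacritics match the same pattern:

```xquery
xquery version "1.0-ml";

(: Assumption: the database word lexicon is enabled and the content is indexed.
   "resum*" is only a placeholder pattern for illustration. :)
cts:word-match("resum*", ("case-insensitive", "diacritic-insensitive"))
```

The lexicon itself still stores each distinct spelling; the options only affect how the pattern is compared against the stored terms.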

Creating Lexicons

You must create the appropriate lexicon before you can run lexicon-based queries. You can create lexicons using the Admin Interface or the Admin API. For detailed information on creating lexicons, see the Text Indexing and Element/Attribute Range Indexes and Lexicons chapters of the Administrator's Guide. You must complete at least one of the following tasks before you can successfully run lexicon-based queries:

  • Create/enable the lexicon before you load data into the database, or
  • Reindex the database after creating/enabling the lexicon, or
  • Reload the data after creating/enabling the lexicon.

The following is a brief summary of how to create each of the various types of lexicons:

  • To create a word lexicon for the entire database, enable the word lexicon setting on the Admin Interface Database Configuration page (Databases > db_name) and specify a collation for the lexicon (for example, http://marklogic.com/collation/ for the UCA Root Collation).
  • To create an element word lexicon, specify the element namespace URI, local name, and collation on the Admin Interface Element Word Lexicon Configuration page (Databases > db_name > Element Word Lexicons).
  • To create an element attribute word lexicon, specify the element and attribute namespace URIs, local names, and collation on the Admin Interface Element Attribute Word Lexicon Configuration page (Databases > db_name > Attribute Word Lexicons).
  • To create an element value lexicon, specify the element namespace URI and local name, the collation (for xs:string), and the type (for example, xs:string) on the Admin Interface Range Element Index Configuration page (Databases > db_name > Element Indexes).
  • To create an element attribute value lexicon, specify the element and attribute namespace URIs and local names, the collation (for xs:string), and the type (for example, xs:string) on the Admin Interface Range Element-Attribute Index Configuration page (Databases > db_name > Attribute Indexes).
  • To create a field value lexicon, first create a field in the Admin Interface (Databases > db_name > Fields). Then create the field value lexicon by specifying the type (for example, xs:string) and the field name on the Admin Interface Field Range Index Configuration page (Databases > db_name > Field Range Indexes).

    If your system is set to reindex/refragment, newly created lexicons will not be available until reindexing is completed.
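The Admin API route can be sketched as follows. This is a minimal example under stated assumptions, not a definitive procedure: the database name Documents and the element person in the http://marklogic.com/entity namespace are placeholders, and the query must run as a user with administrative privileges. It adds a string element range index, which is what backs an element value lexicon:

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumptions: a database named "Documents" exists, and the element to index
   is person in the http://marklogic.com/entity namespace. :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
(: A string range index on the element serves as its element value lexicon. :)
let $index := admin:database-range-element-index(
  "string",                            (: scalar type :)
  "http://marklogic.com/entity",       (: element namespace URI :)
  "person",                            (: element local name :)
  "http://marklogic.com/collation/",   (: collation, here UCA Root :)
  fn:false())                          (: range value positions :)
let $config := admin:database-add-range-element-index($config, $db, $index)
return admin:save-configuration($config)
```

If the database already contains content, the new lexicon becomes usable only after reindexing completes.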

Word Lexicons

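There are several types of word lexicons:

  • Word Lexicon for the Entire Database
  • Element/Element-Attribute Word Lexicons
  • JSON Property Word Lexicons
  • Field Word Lexicons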

Word Lexicon for the Entire Database

A word lexicon covers the entire database, and holds all of the unique terms in the database, with uniqueness determined by the specified collation. You enable the word lexicon on the database page of the Admin Interface by enabling the word lexicon database setting. If the database already has content loaded, you must reindex the database before you can perform any lexicon queries. The following are the APIs for the word lexicon:

  • cts:words
  • cts:word-match
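A minimal sketch of browsing the database word lexicon follows. It assumes the word lexicon is enabled and the content is indexed; the starting word lex and the pattern lex* are placeholders:

```xquery
xquery version "1.0-ml";

(: Assumption: the database word lexicon is enabled and the content is indexed. :)
(
  (: up to 10 lexicon words in collation order, starting at "lex" :)
  cts:words("lex", "limit=10"),
  (: lexicon words that match a wildcard pattern :)
  cts:word-match("lex*")
)
```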

Element/Element-Attribute Word Lexicons

An XML element word lexicon or an XML element-attribute word lexicon contains all of the unique terms in the specified element or attribute, with uniqueness determined by the specified collation. An element word lexicon contains only the words that exist in immediate text node children of the specified element, plus words in text node children of elements defined in the Admin Interface as element-word-query-throughs or phrase-throughs; it does not include words from any other children of the specified element.

Create element and element-attribute word lexicons in the Admin Interface with the Element Word Lexicons and Attribute Word Lexicons links under the database in which you want to create the lexicons. You can also use the following Admin API functions:
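For instance, an element word lexicon might be added with the Admin API roughly as shown below. Treat this as a sketch under assumptions: the database name Documents and the no-namespace element TITLE are placeholders, and the query needs administrative privileges:

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumptions: a database named "Documents" and a no-namespace element TITLE. :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")
let $lexicon := admin:database-element-word-lexicon(
  "",                                  (: element namespace URI (none) :)
  "TITLE",                             (: element local name :)
  "http://marklogic.com/collation/")   (: collation, here UCA Root :)
let $config := admin:database-add-element-word-lexicon($config, $db, $lexicon)
return admin:save-configuration($config)
```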

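Use the following functions to query element and element-attribute word lexicons:

  • cts:element-words
  • cts:element-word-match
  • cts:element-attribute-words
  • cts:element-attribute-word-match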
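The sketch below shows one way these lookups might look. It assumes an element word lexicon on a no-namespace TITLE element and an element-attribute word lexicon on the name attribute of a SPEAKER element; both names are hypothetical. It also assumes each lexicon's collation matches the default collation of the query (otherwise, pass a collation=URI option):

```xquery
xquery version "1.0-ml";

(: Assumptions: an element word lexicon on TITLE and an element-attribute
   word lexicon on SPEAKER/@name, both hypothetical names. :)
(
  (: up to 10 words from the TITLE lexicon, starting at "the" :)
  cts:element-words(xs:QName("TITLE"), "the", "limit=10"),
  (: words in the TITLE lexicon that match a wildcard pattern :)
  cts:element-word-match(xs:QName("TITLE"), "trag*"),
  (: first 10 words from the attribute lexicon :)
  cts:element-attribute-words(xs:QName("SPEAKER"), xs:QName("name"))[1 to 10]
)
```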

JSON Property Word Lexicons

A JSON property word lexicon contains all of the unique terms in the specified JSON property, with uniqueness determined by the specified collation. A JSON property word lexicon only contains words occurring in string values of the specified JSON property.

Create a JSON property word lexicon using the interfaces for XML element word lexicons. To create a lexicon with the Admin Interface, use the Element Word Lexicons section under the database in which you want to create the lexicon. To create a lexicon with the Admin API, use the function admin:database-add-element-word-lexicon.

Use the following functions to query JSON property word lexicons:

  • cts:json-property-words
  • cts:json-property-word-match
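As a brief sketch, assuming a word lexicon has been configured for a JSON property named title (a placeholder name), the lookups take the property name as a string rather than a QName:

```xquery
xquery version "1.0-ml";

(: Assumption: a word lexicon exists for the JSON property "title". :)
(
  (: up to 10 words from the property's lexicon :)
  cts:json-property-words("title", (), "limit=10"),
  (: words in the property's lexicon that match a wildcard pattern :)
  cts:json-property-word-match("title", "mar*")
)
```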

Field Word Lexicons

A field is a named object that you create at the database level, and it defines a set of elements which can be accessed together through the field. You can create word lexicons on fields, which list all of the unique words that are included in the field. You can create field word lexicons in the configuration page for each field. Like all other lexicons, field word lexicons are unique to a collation, and you can, if you need to, create multiple lexicons in different collations. For details on fields, see Fields Database Settings in the Administrator's Guide. The following are the APIs for the field word lexicons:

  • cts:field-words
  • cts:field-word-match
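A field word lexicon lookup might look like the following sketch, which assumes a field named summary (hypothetical) with a field word lexicon configured on it:

```xquery
xquery version "1.0-ml";

(: Assumption: a field named "summary" exists and has a field word lexicon. :)
(
  (: up to 10 words from the field's lexicon :)
  cts:field-words("summary", (), "limit=10"),
  (: words in the field's lexicon that match a wildcard pattern :)
  cts:field-word-match("summary", "mar*")
)
```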

Element/Element-Attribute/Path Value Lexicons

An element value lexicon, element-attribute value lexicon, or path value lexicon contains all of the unique values in the specified element or attribute. The values are the entire and exact contents of the specified element or attribute. You create element and element-attribute value lexicons in the Admin Interface by creating a range index of a particular type (for example, xs:string) on the element or attribute for which you want the value lexicon. The following are the APIs for the element, element-attribute, and path value lexicons:

  • cts:element-values
  • cts:element-value-match
  • cts:element-attribute-values
  • cts:element-attribute-value-match
  • cts:values
  • cts:value-match

The cts:element-values and cts:element-value-match functions are used to return values from element value lexicons implemented using element range indexes. The cts:element-attribute-values and cts:element-attribute-value-match functions are used to return values from attribute value lexicons implemented using attribute range indexes. The cts:values and cts:value-match functions are used to return values from path value lexicons implemented using path range indexes. A path value lexicon can be defined on either an element or an attribute.

You can only create element value lexicons on simple elements (that is, the elements cannot have any element children).

When you have a value lexicon on an element or an attribute, you can also use the cts:frequency API to get fast and accurate counts of how many times the value occurs. You can either get counts of the number of fragments that have at least one instance of the value (using the default fragment-frequency option to the value lexicon APIs) or you can get total counts of values in each item (using the item-frequency option). For details and examples, see the documentation for cts:frequency and for the value lexicon APIs in the MarkLogic XQuery and XSLT Function Reference.
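As a sketch of combining a value lexicon with cts:frequency, the following assumes a string element range index on an e:person element in the http://marklogic.com/entity namespace. The wrapper element value is only for display:

```xquery
xquery version "1.0-ml";
declare namespace e = "http://marklogic.com/entity";

(: Assumption: a string element range index exists on e:person.
   "item-frequency" counts every occurrence; omit it to count fragments. :)
for $value in cts:element-values(xs:QName("e:person"), (), "item-frequency")
return <value count="{cts:frequency($value)}">{$value}</value>
```

Calling cts:frequency directly on the items returned by the lexicon function, as here, is what allows the count to be read from the lexicon lookup.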

Field Value Lexicons

A field value lexicon contains all of the unique values for the specified field. You create field value lexicons in the Admin Interface by creating a range index of a particular type (for example, xs:string) on the field for which you want a field value lexicon. The following are the APIs for field value lexicons:

  • cts:field-values
  • cts:field-value-match

When you have a value lexicon on a field, you can also use the cts:frequency API to get fast and accurate counts of how many times the value occurs. You can either get counts of the number of fragments that have at least one instance of the value (using the default fragment-frequency option to the value lexicon APIs) or you can get total counts of values in each item (using the item-frequency option). For details and examples, see the documentation for cts:frequency and for the value lexicon APIs in the MarkLogic XQuery and XSLT Function Reference.

Field value lexicons are useful in cases where something you want to treat as a discrete value does not occur in a single element or attribute. For example, consider the following XML structure:

<name>
  <first>Raymond</first>
  <middle>Clevie</middle>
  <last>Carver</last>
</name>

If you want to normalize names in the form firstname lastname, then you can create a field on this structure. The field might include the element name and exclude the element middle. The value of this instance of the field would then be Raymond Carver. If your document contained other name elements with the same structure, their values would be derived similarly. The range index for the field stores each unique instance of the field value.

For details on fields, see Fields Database Settings in the Administrator's Guide.
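Browsing such a field value lexicon could be sketched as follows. The field name fullname is hypothetical; the example assumes the field includes the name element, excludes middle, and has a string field range index:

```xquery
xquery version "1.0-ml";

(: Assumption: a field named "fullname" (hypothetical) that includes name,
   excludes middle, and has a string field range index. :)
for $value in cts:field-values("fullname", (), "limit=10")
return <name count="{cts:frequency($value)}">{$value}</name>
```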

Value Co-Occurrences Lexicons

Value co-occurrence lexicons find pairs of element or attribute values that occur in the same fragment. If you have positions enabled in your range indexes, you can also specify a maximum word distance (proximity=N option) within which the two values must occur in order to match as a co-occurring pair. The following APIs support these lexicons:

  • cts:element-value-co-occurrences
  • cts:element-attribute-value-co-occurrences
  • cts:value-co-occurrences

These APIs return XML structures containing the pairs of co-occurring values. You can use cts:frequency on the output of these functions to find the frequency (the counts) of each co-occurrence.

Additionally, you can get co-occurrences from geospatial lexicons, as described in Geospatial Lexicons.

Because the URI and collection lexicons are implemented as range indexes, you can specify a special QName for the document URI or collection URI lexicon to get values paired with their document URIs or collection URIs. The QNames are in the http://marklogic.com/xdmp namespace; the URI index has the local name document and the collection index has the local name collection, and both use the http://marklogic.com/collation/codepoint collation. You can then use these QNames (for example, xdmp:document and xdmp:collection, as xdmp is bound to that namespace by default in the 1.0-ml dialect) in cts:element-value-co-occurrences as one of the element QNames to find element value/document URI pairs or element value/collection URI pairs. Make sure to also specify the codepoint collation option for these QNames (for example, "collation-2=http://marklogic.com/collation/codepoint" if you are specifying one of these QNames as the second argument to cts:element-value-co-occurrences).
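A minimal sketch of pairing element values with document URIs follows. It assumes the URI lexicon is enabled and that a string element range index exists on an e:person element in the http://marklogic.com/entity namespace:

```xquery
xquery version "1.0-ml";
declare namespace e = "http://marklogic.com/entity";

(: Assumptions: the URI lexicon is enabled and a string element range index
   exists on e:person. The xdmp prefix is predeclared in the 1.0-ml dialect. :)
cts:element-value-co-occurrences(
  xs:QName("e:person"),
  xs:QName("xdmp:document"),
  (: the special QName is the second argument, so use collation-2 :)
  "collation-2=http://marklogic.com/collation/codepoint")
```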

Consider the following example with a document with the URI /george.xml that looks as follows:

<text>
  <e:person xmlns:e="http://marklogic.com/entity">George Washington</e:person> was the first President of the 
  <e:gpe xmlns:e="http://marklogic.com/entity">United States</e:gpe>.
  <e:person xmlns:e="http://marklogic.com/entity">Martha Washington</e:person> was his wife.  They lived at 
  <e:location xmlns:e="http://marklogic.com/entity">Mount Vernon</e:location>.
</text>

Before creating this document, create two string element range indexes: one for the e:person element and one for the e:location element, where e is bound to the namespace http://marklogic.com/entity.

Now you can run the following co-occurrence query to find all co-occurring people and locations:

xquery version "1.0-ml";

declare namespace e="http://marklogic.com/entity";
cts:element-value-co-occurrences(xs:QName("e:person"),
     xs:QName("e:location"))

This produces the following output:

<cts:co-occurrence xmlns:cts="http://marklogic.com/cts"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <cts:value xsi:type="xs:string">George Washington</cts:value>
  <cts:value xsi:type="xs:string">Mount Vernon</cts:value>
</cts:co-occurrence>
<cts:co-occurrence xmlns:cts="http://marklogic.com/cts"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <cts:value xsi:type="xs:string">Martha Washington</cts:value>
  <cts:value xsi:type="xs:string">Mount Vernon</cts:value>
</cts:co-occurrence>

To get the frequency of each co-occurring pair, counted either per item or per fragment (depending on whether you use the item-frequency option or the default fragment-frequency option), use cts:frequency on the lexicon lookup as follows:

xquery version "1.0-ml";
declare namespace e="http://marklogic.com/entity";
for $x in cts:element-value-co-occurrences(xs:QName("e:person"), 
               xs:QName("e:location"))
return cts:frequency($x)
(: 
   returns a frequency of 1 for each pair if /george.xml 
   is the only document in the database 
:)

Geospatial Lexicons

The following APIs use geospatial point lexicons:

You must create the appropriate geospatial point index to use its corresponding geospatial lexicon. For example, to use cts:element-geospatial-values, you must first create a geospatial element point index. Use the Admin Interface (Databases > database_name > Geospatial Point Indexes) or the Admin API to create geospatial indexes for a database.
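For example, a lookup against a geospatial element point index might be sketched as below. The element name location is hypothetical, and the sketch assumes the index uses the same coordinate system as the query default:

```xquery
xquery version "1.0-ml";

(: Assumption: a geospatial element point index exists on a hypothetical
   no-namespace element named location. :)
cts:element-geospatial-values(xs:QName("location"), (), "limit=10")
```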

The *-boxes APIs return XML elements that show buckets of ranges, each bucket containing one or more cts:box values.

To learn more about geospatial features in MarkLogic, see Geospatial Search Applications.

Range Lexicons

The range lexicons return values divided into buckets. The ranges are ranges of values of the type of the lexicon. A range index is required on the element(s) or attribute(s) specified in the range lexicon. The following APIs support these lexicons:

  • cts:element-value-ranges
  • cts:element-attribute-value-ranges
  • cts:field-value-ranges
  • cts:value-ranges

Additionally, there are the following geospatial box lexicons to find ranges of geospatial values divided into buckets:

The range lexicons return a sequence of XML nodes, one node for each bucket. You can use cts:frequency on the result set to determine the number of items (or fragments) in the buckets. The "empties" option specifies that an XML node is returned for buckets that have no values (that is, for buckets with a frequency of zero). By default, empty buckets are not included in the result set. For details about all of the options to the range lexicons, see the MarkLogic XQuery and XSLT Function Reference.
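The bucket behavior can be sketched as follows, assuming an int element range index on a hypothetical element named number. The bounds (5, 10, 15, 20) are arbitrary illustration values, and the empties option asks for buckets with no values as well:

```xquery
xquery version "1.0-ml";

(: Assumption: an int element range index exists on a hypothetical element
   named number. The bounds split the values into buckets. :)
for $bucket in cts:element-value-ranges(
    xs:QName("number"), (5, 10, 15, 20), "empties")
(: cts:frequency reports how many fragments (by default) fall in the bucket :)
return <bucket count="{cts:frequency($bucket)}">{$bucket}</bucket>
```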

URI and Collection Lexicons

The URI and collection lexicons list, respectively, all of the document URIs and all of the collection URIs in a database. To enable or disable these lexicons, use the Database Configuration page in the Admin Interface. Use these lexicons to quickly search through all of the URIs in a database. The following APIs support these lexicons:

  • cts:uris
  • cts:uri-match
  • cts:collections
  • cts:collection-match
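A short sketch of browsing these lexicons follows. It assumes both the URI lexicon and the collection lexicon are enabled; the /plays/ path and the shake* pattern are placeholders:

```xquery
xquery version "1.0-ml";

(: Assumption: the URI and collection lexicons are enabled.
   The path and patterns below are placeholders. :)
(
  (: up to 10 document URIs, starting at "/plays/" :)
  cts:uris("/plays/", "limit=10"),
  (: document URIs that match a wildcard pattern :)
  cts:uri-match("/plays/*.xml"),
  (: all collection URIs :)
  cts:collections(),
  (: collection URIs that match a wildcard pattern :)
  cts:collection-match("shake*")
)
```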

Performing Lexicon-Based Queries

Lexicon queries return a sequence of words (or values in the case of value lexicons) from the appropriate lexicon. For string values, the words or values are returned in collation order, and the terms are case- and diacritic-sensitive. For other data types, the values are returned in ascending order by default, so lesser values are returned before greater values. This section lists the lexicon APIs and provides some examples and explanation of how to perform lexicon-based queries. It includes the following parts:

  • Lexicon APIs
  • Constraining Lexicon Searches to a cts:query Expression
  • Using the Match Lexicon APIs
  • Determining the Number of Fragments Containing a Lexicon Term

Lexicon APIs

Use the following Search Built-in XQuery APIs to perform lexicon-based queries:

In order to perform lexicon-based queries, the appropriate lexicon must be created. If the lexicon has not been created, the lexicon query will throw an exception.

The cts:*-words APIs return all of the words in the lexicon (or all of the words from a starting point if the optional $start parameter is used). The cts:*-match APIs return only words in the lexicon that match the wildcard pattern.

For details about the individual functions, see the Search APIs in the MarkLogic XQuery and XSLT Function Reference.

Constraining Lexicon Searches to a cts:query Expression

You can use the $query parameter of the lexicon APIs to constrain your lexicon lookups to fragments matching a particular cts:query expression. When you specify the $query parameter, the lexicon search returns all of the terms (or values for lexicon value queries) in the fragments that match the specified cts:query expression.

For example, the following is a query against a database of all of Shakespeare's plays fragmented at the SCENE level:

cts:words("et", (), "et tu")[1 to 10]

=> et ete even ever every eyes fais faith fall familiar

This query returns the first 10 words from the lexicon of words, starting with the word et, for all of the fragments that match the following query:

cts:word-query("et tu")

In the case of the Shakespeare database, there are 2 scenes that match this query, one from The Tragedy of Julius Caesar and one from The Life of Henry the Fifth. Note that this is a different set of words than if you omitted the $query parameter from the search. The following shows the query without the $query parameter. The results represent the first 10 words in the entire word lexicon for all of the Shakespeare plays, starting with the word et:

cts:words("et")[1 to 10]

=> et etc etceteras ete eternal eternally eterne eternity 
   eternized etes

Note that when you constrain a lexicon lookup to a cts:query expression, it returns the lexicon items for any fragment in which the cts:query expression returns true. No filtering is done to the cts:query expression to validate that the match actually occurs in the fragment. In some cases, depending on the index options you have set, it can return true in cases where there is no actual match. For example, if you do not have fast element word searches enabled in the database configuration, it is possible for a cts:element-word-query to match a fragment because both the word and the element exist in the fragment, but not in the same element. The filtering stage of cts:search resolves these discrepancies, but they are not resolved in lexicon APIs that use the $query option. For details about how this works, see Understanding the Search Process and Understanding Unfiltered Searches sections in the Query Performance and Tuning Guide.
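The same constraint applies to value lexicons. The sketch below assumes a string element range index on e:person (in the http://marklogic.com/entity namespace) and documents whose root element is text. It returns the person values from every fragment that the index resolution selects for the query, without a filtering pass:

```xquery
xquery version "1.0-ml";
declare namespace e = "http://marklogic.com/entity";

(: Assumption: a string element range index exists on e:person.
   The fourth argument is the constraining cts:query; it is resolved
   from the indexes only, so no filtering is applied. :)
cts:element-values(
  xs:QName("e:person"), (), (),
  cts:element-word-query(xs:QName("text"), "President"))
```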

Using the Match Lexicon APIs

Each type of lexicon (word, element word, element-attribute word, element value, and element-attribute value) has a function (cts:*-match) which allows you to use a wildcard pattern to constrain the lexicon entries returned; the cts:*-match APIs return only words or values in the lexicon that match the wildcard pattern. The following query finds all of the words in the lexicon that start with zou:

cts:word-match("zou*")

=> Zounds zounds

It returns both the uppercase and lowercase words that match because search defaults to case-insensitive when all of the letters in the base of the wildcard pattern are lowercase. If you want to match the pattern case-sensitive, diacritic-sensitive, or with some other option, add the appropriate option to the query. For example:

cts:word-match("zou*", "case-sensitive")

=> zounds

For details on the query options, see the MarkLogic XQuery and XSLT Function Reference. For details on wildcard searches, see Understanding and Using Wildcard Searches.
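The match functions for value lexicons work the same way, except that the pattern is compared against whole values rather than single words. As a sketch, assuming a string element range index on e:person in the http://marklogic.com/entity namespace:

```xquery
xquery version "1.0-ml";
declare namespace e = "http://marklogic.com/entity";

(: Assumption: a string element range index exists on e:person.
   The pattern is matched against the entire stored value. :)
cts:element-value-match(xs:QName("e:person"), "* Washington")
```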

Determining the Number of Fragments Containing a Lexicon Term

The lexicon contains the unique terms in a database. When you follow a query-constrained word lexicon lookup with estimates of how many fragments contain each term, you can minimize redundant disk I/O, and so resolve the query as efficiently as possible, by constructing the cts:word-query with the following characteristics:

  • Specify the unstemmed, case-sensitive, and diacritic-sensitive options.
  • Specify a weight of 0.

These characteristics ensure that the word being estimated is exactly the same as the word returned from the lexicon.

For example, if you want to figure out how many fragments contain a lexicon term, you can perform a query like the following:

<words>{
for $word in cts:words("aardvark", (), 
     cts:directory-query("/", "infinity"))[1 to 1000]
  let $count := xdmp:estimate(cts:search(fn:doc(),
        cts:word-query($word,("unstemmed","case-sensitive",
                                  "diacritic-sensitive"),0)))
return <word text="{$word}" count="{$count}"/> }
</words>

This query returns one word element per lexicon term, along with the matching term and counts of the number of fragments that have the term, under the specified directory (/), starting with the term aardvark. Sample output from this query follows:

<words>
  <word text="aardvark" count="10"/>
  <word text="aardvarks" count="10"/>
  <word text="aardwolf" count="5"/>
...
</words>