Search Developer's Guide — Chapter 22

Using the Thesaurus Functions

MarkLogic Server includes functions that enable applications to provide thesaurus capabilities. Thesaurus applications use thesaurus (synonym) documents to find words with similar meaning to the words entered by a user. A common example application expands a user search to include words with similar meaning to those entered in a search. For example, if the application uses a thesaurus document that lists car brands as synonyms for the word car, then a search for car might return results for Alfa Romeo, Ford, and Hyundai, as well as for the word car.

This chapter describes how to use the thesaurus functions.

The Thesaurus Module

The thesaurus functions are installed into the following XQuery module file:

  • install_dir/Modules/MarkLogic/thesaurus.xqy

where install_dir is the directory in which MarkLogic Server is installed. The functions in the thesaurus module use the thsr: namespace prefix, which you must declare in your XQuery program (you can also bind the namespace to a prefix of your own choosing). To use any of the functions, include the following module import, which also declares the namespace, in the prolog of your XQuery program:

import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

Function Reference

The reference information for the thesaurus module functions is included in the MarkLogic XQuery and XSLT Function Reference available through developer.marklogic.com.

Thesaurus Schema

Any thesaurus documents loaded into MarkLogic Server must conform to the thesaurus schema, installed into the following file:

  • install_dir/Config/thesaurus.xsd

where install_dir is the directory in which MarkLogic Server is installed.

Capitalization

Thesaurus documents and the thesaurus functions are case-sensitive. Therefore, a thesaurus term for Car is different from a thesaurus term for car and any lookups for these terms are case-sensitive.

If you want your applications to be case-insensitive (that is, if you want the term Car to return thesaurus entries for both Car and car), your application must handle the case of the terms you want to look up. There are several ways to handle case. For example, you can lowercase all the entries in your thesaurus documents and then lowercase each term before looking it up in the thesaurus. For an example of lowercasing terms in a thesaurus document, see Lowercasing Terms When Inserting a Thesaurus Document.
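The lookup side of this approach can be sketched as follows. This is only an illustration: it assumes a thesaurus at the URI /myThsrDocs/wordnet.xml whose terms are already stored in lowercase, and $userTerm is a hypothetical variable standing in for whatever the user typed.

```xquery
xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus"
                             at "/MarkLogic/thesaurus.xqy";

(: $userTerm is a hypothetical stand-in for a term entered by a user :)
let $userTerm := "Car"
return
  (: Lowercase the term so that it can match the lowercased
     entries assumed to be in the thesaurus document :)
  thsr:lookup("/myThsrDocs/wordnet.xml", fn:lower-case($userTerm))
```

With this pattern, Car, CAR, and car all resolve to the same lowercased entry, provided the thesaurus itself contains only lowercase terms.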

Managing Thesaurus Documents

You can have any number of thesaurus documents in a database. You can also add to or modify any thesaurus documents that already exist. This section describes how to load and update thesaurus documents.

Loading Thesaurus Documents

To use a thesaurus in a query, use the thsr:load function or the thsr:insert function to load a document as a thesaurus. For example, to load a thesaurus document with a URI /myThsrDocs/wordnet.xml, execute a query similar to the following:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:load("c:\thesaurus\wordnet.xml", "/myThsrDocs/wordnet.xml")

This XQuery adds all of the <entry> elements from the c:\thesaurus\wordnet.xml file to a thesaurus with the URI /myThsrDocs/wordnet.xml. If the document already exists, then it is overwritten with the new content from the specified file.

If you have a thesaurus document that is too large to fit into an in-memory list, you can split the thesaurus into multiple documents. If you do this, you must specify all of the thesaurus documents in the thesaurus APIs that take URIs as a parameter. Also, ensure that there are no duplicate entries between the different thesaurus documents.
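As a rough sketch of querying a split thesaurus, the following lookup passes a sequence of URIs as the first argument. The URIs /myThsrDocs/part1.xml and /myThsrDocs/part2.xml are hypothetical names for the pieces of one large thesaurus.

```xquery
xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus"
                             at "/MarkLogic/thesaurus.xqy";

(: part1.xml and part2.xml are hypothetical URIs for a thesaurus
   that was split across two documents; list every piece so that
   no entries are missed :)
thsr:lookup(("/myThsrDocs/part1.xml", "/myThsrDocs/part2.xml"), "car")
```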

Lowercasing Terms When Inserting a Thesaurus Document

You can use the thsr:insert function to perform a transformation on a document before inserting it as a thesaurus document. The following example shows how you can use the xdmp:get function to load a document into memory, then walk through the in-memory document and construct a new document in which the term of each entry is lowercased.

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:insert("newThsr.xml",
  let $thsrMem := xdmp:get("C:\myFiles\thesaurus.xml")
  return
    <thesaurus xmlns="http://marklogic.com/xdmp/thesaurus">
    {
      (: Use the descendant axis so the entries are found whether
         $thsrMem is a document node or the root thesaurus element :)
      for $entry in $thsrMem//thsr:entry
      return
        (: Write out and lowercase the term, then write out all of
           the children of this entry except for the term, which was
           already written out and lowercased :)
        <thsr:entry>
          <thsr:term>{lower-case($entry/thsr:term)}</thsr:term>
          {$entry/*[not(self::thsr:term)]}
        </thsr:entry>
    }
    </thesaurus>
)

Loading the XML Version of the WordNet Thesaurus

You can download an XML version of the WordNet thesaurus from the MarkLogic Developer site (developer.marklogic.com). Once you download the thesaurus file, you can load it as a thesaurus document using the thsr:load function.

Perform the following steps to download and load the WordNet Thesaurus:

  1. Go to the Workshop page of developer.marklogic.com:
    http://developer.marklogic.com/code/default.xqy
  2. Navigate to the thesaurus document section and find the thesaurus.xml document.
  3. Save thesaurus.xml to a file (for example, c:\thesaurus\thesaurus.xml).
  4. Load the thesaurus with a command similar to the following:
    xquery version "1.0-ml";
    import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                                 at "/MarkLogic/thesaurus.xqy";
    
    thsr:load("c:\thesaurus\thesaurus.xml", "/myThsrDocs/wordnet.xml")

This loads the thesaurus with a URI of /myThsrDocs/wordnet.xml. You can now use this URI with the thesaurus module functions.

Updating a Thesaurus Document

Use the following thesaurus functions to modify existing thesaurus documents:

  • thsr:set-entry
  • thsr:add-synonym
  • thsr:remove-entry
  • thsr:remove-term
  • thsr:remove-synonym

Additionally, the thsr:insert function adds entries to an existing thesaurus document, or creates a new one if none exists at the specified URI.

The transactional unit in MarkLogic Server is a query; therefore, if you are performing multiple updates to the same thesaurus document, be sure to perform those updates as part of separate queries. You can place a semicolon between the update statements to start a new query (and therefore a new transaction). If you use a semicolon to start a new query that uses thesaurus functions, each such query must include the import statement in its prolog to resolve the thesaurus namespace.
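The following sketch illustrates the semicolon pattern with two updates to the same thesaurus document. The entry and synonym values are illustrative only, and the second query assumes that the lookup for Car returns at least one entry once the first transaction has committed.

```xquery
xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus"
                             at "/MarkLogic/thesaurus.xqy";

(: First query (first transaction): add an entry for Car :)
thsr:set-entry("/myThsrDocs/wordnet.xml",
  <entry xmlns="http://marklogic.com/xdmp/thesaurus">
    <term>Car</term>
    <synonym>
      <term>automobile</term>
      <part-of-speech>noun</part-of-speech>
    </synonym>
  </entry>)
;

xquery version "1.0-ml";
(: The import is repeated because this is a new query with its own prolog :)
import module namespace thsr="http://marklogic.com/xdmp/thesaurus"
                             at "/MarkLogic/thesaurus.xqy";

(: Second query (second transaction): add a synonym to the first Car entry :)
thsr:add-synonym(thsr:lookup("/myThsrDocs/wordnet.xml", "Car")[1],
  <thsr:synonym>
    <thsr:term>Ford</thsr:term>
  </thsr:synonym>)
```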

Security Considerations With Thesaurus Documents

Thesaurus documents are stored in XML format in the database. Therefore, they can be queried just like any other document. Note the following about security and thesaurus documents:

  • By default, thesaurus documents are loaded into the following collections:
    • http://marklogic.com/xdmp/documents
    • http://marklogic.com/xdmp/thesaurus
  • Thesaurus documents are loaded with the default permissions of the user who loads them. Make sure users who load thesaurus documents have appropriate privileges; otherwise, the documents might not have the permissions needed for reading and updating. For more information, see Setting Document Permissions in the Loading Content Into MarkLogic Server Guide.
  • If you want to control access (read and/or write) to thesaurus documents beyond the default permissions with which the documents are loaded, call xdmp:document-set-permissions after a thsr:load operation.
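As a minimal sketch of the last point, the following loads a thesaurus and then sets its permissions in a second query. The role names thesaurus-reader and thesaurus-editor are hypothetical; substitute roles that exist in your security database.

```xquery
xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus"
                             at "/MarkLogic/thesaurus.xqy";

(: First query: load the thesaurus with the loading user's default permissions :)
thsr:load("c:\thesaurus\wordnet.xml", "/myThsrDocs/wordnet.xml")
;

xquery version "1.0-ml";
(: Second query: set the permissions on the loaded document.
   "thesaurus-reader" and "thesaurus-editor" are hypothetical role names. :)
xdmp:document-set-permissions("/myThsrDocs/wordnet.xml",
  (xdmp:permission("thesaurus-reader", "read"),
   xdmp:permission("thesaurus-editor", "update")))
```

The permission change runs as a separate query so that the document already exists when it executes. Note that xdmp:document-set-permissions sets the document's permissions to the list you pass, so include every permission the document should keep.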

Example Queries Using Thesaurus Management Functions

This section includes examples of adding and removing thesaurus entries, terms, and synonyms.

Example: Adding a New Thesaurus Entry

The following XQuery uses the thsr:set-entry function to add an entry for Car to the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:set-entry("/myThsrDocs/wordnet.xml", 
<entry xmlns="http://marklogic.com/xdmp/thesaurus">
    <term>Car</term>
    <synonym>
      <term>Ford</term>
      <part-of-speech>noun</part-of-speech>
    </synonym>
    <synonym>
      <term>automobile</term>
      <part-of-speech>noun</part-of-speech>
    </synonym>
    <synonym>
      <term>Fiat</term>
      <part-of-speech>noun</part-of-speech>
    </synonym>
</entry>)

If the /myThsrDocs/wordnet.xml thesaurus already has an identical entry, the thesaurus does not change. If the thesaurus has no entry for Car, or has an entry for Car that is not identical (that is, where the nodes are not equivalent), thsr:set-entry adds the new entry to the end of the thesaurus document.

Example: Removing a Thesaurus Entry

The following XQuery uses the thsr:remove-entry function to remove the second entry for Car from the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:remove-entry("/myThsrDocs/wordnet.xml", 
                  thsr:lookup("/myThsrDocs/wordnet.xml","Car")[2])

This removes the second entry for Car from the /myThsrDocs/wordnet.xml thesaurus document.

Example: Removing Term(s) from a Thesaurus

The following XQuery uses the thsr:remove-term function to remove all entries for the term Car from the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:remove-term("/myThsrDocs/wordnet.xml", "Car")

This removes all of the entries for the term Car from the /myThsrDocs/wordnet.xml thesaurus document. If the thesaurus has only a single entry for Car, the thsr:remove-term function has the same effect as the thsr:remove-entry function.

Example: Adding a Synonym to a Thesaurus Entry

The following XQuery adds the synonym Alfa Romeo to the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:add-synonym(thsr:lookup("/myThsrDocs/wordnet.xml", "car"), 
                 <thsr:synonym>
                    <thsr:term>Alfa Romeo</thsr:term>
                 </thsr:synonym>)

This query assumes that the lookup for the car thesaurus entry returns a single entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to add the synonym to the first car entry in the thesaurus, specify the first argument as follows:

thsr:lookup("/myThsrDocs/wordnet.xml", "car")[1]

Example: Removing a Synonym From a Thesaurus

The following XQuery removes the synonym Fiat from the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:remove-synonym(thsr:lookup("/myThsrDocs/wordnet.xml", "car"), 
                 <thsr:synonym>
                    <thsr:term>Fiat</thsr:term>
                 </thsr:synonym>)

This query assumes that the lookup for the car thesaurus entry returns a single entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to remove the synonym from the first car entry in the thesaurus, specify the first argument as follows:

thsr:lookup("/myThsrDocs/wordnet.xml", "car")[1]

Expanding Searches Using a Thesaurus

You can expand a search to include terms from a thesaurus as well as the terms entered in the search. Consider the following query:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy";

cts:search(
  doc("/Docs/hamlet.xml")//LINE,
  thsr:expand(
    cts:word-query("weary"),
    thsr:lookup("/myThsrDocs/thesaurus.xml", "weary"),
    (),
    (),
    ()
  )
)

This query finds all of the lines in Shakespeare's Hamlet that have the word weary or any of the synonyms of the word weary.

Thesaurus entries can have many synonyms, though. Therefore, when you expand a search, you might want to create a user interface in the application that provides a form in which a user can select the desired synonyms from those in the entries returned by thsr:lookup. Once the user chooses which synonyms to include in the search, the application can add those terms to the search and submit it to the database.
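One way such an application might work is sketched below. This is not the only approach: it builds the expanded query by hand with cts:or-query instead of calling thsr:expand, and the $chosen values are hypothetical stand-ins for whatever the user selected in the form.

```xquery
xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus"
                             at "/MarkLogic/thesaurus.xqy";

(: Step 1: collect the candidate synonyms to show in the form :)
let $entries := thsr:lookup("/myThsrDocs/thesaurus.xml", "weary")
let $candidates := fn:distinct-values($entries/thsr:synonym/thsr:term)

(: Step 2: $chosen is a hypothetical stand-in for the user's selection;
   keep only values that actually appear among the candidates :)
let $chosen := ("tired", "fatigued")[. = $candidates]

(: Step 3: search for the original word or any chosen synonym :)
return
  cts:search(
    doc("/Docs/hamlet.xml")//LINE,
    cts:or-query((
      cts:word-query("weary"),
      for $word in $chosen return cts:word-query($word)
    ))
  )
```

In a real application, steps 1 and 3 would usually run as separate requests, with the form submission between them supplying the chosen terms.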