Search Developer's Guide — Chapter 23

Using the Thesaurus Functions

MarkLogic Server includes functions that enable applications to provide thesaurus capabilities. Thesaurus applications use thesaurus (synonym) documents to find words with similar meaning to the words entered by a user. A common example application expands a user search to include words with similar meaning to those entered in a search. For example, if the application uses a thesaurus document that lists car brands as synonyms for the word car, then a search for car might return results for Alfa Romeo, Ford, and Hyundai, as well as for the word car.

This chapter describes how to use the thesaurus functions.

The Thesaurus Module

The thesaurus functions are implemented in an XQuery module, which you can use from either XQuery or Server-Side JavaScript. The module is installed in the following file:

  • install_dir/Modules/MarkLogic/thesaurus.xqy

where install_dir is the directory in which MarkLogic Server is installed. The functions in the thesaurus module use the thsr: namespace prefix, which you must declare in your XQuery program (you can also bind the namespace to a prefix of your own choosing). To use any of the functions in XQuery, include the module and namespace declaration in the prolog of your XQuery program as follows:

import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

To use any of the functions in a JavaScript program, include a line similar to the following in your Server-Side JavaScript program:

const thsr = require("/MarkLogic/thesaurus");

Function Reference

The reference information for the thesaurus module functions is included in the MarkLogic XQuery and XSLT Function Reference and the MarkLogic Server-Side JavaScript Function Reference available through docs.marklogic.com.

Thesaurus Schema

Any thesaurus documents loaded into MarkLogic Server must conform to the thesaurus schema, installed into the following file:

  • install_dir/Config/thesaurus.xsd

where install_dir is the directory in which MarkLogic Server is installed.

Capitalization

Thesaurus documents and the thesaurus functions are case-sensitive. Therefore, a thesaurus term for Car is different from a thesaurus term for car and any lookups for these terms are case-sensitive.

If you want your applications to be case-insensitive (that is, if you want the term Car to return thesaurus entries for both Car and car), your application must handle the case of the terms you want to look up. There are several ways to handle case. For example, you can lowercase all the entries in your thesaurus documents and then lowercase each term before looking it up in the thesaurus. For an example of lowercasing terms in a thesaurus document, see Lowercasing Terms When Inserting a Thesaurus Document.
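
The lowercasing approach can be sketched in plain JavaScript, independent of MarkLogic. The in-memory entries array and the lookupIgnoreCase helper below are hypothetical stand-ins for a thesaurus document and a lookup call; they only illustrate normalizing both sides to the same case.

```javascript
// Hypothetical in-memory stand-in for thesaurus entries; not the MarkLogic API.
const entries = [
  { term: "Car", synonyms: ["automobile", "Ford"] },
  { term: "car", synonyms: ["auto"] },
  { term: "Weary", synonyms: ["tired"] }
];

// Step 1: lowercase every term once, when the thesaurus is prepared.
const normalized = entries.map(e => ({ ...e, term: e.term.toLowerCase() }));

// Step 2: lowercase the user's term before comparing.
function lookupIgnoreCase(list, userTerm) {
  const key = userTerm.toLowerCase();
  return list.filter(e => e.term === key);
}

const hits = lookupIgnoreCase(normalized, "CAR");
console.log(hits.length); // 2: the "Car" and "car" entries both match
```

Because the comparison itself stays case-sensitive, the same idea carries over to a thesaurus whose terms were lowercased at load time.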

Managing Thesaurus Documents

You can have any number of thesaurus documents in a database. You can also add to or modify any thesaurus documents that already exist. This section describes how to load and update thesaurus documents.

Loading Thesaurus Documents in XQuery

To use a thesaurus in a query, use the thsr:load function or the thsr:insert function to load a document as a thesaurus. For example, to load a thesaurus document with a URI /myThsrDocs/wordnet.xml, execute a query similar to the following:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:load("c:\thesaurus\wordnet.xml", "/myThsrDocs/wordnet.xml")

This XQuery adds all of the <entry> elements from the c:\thesaurus\wordnet.xml file to a thesaurus with the URI /myThsrDocs/wordnet.xml. If the document already exists, then it is overwritten with the new content from the specified file.

If you have a thesaurus document that is too large to fit into an in-memory list, you can split the thesaurus into multiple documents. If you do this, you must specify all of the thesaurus documents in the thesaurus APIs that take URIs as a parameter. Also, ensure that there are no duplicate entries between the different thesaurus documents.

Loading Thesaurus Documents in JavaScript

To use a thesaurus in a Server-Side JavaScript program, use the thsr.load function or the thsr.insert function to load a document as a thesaurus. For example, to load a thesaurus document with a URI /myThsrDocs/wordnet.xml, run a program similar to the following:

const thsr = require("/MarkLogic/thesaurus");
declareUpdate();

// Backslashes must be doubled inside a JavaScript string literal
thsr.load("c:\\thesaurus\\wordnet.xml", "/myThsrDocs/wordnet.xml");

This JavaScript program adds all of the <entry> elements from the c:\thesaurus\wordnet.xml file to a thesaurus with the URI /myThsrDocs/wordnet.xml. If the document already exists, then it is overwritten with the new content from the specified file.
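
In a JavaScript string literal, a backslash starts an escape sequence, so a Windows path needs doubled backslashes (or forward slashes) to survive intact. The following plain JavaScript snippet does not touch MarkLogic; it only shows what happens to the path string otherwise.

```javascript
// "\t" is a tab and "\w" is just "w", so the single-backslash form
// does not contain the intended path.
const unescaped = "c:\thesaurus\wordnet.xml";
const escaped = "c:\\thesaurus\\wordnet.xml";
// String.raw keeps backslashes exactly as written.
const raw = String.raw`c:\thesaurus\wordnet.xml`;

console.log(unescaped.includes("\t")); // true: the string now holds a tab
console.log(escaped === raw);          // true
```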

If you have a thesaurus document that is too large to fit into an in-memory list, you can split the thesaurus into multiple documents. If you do this, you must specify all of the thesaurus documents in the thesaurus APIs that take URIs as a parameter. Also, ensure that there are no duplicate entries between the different thesaurus documents.
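
One way to check the no-duplicates requirement before loading is to compare the entries of the split files. A rough sketch in plain JavaScript follows. The findDuplicateTerms helper, the URIs, and the sample data are hypothetical, and the sketch treats two entries as duplicates when they share a term and part of speech, which is an assumption rather than a rule stated by the thesaurus module.

```javascript
// Hypothetical helper: reports entries that appear in more than one
// thesaurus document. Entries are plain objects, not MarkLogic nodes.
function findDuplicateTerms(docs) {
  const seen = new Map(); // key -> URI of the first document holding it
  const duplicates = [];
  for (const [uri, entries] of Object.entries(docs)) {
    for (const e of entries) {
      // Assumption: an entry is identified by its term and part of speech.
      const key = `${e.term}\u0000${e.partOfSpeech}`;
      if (seen.has(key) && seen.get(key) !== uri) {
        duplicates.push({ term: e.term, first: seen.get(key), again: uri });
      } else if (!seen.has(key)) {
        seen.set(key, uri);
      }
    }
  }
  return duplicates;
}

const splitDocs = {
  "/myThsrDocs/wordnet-1.xml": [{ term: "car", partOfSpeech: "noun" }],
  "/myThsrDocs/wordnet-2.xml": [
    { term: "car", partOfSpeech: "noun" },
    { term: "weary", partOfSpeech: "adjective" }
  ]
};

const dups = findDuplicateTerms(splitDocs);
console.log(dups); // one duplicate: "car" appears in both documents
```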

Lowercasing Terms When Inserting a Thesaurus Document

You can use the thsr:insert function to transform a document before inserting it as a thesaurus document. The following example uses the xdmp:get function to load a document into memory, then walks through the in-memory document and constructs a new document in which each entry's term is lowercased.

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:insert("newThsr.xml",
  let $thsrMem := xdmp:get("C:\myFiles\thesaurus.xml")
  return
    <thesaurus xmlns="http://marklogic.com/xdmp/thesaurus">
    {
      (: //thsr:entry finds the entries whether xdmp:get returns the
         document node or the root element :)
      for $entry in $thsrMem//thsr:entry
      return
        (: Write out and lowercase the term, then write out all of
           the children of this entry except for the term, which was
           already written out and lowercased :)
        <thsr:entry>
          <thsr:term>{lower-case($entry/thsr:term)}</thsr:term>
          {$entry/*[not(self::thsr:term)]}
        </thsr:entry>
    }
    </thesaurus>
)

Loading the XML Version of the WordNet Thesaurus

You can download an XML version of the WordNet thesaurus from the MarkLogic Developer site (developer.marklogic.com/code/dictionaries). Once you download the thesaurus file, you can load it as a thesaurus document using the thsr:load XQuery function or the thsr.load JavaScript function.

Perform the following steps to download and load the WordNet Thesaurus:

  1. Go to the code section of developer.marklogic.com and find the following page:
    http://developer.marklogic.com/code/dictionaries
  2. Click the GitHub link.
  3. Navigate to the thesaurus document section and find the thesaurus.xml document.
  4. Save thesaurus.xml to a file (for example, c:\thesaurus\thesaurus.xml). Alternately, clone the GitHub repository.
  5. Load the thesaurus with an XQuery statement similar to the following:
    xquery version "1.0-ml";
    import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                                 at "/MarkLogic/thesaurus.xqy";
    
    thsr:load("c:\thesaurus\thesaurus.xml", "/myThsrDocs/wordnet.xml")
    Or you can load the thesaurus in JavaScript with a program similar to the following (backslashes must be doubled inside a JavaScript string literal):
    const thsr = require("/MarkLogic/thesaurus");
    declareUpdate();
    
    thsr.load("c:\\thesaurus\\thesaurus.xml", "/myThsrDocs/wordnet.xml");

This loads the thesaurus with a URI of /myThsrDocs/wordnet.xml. You can now use this URI with the thesaurus module functions.

Updating a Thesaurus Document

Use the following thesaurus functions to modify existing thesaurus documents:

XQuery Function        Server-Side JavaScript Function
thsr:set-entry         thsr.setEntry
thsr:add-synonym       thsr.addSynonym
thsr:remove-entry      thsr.removeEntry
thsr:remove-term       thsr.removeTerm
thsr:remove-synonym    thsr.removeSynonym

Additionally, the thsr:insert / thsr.insert function adds entries to an existing thesaurus document (as well as creates a new one if one does not exist at the specified URI).

The transactional unit in MarkLogic Server is a query; therefore, if you are performing multiple updates to the same thesaurus document, be sure to perform those updates as part of separate queries. In XQuery, you can place a semicolon between the update statements to start a new query (and therefore a new transaction). If you use a semicolon to start new queries that use thesaurus functions, each query must include the import statement in its prolog to resolve the thesaurus namespace.

Security Considerations With Thesaurus Documents

Thesaurus documents are stored in XML format in the database. Therefore, they can be queried just like any other document. Note the following about security and thesaurus documents:

  • By default, thesaurus documents are loaded into the following collections:
    • http://marklogic.com/xdmp/documents
    • http://marklogic.com/xdmp/thesaurus
  • Thesaurus documents are loaded with the default permissions of the user who loads them. Make sure users who load thesaurus documents have appropriate privileges; otherwise, the documents might not have the permissions needed for reading and updating. For more information, see Setting Document Permissions in the Loading Content Into MarkLogic Server Guide.
  • If you want to control access (read and/or write) to thesaurus documents beyond the default permissions with which the documents are loaded, call xdmp:document-set-permissions after the thsr:load operation.

Example Queries Using Thesaurus Management Functions

This section includes the following examples, in both XQuery and JavaScript:

Example: Adding a New Thesaurus Entry in XQuery

The following XQuery uses the thsr:set-entry function to add an entry for Car to the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:set-entry("/myThsrDocs/wordnet.xml", 
<entry xmlns="http://marklogic.com/xdmp/thesaurus">
    <term>Car</term>
    <part-of-speech>noun</part-of-speech>
    <synonym>
      <term>Ford</term>
      <part-of-speech>noun</part-of-speech>
    </synonym>
    <synonym>
      <term>automobile</term>
      <part-of-speech>noun</part-of-speech>
    </synonym>
    <synonym>
      <term>Fiat</term>
      <part-of-speech>noun</part-of-speech>
    </synonym>
</entry>)

If the /myThsrDocs/wordnet.xml thesaurus already has an identical entry, the thesaurus does not change. If the thesaurus has no entry for Car, or has an entry for Car that is not identical (that is, the nodes are not equivalent), the new entry is added to the end of the thesaurus document.

Example: Adding a New Thesaurus Entry in JavaScript

The JavaScript thsr.setEntry function allows you to use a JavaScript object to update your thesaurus documents. The following JavaScript uses the thsr.setEntry function to add an entry for Car to the thesaurus with URI /myThsrDocs/wordnet.xml:

const thsr = require("/MarkLogic/thesaurus");
declareUpdate();

thsr.setEntry("/myThsrDocs/wordnet.xml", 
{ 
      "term":"Car",
      "partOfSpeech":"noun",
      "synonyms":[
        {"term":"Ford",
         "partOfSpeech":"noun"
        },
        {"term":"automobile",
         "partOfSpeech":"noun"
        },
        {"term":"Fiat",
         "partOfSpeech":"noun"
        }
      ]
    });

If the /myThsrDocs/wordnet.xml thesaurus already has an identical entry, the thesaurus does not change. If the thesaurus has no entry for Car, or has an entry for Car that is not identical (that is, the nodes are not equivalent), the new entry is added to the end of the thesaurus document.
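
The behavior described above (no change for an identical entry, otherwise append at the end) can be modeled in plain JavaScript. This is only an illustration of the semantics: the setEntryModel function is hypothetical, operates on an array rather than a thesaurus document, and uses JSON serialization as a stand-in for node equivalence.

```javascript
// Hypothetical model of the described semantics; not the thsr.setEntry implementation.
function setEntryModel(thesaurus, entry) {
  const serialized = JSON.stringify(entry);
  // Stand-in for node equivalence: identical serialization (property order matters).
  const exists = thesaurus.some(e => JSON.stringify(e) === serialized);
  if (!exists) {
    thesaurus.push(entry); // new entries go at the end
  }
  return thesaurus;
}

const model = [
  { term: "Car", partOfSpeech: "noun", synonyms: [{ term: "Ford", partOfSpeech: "noun" }] }
];

// Identical entry: nothing changes.
setEntryModel(model, { term: "Car", partOfSpeech: "noun", synonyms: [{ term: "Ford", partOfSpeech: "noun" }] });
// Same term but different synonyms: a second Car entry is appended.
setEntryModel(model, { term: "Car", partOfSpeech: "noun", synonyms: [{ term: "Fiat", partOfSpeech: "noun" }] });

console.log(model.length); // 2
```

The model also shows how a thesaurus can end up with more than one entry for the same term, which is the situation the removal examples in this section deal with.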

Example: Removing a Thesaurus Entry

The following XQuery uses the thsr:remove-entry function to remove the second entry for Car from the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:remove-entry("/myThsrDocs/wordnet.xml", 
                  thsr:lookup("/myThsrDocs/wordnet.xml","Car")[2])

The following JavaScript program does the same thing:

const thsr = require("/MarkLogic/thesaurus");
declareUpdate();

thsr.removeEntry("/myThsrDocs/wordnet.xml", 
    // index 1 selects the second entry, because JavaScript arrays are zero-based
    thsr.lookup("/myThsrDocs/wordnet.xml","Car").toObject()[1])

This removes the second Car entry from the /myThsrDocs/wordnet.xml thesaurus document.

Example: Removing Term(s) from a Thesaurus in XQuery

The following XQuery uses the thsr:remove-term function to remove all entries for the term Car from the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:remove-term("/myThsrDocs/wordnet.xml", "Car")

This removes all of the Car terms from the /myThsrDocs/wordnet.xml thesaurus document. If you only have a single term for Car in the thesaurus, the thsr:remove-term function does the same as the thsr:remove-entry function.

Example: Removing Term(s) from a Thesaurus in JavaScript

The following JavaScript program uses the thsr.removeTerm function to remove all entries for the term Car from the thesaurus with URI /myThsrDocs/wordnet.xml:

const thsr = require("/MarkLogic/thesaurus");
declareUpdate();

thsr.removeTerm("/myThsrDocs/wordnet.xml", "Car")

This removes all of the Car terms from the /myThsrDocs/wordnet.xml thesaurus document. If you only have a single term for Car in the thesaurus, the thsr.removeTerm function does the same as the thsr.removeEntry function.
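
The difference between removing one entry and removing a term can be illustrated with a small plain JavaScript model. The functions below are hypothetical and work on an array of objects instead of a thesaurus document.

```javascript
// Hypothetical models; they mimic the described behavior, not the module's code.
function removeEntryModel(thesaurus, entry) {
  // Removes only the one entry passed in.
  return thesaurus.filter(e => e !== entry);
}

function removeTermModel(thesaurus, term) {
  // Removes every entry whose term matches (case-sensitive).
  return thesaurus.filter(e => e.term !== term);
}

const carA = { term: "Car", synonyms: ["Ford"] };
const carB = { term: "Car", synonyms: ["Fiat"] };
const lower = { term: "car", synonyms: ["auto"] };
const thesaurusModel = [carA, carB, lower];

const afterEntry = removeEntryModel(thesaurusModel, carB);
const afterTerm = removeTermModel(thesaurusModel, "Car");

console.log(afterEntry.map(e => e.term)); // [ 'Car', 'car' ]
console.log(afterTerm.map(e => e.term));  // [ 'car' ]
```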

Example: Adding a Synonym to a Thesaurus Entry in XQuery

The following XQuery adds the synonym Alfa Romeo to the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:add-synonym(thsr:lookup("/myThsrDocs/wordnet.xml", "car"), 
                 <thsr:synonym>
                    <thsr:term>Alfa Romeo</thsr:term>
                 </thsr:synonym>)

This query assumes that the lookup for the car thesaurus entry returns a single entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to add the synonym to the first car entry in the thesaurus, specify the first argument as follows:

thsr:lookup("/myThsrDocs/wordnet.xml", "car")[1]

Example: Adding a Synonym to a Thesaurus Entry in JavaScript

The following JavaScript program adds the synonym Alfa Romeo to the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:

const thsr = require("/MarkLogic/thesaurus");
declareUpdate();

thsr.addSynonym(
  thsr.lookup("/myThsrDocs/wordnet.xml", "car",
     // requires the "elements" option because addSynonym takes an
     // element, not a JSON object
     "elements"), 
    {"synonym":{
       "term": "Alfa Romeo"}
    })

This assumes that the lookup for the car thesaurus entry returns a single entry. Notice also that the lookup must specify "elements" because thsr.addSynonym requires an element entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to add the synonym to the first car entry in the thesaurus, specify the first argument as follows:

fn.subsequence(
  thsr.lookup("/myThsrDocs/wordnet.xml", "car", "elements"), 1, 1)

Example: Removing a Synonym From a Thesaurus in XQuery

The following XQuery removes the synonym Fiat from the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
                             at "/MarkLogic/thesaurus.xqy";

thsr:remove-synonym(thsr:lookup("/myThsrDocs/wordnet.xml", "car"), 
                 <thsr:synonym>
                    <thsr:term>Fiat</thsr:term>
                 </thsr:synonym>)

This query assumes that the lookup for the car thesaurus entry returns a single entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to remove the synonym from the first car entry in the thesaurus, specify the first argument as follows:

thsr:lookup("/myThsrDocs/wordnet.xml", "car")[1]

Example: Removing a Synonym From a Thesaurus in JavaScript

The following JavaScript program removes the synonym Fiat from the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:

const thsr = require("/MarkLogic/thesaurus");
declareUpdate();

thsr.removeSynonym(thsr.lookup("/myThsrDocs/wordnet.xml", "car",
   "elements"), 
   {"term": "Fiat"});

This program assumes that the lookup for the car thesaurus entry returns a single entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to remove the synonym from the first car entry in the thesaurus, specify the first argument as follows:

fn.subsequence(
  thsr.lookup("/myThsrDocs/wordnet.xml", "car", "elements"), 1, 1)

Expanding Searches Using a Thesaurus in XQuery

You can expand a search to include terms from a thesaurus as well as the terms entered in the search. Consider the following XQuery statement:

xquery version "1.0-ml";
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" 
   at "/MarkLogic/thesaurus.xqy";

cts:search(
  doc("/Docs/hamlet.xml")//LINE,
  thsr:expand(
    cts:word-query("weary"), 
    thsr:lookup("/myThsrDocs/thesaurus.xml", "weary"),
    (), 
    (), 
    () )
)

This query finds all of the lines in Shakespeare's Hamlet that have the word weary or any of the synonyms of the word weary.

Thesaurus entries can have many synonyms, though. Therefore, when you expand a search, you might want to create a user interface in the application which provides a form allowing a user to specify the desired synonyms from the list returned by thsr:expand. Once the user chooses which synonyms to include in the search, the application can add those terms to the search and submit it to the database.

Expanding Searches Using a Thesaurus in JavaScript

You can expand a search to include terms from a thesaurus as well as the terms entered in the search. Consider the following JavaScript program:

const thsr = require("/MarkLogic/thesaurus");

let res = [];
for (const x of cts.doc("/shakespeare/plays/hamlet.xml").xpath("//LINE")) {
  if (cts.contains(x,
        thsr.expand(
          cts.wordQuery("weary"),
          thsr.lookup("/myThsrDocs/thesaurus.xml", "weary"),
          null, null, null))) {
    res.push(x);
  }
}
res;

This returns an array containing all of the lines in Shakespeare's Hamlet that have the word weary or any of the synonyms of the word weary.

Thesaurus entries can have many synonyms, though. Therefore, when you expand a search, you might want to create a user interface in the application which provides a form allowing a user to specify the desired synonyms from the list returned by thsr.expand. Once the user chooses which synonyms to include in the search, the application can add those terms to the search and submit it to the database.
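
A minimal sketch of that selection step in plain JavaScript follows. The shape of lookupResult (entries with a synonyms array of objects that have a term property) mirrors the object form used with thsr.setEntry earlier in this chapter, but it is an assumption here, as is the sample data. The chosenTerms array stands in for whatever the user picked in the form.

```javascript
// Assumed shape of lookup output; the sample data is illustrative only.
const lookupResult = [
  { term: "weary", partOfSpeech: "adjective",
    synonyms: [{ term: "tired" }, { term: "fatigued" }, { term: "bored" }] }
];

// Collect the distinct synonym terms to show in the form.
function listSynonyms(entries) {
  const terms = entries.flatMap(e => (e.synonyms || []).map(s => s.term));
  return [...new Set(terms)];
}

// Keep the original term plus only the synonyms the user selected.
function buildSearchTerms(original, offered, chosen) {
  const allowed = new Set(offered);
  return [original, ...chosen.filter(t => allowed.has(t))];
}

const offered = listSynonyms(lookupResult);
const chosenTerms = ["tired", "exhausted"]; // "exhausted" was never offered
const searchTerms = buildSearchTerms("weary", offered, chosenTerms);

console.log(searchTerms); // [ 'weary', 'tired' ]
```

Filtering the user's choices against the offered list keeps the search limited to terms that actually came from the thesaurus. In a Server-Side JavaScript program, a list like this could then be used to build the word query that is submitted to the database.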
