MarkLogic Server includes functions that enable applications to provide thesaurus capabilities. Thesaurus applications use thesaurus (synonym) documents to find words with similar meaning to the words entered by a user. A common example application expands a user search to include words with similar meaning to those entered in a search. For example, if the application uses a thesaurus document that lists car brands as synonyms for the word car, then a search for car might return results for Alfa Romeo, Ford, and Hyundai, as well as for the word car.
This chapter describes how to use the thesaurus functions.
The thesaurus functions are implemented in an XQuery module, which you can use from either XQuery or Server-Side JavaScript. The thesaurus functions are installed into the following XQuery module file:
where install_dir is the directory in which MarkLogic Server is installed. The functions in the thesaurus module use the thsr: namespace prefix, which you must specify in your XQuery program (or specify your own namespace). To use any of the functions in XQuery, include the module and namespace declaration in the prolog of your XQuery program as follows:
import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy";
To use any of the functions in a JavaScript program, include a line similar to the following in your Server-Side JavaScript program:
const thsr = require("/MarkLogic/thesaurus");
The reference information for the thesaurus module functions is included in the MarkLogic XQuery and XSLT Function Reference and the MarkLogic Server-Side JavaScript Function Reference available through docs.marklogic.com.
Any thesaurus documents loaded into MarkLogic Server must conform to the thesaurus schema, installed into the following file:
where install_dir is the directory in which MarkLogic Server is installed.
Thesaurus documents and the thesaurus functions are case-sensitive. Therefore, a thesaurus term for Car is different from a thesaurus term for car and any lookups for these terms are case-sensitive.
If you want your applications to be case-insensitive (that is, if you want the term Car to return thesaurus entries for both Car and car), your application must handle the case of the terms you want to look up. There are several ways to handle case. For example, you can lowercase all the entries in your thesaurus documents and then lowercase each term before looking it up in the thesaurus. For an example of lowercasing terms in a thesaurus document, see Lowercasing Terms When Inserting a Thesaurus Document.
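The lowercasing approach described above can be sketched in plain JavaScript. This is a minimal illustration that runs outside MarkLogic and does not call the thesaurus module: the `entries` array and the functions `lowercaseEntries`, `lookupExact`, and `lookupIgnoreCase` are hypothetical names invented for this sketch, and the entry shape simply borrows the `term` / `partOfSpeech` / `synonyms` properties used elsewhere in this chapter.

```javascript
// Hypothetical in-memory model of thesaurus entries (not the MarkLogic API).
const entries = [
  { term: "Car", partOfSpeech: "noun", synonyms: [{ term: "Ford" }, { term: "automobile" }] },
  { term: "car", partOfSpeech: "noun", synonyms: [{ term: "Fiat" }] },
];

// Lowercase every entry term once, before the entries are stored.
function lowercaseEntries(list) {
  return list.map((entry) => ({ ...entry, term: entry.term.toLowerCase() }));
}

// Case-sensitive lookup: only an exact match on the term is returned.
function lookupExact(list, term) {
  return list.filter((entry) => entry.term === term);
}

// Case-insensitive lookup: lowercase the user's term, then match exactly
// against entries that were lowercased when they were stored.
function lookupIgnoreCase(normalizedList, term) {
  return lookupExact(normalizedList, term.toLowerCase());
}

const normalized = lowercaseEntries(entries);
console.log(lookupExact(entries, "Car").length);         // 1
console.log(lookupIgnoreCase(normalized, "Car").length); // 2
```

The design point is that normalization happens on both sides: once when entries are written and again for each term that is looked up.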
You can have any number of thesaurus documents in a database. You can also add to or modify any thesaurus documents that already exist. This section describes how to load and update thesaurus documents.
To use a thesaurus in a query, use the thsr:load function or the thsr:insert function to load a document as a thesaurus. For example, to load a thesaurus document with a URI /myThsrDocs/wordnet.xml, execute a query similar to the following:
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; thsr:load("c:\thesaurus\wordnet.xml", "/myThsrDocs/wordnet.xml")
This XQuery adds all of the <entry> elements from the c:\thesaurus\wordnet.xml file to a thesaurus with the URI /myThsrDocs/wordnet.xml. If the document already exists, then it is overwritten with the new content from the specified file.
If you have a thesaurus document that is too large to fit into an in-memory list, you can split the thesaurus into multiple documents. If you do this, you must specify all of the thesaurus documents in the thesaurus APIs that take URIs as a parameter. Also, ensure that there are no duplicate entries between the different thesaurus documents.
To use a thesaurus in a Server-Side JavaScript program, use the thsr.load function or the thsr.insert function to load a document as a thesaurus. For example, to load a thesaurus document with a URI /myThsrDocs/wordnet.xml, run a program similar to the following:
const thsr = require("/MarkLogic/thesaurus"); declareUpdate(); /* Backslashes must be escaped in JavaScript string literals. */ thsr.load("c:\\thesaurus\\wordnet.xml", "/myThsrDocs/wordnet.xml");
This JavaScript program adds all of the <entry> elements from the c:\thesaurus\wordnet.xml file to a thesaurus with the URI /myThsrDocs/wordnet.xml. If the document already exists, then it is overwritten with the new content from the specified file.
If you have a thesaurus document that is too large to fit into an in-memory list, you can split the thesaurus into multiple documents. If you do this, you must specify all of the thesaurus documents in the thesaurus APIs that take URIs as a parameter. Also, ensure that there are no duplicate entries between the different thesaurus documents.
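When a thesaurus is split across several documents, a check for duplicate entries might look like the following plain JavaScript sketch. It runs outside MarkLogic on an in-memory model. The URIs, the `thesaurusDocs` object, and the `findDuplicateEntries` function are hypothetical, and treating two entries as duplicates when they share a term and part of speech is an assumption made for illustration, not a rule stated by the thesaurus module.

```javascript
// Hypothetical model: each key is a thesaurus document URI and each value
// is the list of entries stored in that document.
const thesaurusDocs = {
  "/myThsrDocs/wordnet-a.xml": [
    { term: "car", partOfSpeech: "noun" },
    { term: "cat", partOfSpeech: "noun" },
  ],
  "/myThsrDocs/wordnet-b.xml": [
    { term: "weary", partOfSpeech: "adj" },
    { term: "car", partOfSpeech: "noun" },
  ],
};

// Report every term + part-of-speech pair that appears in more than one
// document (assumed definition of "duplicate" for this sketch).
function findDuplicateEntries(docs) {
  const firstSeenIn = new Map(); // key -> URI where the pair first appeared
  const duplicates = [];
  for (const [uri, entries] of Object.entries(docs)) {
    for (const entry of entries) {
      const key = `${entry.term}|${entry.partOfSpeech}`;
      if (!firstSeenIn.has(key)) {
        firstSeenIn.set(key, uri);
      } else if (firstSeenIn.get(key) !== uri) {
        duplicates.push({
          term: entry.term,
          partOfSpeech: entry.partOfSpeech,
          uris: [firstSeenIn.get(key), uri],
        });
      }
    }
  }
  return duplicates;
}

console.log(findDuplicateEntries(thesaurusDocs));
```

In this sample data the only pair reported is the noun car, which appears in both documents.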
You can use the thsr:insert function to perform transformation on a document before inserting it as a thesaurus document. The following example shows how you can use the xdmp:get function to load a document into memory, then walk through the in-memory document and construct a new document which has lowercase terms.
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; thsr:insert("newThsr.xml", let $thsrMem := xdmp:get("C:\myFiles\thesaurus.xml") return <thesaurus xmlns="http://marklogic.com/xdmp/thesaurus"> { for $entry in $thsrMem/thsr:entry return (: Write out and lowercase the term, then write out all of the children of this entry except for the term, which was already written out and lowercased :) <thsr:entry> <thsr:term>{lower-case($entry/thsr:term)}</thsr:term> {$entry/*[. ne $entry/thsr:term]} </thsr:entry> } </thesaurus> )
You can download an XML version of the WordNet thesaurus from the MarkLogic Developer site (developer.marklogic.com/code/dictionaries). Once you download the thesaurus file, you can load it as a thesaurus document using the thsr:load XQuery function or the thsr.load JavaScript function.
Perform the following steps to download and load the WordNet Thesaurus:
1. Go to http://developer.marklogic.com/code/dictionaries.
2. Locate the thesaurus.xml document.
3. Save thesaurus.xml to a file (for example, c:\thesaurus\thesaurus.xml). Alternately, clone the GitHub repository.
4. Load the file as a thesaurus document. In XQuery:
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; thsr:load("c:\thesaurus\thesaurus.xml", "/myThsrDocs/wordnet.xml")
In Server-Side JavaScript (note the escaped backslashes in the file path):
const thsr = require("/MarkLogic/thesaurus"); declareUpdate(); thsr.load("c:\\thesaurus\\thesaurus.xml", "/myThsrDocs/wordnet.xml");
This loads the thesaurus with a URI of /myThsrDocs/wordnet.xml. You can now use this URI with the thesaurus module functions.
Use the following thesaurus functions to modify existing thesaurus documents:
XQuery Function | Server-Side JavaScript Function |
---|---|
thsr:set-entry | thsr.setEntry |
thsr:add-synonym | thsr.addSynonym |
thsr:remove-entry | thsr.removeEntry |
thsr:remove-term | thsr.removeTerm |
thsr:remove-synonym | thsr.removeSynonym |
Additionally, the thsr:insert / thsr.insert function adds entries to an existing thesaurus document, or creates a new thesaurus document if none exists at the specified URI.
The transactional unit in MarkLogic Server is a query; therefore, if you are performing multiple updates to the same thesaurus document, be sure to perform those updates as part of separate queries. In XQuery, you can place a semicolon between the update statements to start a new query (and therefore a new transaction). If you use a semicolon to start new queries that use thesaurus functions in XQuery, each query must include the import statement in its prolog to resolve the thesaurus namespace.
Thesaurus documents are stored in XML format in the database. Therefore, they can be queried just like any other document. Note the following about security and thesaurus documents:
xdmp:document-set-permissions after a thsr:load operation.
This section includes the following examples, in both XQuery and JavaScript:
The following XQuery uses the thsr:set-entry function to add an entry for Car to the thesaurus with URI /myThsrDocs/wordnet.xml:
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; thsr:set-entry("/myThsrDocs/wordnet.xml", <entry xmlns="http://marklogic.com/xdmp/thesaurus"> <term>Car</term> <part-of-speech>noun</part-of-speech> <synonym> <term>Ford</term> <part-of-speech>noun</part-of-speech> </synonym> <synonym> <term>automobile</term> <part-of-speech>noun</part-of-speech> </synonym> <synonym> <term>Fiat</term> <part-of-speech>noun</part-of-speech> </synonym> </entry>)
If the /myThsrDocs/wordnet.xml thesaurus has an identical entry, there will be no change to the thesaurus. If the thesaurus has no entry for Car, or has an entry for Car that is not identical (that is, where the nodes are not equivalent), the function adds the new entry. The new entry is added to the end of the thesaurus document.
The JavaScript thsr.setEntry function allows you to use a JavaScript object to update your thesaurus documents. The following JavaScript uses the thsr.setEntry function to add an entry for Car to the thesaurus with URI /myThsrDocs/wordnet.xml:
const thsr = require("/MarkLogic/thesaurus"); declareUpdate(); thsr.setEntry("/myThsrDocs/wordnet.xml", { "term":"Car", "partOfSpeech":"noun", "synonyms":[ {"term":"Ford", "partOfSpeech":"noun" }, {"term":"automobile", "partOfSpeech":"noun" }, {"term":"Fiat", "partOfSpeech":"noun" } ] });
If the /myThsrDocs/wordnet.xml thesaurus has an identical entry, there will be no change to the thesaurus. If the thesaurus has no entry for Car, or has an entry for Car that is not identical (that is, where the nodes are not equivalent), the function adds the new entry. The new entry is added to the end of the thesaurus document.
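The add-or-ignore behavior described above can be modeled in a few lines of plain JavaScript. This sketch runs outside MarkLogic and only illustrates the rule as stated in the text. `setEntryModel` and `sameEntry` are hypothetical helpers, the thesaurus is modeled as an array, and equality is approximated with `JSON.stringify`, which is an assumption that stands in for node equivalence.

```javascript
// Approximate "identical entry" by comparing serialized forms. This is
// sufficient here because the sketch builds entries with a fixed key order.
function sameEntry(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Model of the described rule: an identical entry changes nothing;
// any other entry is appended to the end of the thesaurus.
function setEntryModel(thesaurus, entry) {
  if (thesaurus.some((existing) => sameEntry(existing, entry))) {
    return thesaurus;
  }
  return [...thesaurus, entry];
}

const carEntry = {
  term: "Car",
  partOfSpeech: "noun",
  synonyms: [{ term: "Ford", partOfSpeech: "noun" }],
};
const otherCarEntry = {
  ...carEntry,
  synonyms: [{ term: "Fiat", partOfSpeech: "noun" }],
};

let model = [];
model = setEntryModel(model, carEntry);        // added
model = setEntryModel(model, { ...carEntry }); // identical, so no change
model = setEntryModel(model, otherCarEntry);   // same term, different content: appended
console.log(model.length); // 2
```

Note that the second Car entry does not replace the first; both remain, which is why a later lookup for Car can return more than one entry.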
The following XQuery uses the thsr:remove-entry function to remove the second entry for Car from the thesaurus with URI /myThsrDocs/wordnet.xml:
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; thsr:remove-entry("/myThsrDocs/wordnet.xml", thsr:lookup("/myThsrDocs/wordnet.xml","Car")[2])
Similarly, the following is a JavaScript example to do the same thing:
const thsr = require("/MarkLogic/thesaurus"); declareUpdate(); thsr.removeEntry("/myThsrDocs/wordnet.xml", thsr.lookup("/myThsrDocs/wordnet.xml","Car").toObject()[1])
This removes the second Car entry from the /myThsrDocs/wordnet.xml thesaurus document.
The following XQuery uses the thsr:remove-term function to remove all entries for the term Car from the thesaurus with URI /myThsrDocs/wordnet.xml:
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; thsr:remove-term("/myThsrDocs/wordnet.xml", "Car")
This removes all of the Car terms from the /myThsrDocs/wordnet.xml thesaurus document. If you only have a single term for Car in the thesaurus, the thsr:remove-term function does the same as the thsr:remove-entry function.
The following JavaScript program uses the thsr.removeTerm function to remove all entries for the term Car from the thesaurus with URI /myThsrDocs/wordnet.xml:
const thsr = require("/MarkLogic/thesaurus"); declareUpdate(); thsr.removeTerm("/myThsrDocs/wordnet.xml", "Car")
This removes all of the Car terms from the /myThsrDocs/wordnet.xml thesaurus document. If you only have a single term for Car in the thesaurus, the thsr.removeTerm function does the same as the thsr.removeEntry function.
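The difference between removing one entry and removing every entry for a term can be sketched on an in-memory array. This plain JavaScript runs outside MarkLogic; `removeEntryModel`, `removeTermModel`, and `lookupModel` are hypothetical stand-ins written for this illustration and are not the thesaurus module functions themselves.

```javascript
// Hypothetical thesaurus with two entries for "Car" and one for "car".
const firstCar = { term: "Car", partOfSpeech: "noun", synonyms: [{ term: "Ford" }] };
const secondCar = { term: "Car", partOfSpeech: "noun", synonyms: [{ term: "Fiat" }] };
const lowerCar = { term: "car", partOfSpeech: "noun", synonyms: [{ term: "automobile" }] };
const thesaurusModel = [firstCar, secondCar, lowerCar];

// Case-sensitive lookup returning every entry whose term matches exactly.
const lookupModel = (list, term) => list.filter((entry) => entry.term === term);

// Remove exactly one entry (the one passed in).
const removeEntryModel = (list, entry) => list.filter((candidate) => candidate !== entry);

// Remove every entry for the given term (case-sensitive).
const removeTermModel = (list, term) => list.filter((candidate) => candidate.term !== term);

// Removing the second "Car" entry leaves the first "Car" entry and "car".
const afterRemoveEntry = removeEntryModel(thesaurusModel, lookupModel(thesaurusModel, "Car")[1]);
console.log(afterRemoveEntry.map((entry) => entry.synonyms[0].term)); // [ 'Ford', 'automobile' ]

// Removing the term "Car" leaves only the lowercase "car" entry.
const afterRemoveTerm = removeTermModel(thesaurusModel, "Car");
console.log(afterRemoveTerm.map((entry) => entry.term)); // [ 'car' ]
```

With only one entry for a term, both operations leave the same result, which matches the equivalence noted in the text.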
The following XQuery adds the synonym Alfa Romeo to the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; thsr:add-synonym(thsr:lookup("/myThsrDocs/wordnet.xml", "car"), <thsr:synonym> <thsr:term>Alfa Romeo</thsr:term> </thsr:synonym>)
This query assumes that the lookup for the car thesaurus entry returns a single entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to add the synonym to the first car entry in the thesaurus, specify the first argument as follows:
thsr:lookup("/myThsrDocs/wordnet.xml", "car")[1]
The following JavaScript program adds the synonym Alfa Romeo to the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:
const thsr = require("/MarkLogic/thesaurus"); declareUpdate(); /* The lookup requires the "elements" option because addSynonym takes an element, not a JSON object. */ thsr.addSynonym(thsr.lookup("/myThsrDocs/wordnet.xml", "car", "elements"), {"synonym": {"term": "Alfa Romeo"}});
This assumes that the lookup for the car thesaurus entry returns a single entry. Notice that the lookup must specify "elements" because thsr.addSynonym requires an element entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to add the synonym to the first car entry in the thesaurus, specify the first argument as follows:
fn.subsequence(thsr.lookup("/myThsrDocs/wordnet.xml", "car", "elements"), 1, 1)
The following XQuery removes the synonym Fiat from the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; thsr:remove-synonym(thsr:lookup("/myThsrDocs/wordnet.xml", "car"), <thsr:synonym> <thsr:term>Fiat</thsr:term> </thsr:synonym>)
This query assumes that the lookup for the car thesaurus entry returns a single entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to remove the synonym from the first car entry in the thesaurus, specify the first argument as follows:
thsr:lookup("/myThsrDocs/wordnet.xml", "car")[1]
The following JavaScript program removes the synonym Fiat from the thesaurus entry for car in the thesaurus with URI /myThsrDocs/wordnet.xml:
const thsr = require("/MarkLogic/thesaurus"); declareUpdate(); thsr.removeSynonym(thsr.lookup("/myThsrDocs/wordnet.xml", "car", "elements"), {"term": "Fiat"});
This program assumes that the lookup for the car thesaurus entry returns a single entry. If the car lookup returns multiple entries, you must specify a single entry. For example, if you wanted to remove the synonym from the first car entry in the thesaurus, specify the first argument as follows:
fn.subsequence(thsr.lookup("/myThsrDocs/wordnet.xml", "car", "elements"), 1, 1)
You can expand a search to include terms from a thesaurus as well as the terms entered in the search. Consider the following XQuery statement:
xquery version "1.0-ml"; import module namespace thsr="http://marklogic.com/xdmp/thesaurus" at "/MarkLogic/thesaurus.xqy"; cts:search( doc("/Docs/hamlet.xml")//LINE, thsr:expand( cts:word-query("weary"), thsr:lookup("/myThsrDocs/thesaurus.xml", "weary"), (), (), () ) )
This query finds all of the lines in Shakespeare's Hamlet that have the word weary or any of the synonyms of the word weary.
A thesaurus entry can have many synonyms, however. Therefore, when you expand a search, you might want to create a user interface in the application which provides a form allowing a user to choose the desired synonyms from the entries returned by thsr:lookup. Once the user chooses which synonyms to include in the search, the application can add those terms to the search and submit it to the database.
You can expand a search to include terms from a thesaurus as well as the terms entered in the search. Consider the following JavaScript program:
const thsr = require("/MarkLogic/thesaurus"); let res = []; for (const x of cts.doc("/shakespeare/plays/hamlet.xml").xpath("//LINE")) { if (cts.contains(x, thsr.expand( cts.wordQuery("weary"), thsr.lookup("/myThsrDocs/thesaurus.xml", "weary"), null, null, null ))) { res.push(x) } }; res;
This returns an array containing all of the lines in Shakespeare's Hamlet that have the word weary or any of the synonyms of the word weary.
A thesaurus entry can have many synonyms, however. Therefore, when you expand a search, you might want to create a user interface in the application which provides a form allowing a user to choose the desired synonyms from the entries returned by thsr.lookup. Once the user chooses which synonyms to include in the search, the application can add those terms to the search and submit it to the database.
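The synonym-selection step can be sketched in plain JavaScript, independent of MarkLogic. Everything here is illustrative: the `lookedUp` entries and their synonyms are made-up sample data rather than real thesaurus content, and `candidateSynonyms` and `buildSearchTerms` are hypothetical helpers. The sketch only shows how an application might turn looked-up entries into a list of choices and then keep the terms the user picked.

```javascript
// Made-up sample data in the shape of looked-up thesaurus entries.
const lookedUp = [
  { term: "weary", partOfSpeech: "adj",
    synonyms: [{ term: "tired" }, { term: "exhausted" }, { term: "fatigued" }] },
  { term: "weary", partOfSpeech: "verb",
    synonyms: [{ term: "tire" }, { term: "exhausted" }] },
];

// Collect the distinct synonym terms across all entries, keeping first-seen order.
function candidateSynonyms(entries) {
  const seen = new Set();
  for (const entry of entries) {
    for (const synonym of entry.synonyms || []) {
      seen.add(synonym.term);
    }
  }
  return [...seen];
}

// Keep the original term plus only those candidates the user selected.
// Selections that are not real candidates are ignored.
function buildSearchTerms(originalTerm, candidates, userSelected) {
  const chosen = candidates.filter((term) => userSelected.includes(term));
  return [originalTerm, ...chosen];
}

const candidates = candidateSynonyms(lookedUp);
console.log(candidates); // [ 'tired', 'exhausted', 'fatigued', 'tire' ]
console.log(buildSearchTerms("weary", candidates, ["tired", "tire"])); // [ 'weary', 'tired', 'tire' ]
```

The resulting term list is what the application would then turn into its search; filtering against the candidate list keeps arbitrary user input out of the expanded query.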