Custom dictionaries customize the way words are stemmed and tokenized for each language. This chapter describes how to create, install, update, delete, and test custom dictionaries.
One way you can customize stemming and/or tokenization in MarkLogic is by defining a custom stemming or tokenization dictionary for a language. A given language can have at most one custom stemming dictionary and one custom tokenization dictionary. Some languages, such as Japanese and Chinese, use a single dictionary for both stemming and tokenization.
Stemming is the process of reducing a word to one or more stems. A stemming dictionary maps a word to its lemma (stem). A stemmer can use a stemming dictionary to improve the precision of a search. For example, the default stemming dictionary for English enables MarkLogic to map the words views, viewed, and viewing back to their common stem, view. To learn more about stemming, see Understanding and Using Stemmed Searches.
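As a conceptual illustration only, the mapping a stemming dictionary provides can be pictured as a lookup table. The sketch below is plain JavaScript, not MarkLogic code; the names `stemDictionary` and `stemOf` are hypothetical, and the real stemmer applies its own rules beyond a simple lookup.

```javascript
'use strict';

// Toy stemming dictionary: each word maps to its stem (lemma).
const stemDictionary = new Map([
  ['views', 'view'],
  ['viewed', 'view'],
  ['viewing', 'view']
]);

// Return the stem for a word, or the word itself when it has no entry.
function stemOf(word) {
  return stemDictionary.has(word) ? stemDictionary.get(word) : word;
}

// Two different surface forms match because they share a stem.
const queryTerm = 'viewing';
const documentWord = 'viewed';
const matches = stemOf(queryTerm) === stemOf(documentWord);
console.log(stemOf(queryTerm), stemOf(documentWord), matches);
```

A stemmed search compares stems rather than surface forms, which is why a query for one form of a word can match a document containing another.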
Tokenization is the process of partitioning text into a sequence of word, whitespace, and punctuation tokens. A tokenization dictionary identifies runs of text that should be considered words. A tokenizer can use this data to model text and split it into tokens of the appropriate types.
The following list contains some use cases for creating a custom dictionary:
- For languages that use a linguistic tokenizer, such as Japanese (ja), Simplified Chinese (zh), and Traditional Chinese (zh_Hant), you can change the tokenizer's behavior with a custom tokenization dictionary.

Custom dictionaries are validated when you install them so that errors do not occur every time you use the dictionary. Duplicate entries are not detected; such entries are unnecessary but do not cause errors. Validation does not detect non-Latin characters in a dictionary for a Latin-based language such as English.
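Because validation does not flag duplicate entries or non-Latin characters, you may want to check your entries yourself before installing a dictionary. The following is a minimal sketch in plain JavaScript, assuming the entries are held as an array of `{word, stem}` objects. The function name `lintEntries` and the entry shape are hypothetical and are not part of the MarkLogic API.

```javascript
'use strict';

// Hypothetical pre-install check (not a MarkLogic API): report exact
// duplicate words and words containing letters outside the Latin script.
function lintEntries(entries) {
  const seen = new Set();
  const duplicates = [];
  const nonLatin = [];
  for (const { word } of entries) {
    if (seen.has(word)) {
      duplicates.push(word);
    }
    seen.add(word);
    // \p{L} matches any letter; \p{Script=Latin} matches Latin-script letters.
    const hasNonLatinLetter = Array.from(word).some(
      (ch) => /\p{L}/u.test(ch) && !/\p{Script=Latin}/u.test(ch)
    );
    if (hasNonLatinLetter) {
      nonLatin.push(word);
    }
  }
  return { duplicates, nonLatin };
}

const report = lintEntries([
  { word: 'Furbies', stem: 'Furby' },
  { word: 'servlets', stem: 'servlet' },
  { word: 'Furbies', stem: 'Furby' },    // duplicate entry
  { word: 'сервлеты', stem: 'сервлет' }  // Cyrillic letters in an English dictionary
]);
console.log(report);
```

The check only looks at the `word` values; extend it to `stem` if your stems could also contain unexpected characters.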
When you configure a dictionary for a language, you are also associating the dictionary with the lexer (for a tokenization dictionary) or stemmer (for a stemming dictionary) configured for that language. Each lexer or stemmer plugin has its own tokenization or stemming rules, so the modifications to those rules implied by a custom dictionary do not necessarily make sense for a different plugin.
If you change the lexer or stemmer configured for a language, you must reinstall the dictionary to update the lexer/stemmer-to-dictionary association.
You can create privileges to provide fine-grained control over who can manage the custom dictionary associated with a given stemmer or lexer plugin. For more details, see Custom Dictionary Security Considerations.
Custom dictionaries are stored in the data directory, so they survive MarkLogic server upgrades.
A custom dictionary can only be expressed as XML. A custom dictionary consists of a <dictionary/> root element with zero or more <entry/> child elements. Use the following structure for constructing a custom dictionary:
    <dictionary xmlns="http://marklogic.com/xdmp/custom-dictionary">
      <entry>
        <word>wordToBeStemmed</word>
        <stem>theStem</stem>
        <pos>partOfSpeech</pos>
      </entry>
    </dictionary>
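The child elements of a dictionary entry have the following meaning:
Element | Meaning |
---|---|
word | The word to be stemmed or, in a tokenization dictionary, the run of text to treat as a word token. |
stem | The stem (lemma) of the word. |
pos | Optional. The part of speech of the word. Applies to Japanese, Simplified Chinese, and Traditional Chinese dictionaries. |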
Stemming and tokenization dictionaries use the same format. For a tokenization dictionary, a dictionary entry effectively tells the tokenizer this is a word token.
Japanese ("ja"), Simplified Chinese ("zh"), and Traditional Chinese ("zh_Hant") use a linguistic tokenizer to divide text into tokens (words and punctuation). A custom dictionary affects the tokenizer for these languages. For Japanese, a custom dictionary also affects the stemmer. For all of these languages, a custom dictionary entry may have an optional cdict:pos element to give the part of speech for that word.
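If you keep your entries in application data rather than in hand-written XML, you can generate the dictionary markup from them. The sketch below is plain JavaScript that builds the XML text for the structure described above. The helper names `escapeXml` and `buildDictionaryXml`, and the `{word, stem, pos}` object shape, are assumptions made for illustration and are not part of the MarkLogic API.

```javascript
'use strict';

const CDICT_NS = 'http://marklogic.com/xdmp/custom-dictionary';

// Escape the characters that are significant in XML text content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: turn [{word, stem, pos?}] into dictionary XML text.
// The pos element is emitted only when a value is supplied.
function buildDictionaryXml(entries) {
  const body = entries.map(({ word, stem, pos }) => {
    const parts = [
      `<word>${escapeXml(word)}</word>`,
      `<stem>${escapeXml(stem)}</stem>`
    ];
    if (pos !== undefined) {
      parts.push(`<pos>${escapeXml(pos)}</pos>`);
    }
    return `<entry>${parts.join('')}</entry>`;
  });
  return `<dictionary xmlns="${CDICT_NS}">${body.join('')}</dictionary>`;
}

const xml = buildDictionaryXml([
  { word: 'Furbies', stem: 'Furby' },
  { word: 'servlets', stem: 'servlet' }
]);
console.log(xml);
```

In Server-Side JavaScript, a string like this could then be parsed with xdmp.unquote and passed to cdict.dictionaryWrite, following the same pattern as the installation example in this chapter.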
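The custom dictionary interfaces are available to your application through the custom-dictionary XQuery library module. To use the functions in your own code, you must bring the module into scope, as shown below:
Language | Example |
---|---|
XQuery | import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" at "/MarkLogic/custom-dictionary.xqy"; |
JavaScript | const cdict = require('/MarkLogic/custom-dictionary'); |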
The dictionary library module contains functions for performing the following tasks. For more details on each function, see the MarkLogic XQuery and XSLT Function Reference or JavaScript Reference Guide.
Task | Function |
---|---|
Insert or update a custom dictionary | XQuery: cdict:dictionary-write JavaScript: cdict.dictionaryWrite |
Retrieve a custom dictionary | XQuery: cdict:dictionary-read JavaScript: cdict.dictionaryRead |
Delete a custom dictionary | XQuery: cdict:dictionary-delete JavaScript: cdict.dictionaryDelete |
Get a list of licensed languages | XQuery: cdict:get-languages JavaScript: cdict.getLanguages |
This section walks you through installing, updating, and deleting a custom dictionary using XQuery.
The following example installs a custom stemming dictionary for English. The dictionary contains two entries: one that specifies that the stem of Furbies is Furby, and one that specifies that the stem of servlets is servlet.
    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";

    let $dict :=
      <cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
        <cdict:entry>
          <cdict:word>Furbies</cdict:word>
          <cdict:stem>Furby</cdict:stem>
        </cdict:entry>
        <cdict:entry>
          <cdict:word>servlets</cdict:word>
          <cdict:stem>servlet</cdict:stem>
        </cdict:entry>
      </cdict:dictionary>
    return cdict:dictionary-write("en", $dict)
Since no tokenization parameter is passed to the function, the dictionary is installed as a stemming-only dictionary by default.
The following example reads back the dictionary created in Install the Dictionary, modifies it, and updates the installed dictionary. The dictionary is modified by removing the entry for servlets and adding an entry for meeting.
To update a dictionary, you must make a copy and apply your changes to the constructed copy. You cannot use operations such as xdmp:node-replace because you are modifying an in-memory element, not a node in the database.
    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";

    let $current-dict := cdict:dictionary-read("en")
    let $new-dict :=
      element {fn:node-name($current-dict)} {
        for $entry in $current-dict//*:entry
        return
          if ($entry/*:word eq "servlets")
          then ()
          else element {fn:node-name($entry)} { $entry/@*, $entry/* },
        <cdict:entry xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
          <cdict:word>meeting</cdict:word>
          <cdict:stem>meeting</cdict:stem>
        </cdict:entry>
      }
    return cdict:dictionary-write("en", $new-dict)
If you read back the updated dictionary with cdict:dictionary-read, you should see output similar to the following:
    <cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
      <cdict:entry>
        <cdict:word>Furbies</cdict:word>
        <cdict:stem>Furby</cdict:stem>
      </cdict:entry>
      <cdict:entry>
        <cdict:word>meeting</cdict:word>
        <cdict:stem>meeting</cdict:stem>
      </cdict:entry>
    </cdict:dictionary>
The following example deletes the dictionary created in Install the Dictionary.
    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";

    cdict:dictionary-delete("en")
Calling the function again (when there is no custom dictionary installed for English) has no effect.
This section walks you through installing, updating, and deleting a custom dictionary using Server-Side JavaScript.
The following example installs a custom stemming dictionary for English. The dictionary contains two entries: one that specifies that the stem of Furbies is Furby, and one that specifies that the stem of servlets is servlet.
    'use strict';
    const cdict = require('/MarkLogic/custom-dictionary');

    const dict = fn.head(xdmp.unquote(
      '<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">' +
        '<cdict:entry>' +
          '<cdict:word>Furbies</cdict:word>' +
          '<cdict:stem>Furby</cdict:stem>' +
        '</cdict:entry>' +
        '<cdict:entry>' +
          '<cdict:word>servlets</cdict:word>' +
          '<cdict:stem>servlet</cdict:stem>' +
        '</cdict:entry>' +
      '</cdict:dictionary>'
    )).root;

    cdict.dictionaryWrite('en', dict);
Since no tokenization parameter is passed to the function, the dictionary is installed as a stemming-only dictionary by default.
The following example reads back the dictionary created in Install the Dictionary, modifies it, and updates the installed dictionary. The dictionary is modified by removing the entry for servlets and adding an entry for meeting.
To update a dictionary, you must make a copy and apply your changes to the constructed copy. You cannot use operations such as xdmp.nodeReplace because you are modifying an in-memory element, not a node in the database.
Manipulating XML is much simpler in XQuery than in JavaScript, so you might find it easier to write dictionary data manipulation code using XQuery. The example below uses the NodeBuilder interface to create a modified copy of the dictionary in JavaScript. For an equivalent example in XQuery, see Modify and Update the Dictionary.
    'use strict';
    const cdict = require('/MarkLogic/custom-dictionary');

    const dict = cdict.dictionaryRead('en');
    const builder = new NodeBuilder();

    // Start a new dictionary
    builder.startElement(
      'dictionary', 'http://marklogic.com/xdmp/custom-dictionary');

    // Copy all the entry elems except the one for "servlets"
    for (let entry of dict.xpath('//*:entry')) {
      if (fn.data(fn.head(entry.xpath('*:word'))) != 'servlets') {
        builder.startElement(entry.localName, entry.namespaceURI);
        const entryChildren = entry.childNodes;
        // Declare the loop counter; an undeclared variable is an error in strict mode
        for (let i = 0; i < entryChildren.length; i++) {
          const child = entryChildren.item(i);
          builder.addElement(
            child.localName, child.textContent, child.namespaceURI);
        }
        builder.endElement(); // entry
      }
    }

    // Create a new entry for "meeting"
    builder.startElement('entry', 'http://marklogic.com/xdmp/custom-dictionary');
    builder.addElement('word', 'meeting', 'http://marklogic.com/xdmp/custom-dictionary');
    builder.addElement('stem', 'meeting', 'http://marklogic.com/xdmp/custom-dictionary');
    builder.endElement(); // entry
    builder.endElement(); // dictionary

    // Install the updated dictionary
    cdict.dictionaryWrite('en', builder.toNode());
If you read back the updated dictionary with cdict.dictionaryRead, you should see output similar to the following:
    <cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
      <cdict:entry>
        <cdict:word>Furbies</cdict:word>
        <cdict:stem>Furby</cdict:stem>
      </cdict:entry>
      <cdict:entry>
        <cdict:word>meeting</cdict:word>
        <cdict:stem>meeting</cdict:stem>
      </cdict:entry>
    </cdict:dictionary>
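The copy-then-modify pattern can also be kept separate from the XML handling. The sketch below is plain JavaScript and assumes the dictionary entries have already been extracted into an array of `{word, stem}` objects. The function name `updateEntries` and its options are hypothetical and are not part of the MarkLogic API.

```javascript
'use strict';

// Hypothetical helper: return a NEW entry list with some words removed and
// some entries added; the input array is left untouched.
function updateEntries(entries, { remove = [], add = [] } = {}) {
  const removed = new Set(remove);
  const kept = entries
    .filter((entry) => !removed.has(entry.word))
    .map((entry) => ({ ...entry })); // shallow copy of each kept entry
  return kept.concat(add.map((entry) => ({ ...entry })));
}

const current = [
  { word: 'Furbies', stem: 'Furby' },
  { word: 'servlets', stem: 'servlet' }
];

const updated = updateEntries(current, {
  remove: ['servlets'],
  add: [{ word: 'meeting', stem: 'meeting' }]
});
console.log(updated);
```

Working on a copy mirrors the requirement that the installed dictionary is replaced as a whole rather than edited in place; the updated list would still need to be converted back to dictionary XML before it is written.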
The following example deletes the dictionary created in Install the Dictionary.
    'use strict';
    const cdict = require('/MarkLogic/custom-dictionary');

    cdict.dictionaryDelete('en');
Calling the function again (when there is no custom dictionary installed for English) has no effect.
You can perform a simple test of a custom tokenization dictionary using the cts:tokenize XQuery function or the cts.tokenize JavaScript function. You can perform a simple test of a custom stemming dictionary using the cts:stem XQuery function or the cts.stem JavaScript function. You can also exercise your dictionary by performing a search against content in a configured language.
For example, suppose you have the following dictionary:
    <cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
      <cdict:entry>
        <cdict:word>servlets</cdict:word>
        <cdict:stem>servletti</cdict:stem>
      </cdict:entry>
    </cdict:dictionary>
If you install this dictionary as a stemming dictionary for, say, French, then you can exercise it using the following code:
Language | Example |
---|---|
XQuery | xquery version "1.0-ml"; cts:stem("servlets", "fr") |
JavaScript | 'use strict'; cts.stem('servlets', 'fr') |
The word servlets should stem to servletti.
If you install the same dictionary as a tokenization dictionary for French, then you can exercise it using the following code:
Language | Example |
---|---|
XQuery | xquery version "1.0-ml"; cts:tokenize("aservletse", "fr") |
JavaScript | 'use strict'; cts.tokenize('aservletse', 'fr') |
The input should tokenize to three tokens: "a", "servlets", "e".
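If you have more than a handful of entries, you might collect such checks into a small table-driven harness. The sketch below is plain JavaScript; `runChecks`, `sameStrings`, and the check-object shape are hypothetical. It takes the stemming and tokenizing functions as arguments, on the assumption that inside MarkLogic you would pass cts.stem and cts.tokenize and that their results can be iterated and converted to strings. Stubs stand in for them here so the sketch runs anywhere.

```javascript
'use strict';

// Compare two arrays of strings for equality.
function sameStrings(a, b) {
  return a.length === b.length && a.every((value, i) => value === b[i]);
}

// Hypothetical harness: `fns` supplies stem and tokenize functions.
// Each check names the function to call, its input, and the expected result.
function runChecks(fns, lang, checks) {
  const failures = [];
  for (const { kind, input, expected } of checks) {
    // Assumption: the result is iterable; each item is converted to a string.
    const actual = Array.from(fns[kind](input, lang), String);
    if (!sameStrings(actual, expected)) {
      failures.push({ kind, input, expected, actual });
    }
  }
  return failures;
}

// Stubs that mimic the results described above for the French dictionary.
const stubs = {
  stem: (word) => (word === 'servlets' ? ['servletti'] : [word]),
  tokenize: (text) => (text === 'aservletse' ? ['a', 'servlets', 'e'] : [text])
};

const failures = runChecks(stubs, 'fr', [
  { kind: 'stem', input: 'servlets', expected: ['servletti'] },
  { kind: 'tokenize', input: 'aservletse', expected: ['a', 'servlets', 'e'] }
]);
console.log(failures.length === 0 ? 'all checks passed' : failures);
```

The comparison is exact, so loosen it if a word can legitimately produce more than one stem.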