Search Developer's Guide — Chapter 19

Custom Dictionaries for Tokenizing and Stemming

Custom dictionaries customize the way words are stemmed and tokenized for each language. This chapter describes custom dictionaries, their schema, the functions that manage them, and examples of their use.

Custom Dictionaries in MarkLogic Server

If you want to change the default stemming or tokenizing behavior for any MarkLogic supported and licensed language, you can create or modify a custom dictionary. The custom dictionary entries override the built-in stemming data or add new stemming data for MarkLogic to use. For example, assume 'meeting' is mapped to stem 'meet' in the built-in dictionary. But, due to calendar entries, you don't want 'meeting' to match 'meet'. You can prevent this by adding a custom dictionary entry for the word 'meeting' with 'meeting' as a stem.
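As a minimal sketch of the 'meeting' example, the following query constructs a one-entry English dictionary in the format described later in this chapter. It only builds the XML; it does not install anything, and a real dictionary would normally carry your other entries as well.

```xquery
xquery version "1.0-ml";
declare namespace cdict = "http://marklogic.com/xdmp/custom-dictionary";

(: A one-entry English dictionary that makes 'meeting' its own stem,
   so that it no longer stems to 'meet'. :)
<cdict:dictionary xml:lang="en">
  <cdict:entry>
    <cdict:word>meeting</cdict:word>
    <cdict:stem>meeting</cdict:stem>
  </cdict:entry>
</cdict:dictionary>
```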

Custom dictionaries have three uses:

  • Japanese (ja), Simplified Chinese (zh), and Traditional Chinese (zh_Hant) all use a linguistic tokenizer to divide text into tokens (words and punctuation), because these languages do not separate words with spaces. You can change the tokenizer's behavior by adding entries to a language's custom dictionary.
  • For all languages that tokenize based on spaces and punctuation, as well as Japanese, the dictionaries map inflections of words to their dictionary form. For example, the English dictionary maps 'views', 'viewed', and 'viewing' back to their common stem, 'view'. You can change, delete, or add dictionary entries to modify which words map to which stems. For more information on stemming and tokenizing, see Understanding and Using Stemmed Searches.
  • You can handle spelling variations and technical vocabulary, such as 'aluminum' and 'aluminium'. With a dictionary entry that gives both spellings the same stem, they are effectively the same word for anything in the server based on stemming.

There is only one custom dictionary for each MarkLogic licensed and supported language. To get the custom dictionary for a language, first get a list of all of the licensed languages for your system. By iterating over this list, you can get any of your custom dictionaries. Custom dictionaries are stored in the data directory, so they survive MarkLogic server upgrades.
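The iteration described above might look like the following sketch. It uses only the cdict functions documented in this chapter and assumes, as the usage examples do, that reading a language with no custom dictionary yields the empty sequence rather than an error.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

(: Return the custom dictionary of every licensed language that has one. :)
for $lang in cdict:get-languages()
return cdict:dictionary-read($lang)
```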

The custom dictionary format and API make use of namespaces, and dictionaries are validated when they are written to avoid runtime errors. However, MarkLogic does not check for duplicate entries, which are unnecessary but do not cause errors. Nor does it check for non-Latin characters in an English custom dictionary, although entries with such characters are invalid and are not processed. Fixing errors in a custom dictionary does not require a restart.

Caution: Reindexing is required if documents are affected by custom dictionary changes. To determine what, if any, documents are affected before changing the dictionary, do a word search (not a stemmed search) for any words that your changes will add or delete. For example, if you are adding 'viewer' to stem to 'view' and deleting the 'viewing' to 'view' stem, you would want to search for both 'viewer' and 'viewing', as these are the words which are affected by changing the dictionary. 'view' itself is not affected. Documents containing those words will need reindexing after the dictionary change.
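One way to run that check is sketched below, using the 'viewer' and 'viewing' words from the example. It assumes the database is configured so that unstemmed word queries can be resolved (for example, with word searches enabled), and it lists document URIs purely for illustration.

```xquery
xquery version "1.0-ml";

(: Words whose stemming the planned dictionary change will alter. :)
let $affected-words := ("viewer", "viewing")
(: "unstemmed" requests an exact word match rather than a stemmed one. :)
let $query := cts:word-query($affected-words, "unstemmed")
for $doc in cts:search(fn:doc(), $query)
return xdmp:node-uri($doc)
```

Any URI returned identifies a document that would need reindexing after the change.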

Dictionary and Entry Schemas

Custom dictionaries are stored as XML documents and use a dedicated namespace and a common entry format. Here is an example custom dictionary with a single entry:

<dictionary xmlns='http://marklogic.com/xdmp/custom-dictionary'
            xml:lang='en'>
  <entry>
    <word>servlets</word>
    <stem>servlet</stem>
    <pos>Nn</pos>
  </entry>
</dictionary>

Note that the root <dictionary> element has an xml:lang attribute specifying the dictionary's associated language, in this case English.

A dictionary can have any number of <entry> elements, each of which contains the following sub-elements:

  • <word>: A string containing the word to be stemmed.
  • <stem>: A string containing the stem for the <word> string.
  • <pos>: The part of speech. It is used only for languages without space-separated words, such as Chinese and Japanese; for other languages, it is ignored. If omitted, the default value is Nn-Prop (proper noun). The common values are:

    Value      Part of Speech
    Adj        Adjective
    Adv        Adverb
    Interj     Interjection
    Nn         Noun
    Nn-Prop    Proper Noun (default)
    Verb       Verb

Custom Dictionary Functions

To use the custom-dictionary.xqy module in your own XQuery modules, include the following line in your XQuery prolog:

import module namespace cdict = 
   "http://marklogic.com/xdmp/custom-dictionary" at 
   "/MarkLogic/custom-dictionary.xqy";

This imports the module and binds the module namespace to the custom dictionary module prefix cdict.

There are four custom dictionary functions, as shown below. Note that you cannot edit a custom dictionary in place. Instead, you first get the dictionary, then add, edit, or delete its entries, and finally overwrite the old version by writing out the new one.

For more details on these functions, see MarkLogic XQuery and XSLT Function Reference.

Get All Licensed Languages

cdict:get-languages()
  as xs:string*

Requires the http://marklogic.com/xdmp/privileges/custom-dictionary-user privilege. Returns the ISO language codes for all of your server's licensed languages. MarkLogic uses only the 2-letter ISO 639-1 codes, plus the zh_Hant variant of zh. For the language associated with each code, see the ISO 639-1 code list.

Get A Custom Dictionary

cdict:dictionary-read(
  $lang as xs:string
) as element(cdict:dictionary)

Requires the http://marklogic.com/xdmp/privileges/custom-dictionary-user privilege. $lang is an ISO language code. If $lang matches a licensed language with a custom dictionary, the local host returns that custom dictionary (which is the same across the cluster). The dictionary's xml:lang attribute is also returned, indicating its associated language. If $lang is not a licensed language, it raises an XDMP-LANG error.

Add/Write A Custom Dictionary

cdict:dictionary-write(
  $lang as xs:string,
  $dict as element(cdict:dictionary)
) as empty-sequence()

Requires the http://marklogic.com/xdmp/privileges/custom-dictionary-admin privilege. $lang is an ISO language code. $dict is the custom dictionary. If $lang matches a licensed language and $dict validates, the cluster installs $dict and the function returns the empty sequence. If $lang is not a licensed language, it raises an XDMP-LANG error. If validation fails, it raises validation errors. dictionary-write ignores the xml:lang attribute of $dict.

Delete A Custom Dictionary

cdict:dictionary-delete(
  $lang as xs:string
) as empty-sequence()

Requires the http://marklogic.com/xdmp/privileges/custom-dictionary-admin privilege. $lang is an ISO language code. If $lang matches a licensed language with a custom dictionary, the dictionary is deleted and the function returns the empty sequence. If $lang is not a licensed language, it raises an XDMP-LANG error.
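Because the read, write, and delete functions raise XDMP-LANG for an unlicensed language, a caller may want to trap that error. The sketch below is one way to do so; the "xx" code is a placeholder, and the comparison assumes that the error's error:code element carries the XDMP-LANG code, so verify that against the errors your server actually raises.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

(: "xx" is a placeholder for a code that is not licensed on this server. :)
let $lang := "xx"
return
  try {
    cdict:dictionary-read($lang)
  } catch ($e) {
    (: Assumption: the error code element holds "XDMP-LANG". :)
    if ($e/error:code eq "XDMP-LANG")
    then fn:concat("No licensed language for code: ", $lang)
    else xdmp:rethrow()
  }
```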

Usage Examples

The following code shows how to find a custom dictionary, get it, add and edit entries, write out the modified dictionary, and finally how to delete it. After each code sample is an example response.

First, get the sequence of supported and licensed languages:

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
                 at "/MarkLogic/custom-dictionary.xqy";

cdict:get-languages()

=> ("en", "ja", "zh", "zh_Hant")

Next, get the dictionary contents for a particular language, in this case English (en):

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
                at "/MarkLogic/custom-dictionary.xqy";

cdict:dictionary-read("en")

=> <cdict:dictionary 
        xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary"
        xml:lang="en">
     <cdict:entry>
        <cdict:word>Furbies</cdict:word>
        <cdict:stem>Furby</cdict:stem>
     </cdict:entry>
     <cdict:entry>
        <cdict:word>servlets</cdict:word>
        <cdict:stem>servlet</cdict:stem>
     </cdict:entry>
    </cdict:dictionary>

Put the contents in a file, for example /var/tmp/cdict-en.xml, then edit the file or modify it with XQuery to add new entries and/or delete or modify existing entries.
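If you prefer to create that file from a query, a sketch along the following lines may help. It writes to the same example path on the host that evaluates the query, assumes the user holds the privilege that xdmp:save requires, and skips the save when there is no English custom dictionary.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

let $dict := cdict:dictionary-read("en")
return
  if (fn:empty($dict))
  then "No English custom dictionary to save"
  (: Serialize the dictionary element to the example path. :)
  else xdmp:save("/var/tmp/cdict-en.xml", $dict)
```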

Next, install your modified dictionary in MarkLogic. The returned value is an empty sequence. Note that you do not have to specify where the dictionary goes, just what language to associate it with. The server knows where to put it. Since there can only be one custom dictionary for a language, this command overwrites any existing custom dictionary associated with the language argument.

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
                at "/MarkLogic/custom-dictionary.xqy";
let $dict := xdmp:document-get("/var/tmp/cdict-en.xml")/*
return 
  cdict:dictionary-write("en",$dict)
=> empty sequence

Finally, if you want to delete your English custom dictionary, you would do something similar to the following, which returns the empty sequence.

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
                at "/MarkLogic/custom-dictionary.xqy";
cdict:dictionary-delete("en")
=> empty sequence

To confirm the deletion, read the dictionary again:

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
                at "/MarkLogic/custom-dictionary.xqy";
cdict:dictionary-read("en")

The previous examples use several queries, each query performing its own transaction, to read, save, and modify a custom dictionary. The following example accomplishes something similar in a single transaction, where the XQuery program reads and then modifies a custom dictionary.

xquery version "1.0-ml";
(: Add an entry to the English custom dictionary :)
import module namespace cdict =
         "http://marklogic.com/xdmp/custom-dictionary" 
              at "/MarkLogic/custom-dictionary.xqy";

let $language := "en"
(: Get the English custom dictionary :)
let $dictionary := cdict:dictionary-read($language)
(: Specify the new entry element as XML (not as a string) :)
let $entry := <cdict:entry>
                <cdict:word>views</cdict:word>
                <cdict:stem>view</cdict:stem>
              </cdict:entry>
return
  (: First, check whether a custom dictionary already exists :)
  if (fn:empty($dictionary))
  (: If not, create a cdict:dictionary element that holds the
     new entry and write it out :)
  then cdict:dictionary-write($language,
         element cdict:dictionary {
           attribute xml:lang { $language },
           $entry })
  (: Otherwise, build a new dictionary element that copies the
     existing attributes and entries, append the new entry, and
     write it out :)
  else cdict:dictionary-write($language,
         element cdict:dictionary {
           $dictionary/@*,
           $dictionary/node(),
           $entry });

(: Finally, test the new mapping in a separate statement;
   this should return 'view' :)
xquery version "1.0-ml";
cts:stem('views', 'en')
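
Removing an entry can follow the same read, rebuild, and write pattern. The sketch below drops every entry for one word; the $word value is only an illustration. If the removed entry was the only one, deleting the dictionary with cdict:dictionary-delete may be a better fit than writing out an empty dictionary.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

let $language := "en"
(: Illustrative choice: the word whose custom entry should be dropped. :)
let $word := "views"
let $dictionary := cdict:dictionary-read($language)
return
  if (fn:empty($dictionary))
  then ()  (: no custom dictionary, so there is nothing to remove :)
  else cdict:dictionary-write($language,
         element cdict:dictionary {
           $dictionary/@*,
           (: keep every entry except those for $word :)
           $dictionary/cdict:entry[fn:not(cdict:word = $word)] })
```

As with any dictionary change, documents containing the affected word need reindexing afterward.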