MarkLogic 9 Product Documentation
Search Developer's Guide
— Chapter 19

Custom Dictionaries for Tokenizing and Stemming

Custom dictionaries customize the way words are stemmed and tokenized for each language. This chapter describes custom dictionaries, their XML format, the functions that manage them, and examples of managing and exercising a dictionary in XQuery and JavaScript.

Custom Dictionaries in MarkLogic Server

One way you can customize stemming and/or tokenization in MarkLogic is by defining a custom stemming or tokenization dictionary for a language. A given language can have at most one custom stemming dictionary and one custom tokenization dictionary. Some languages, such as Japanese and Chinese, use a single dictionary for both stemming and tokenization.

Stemming is the process of reducing a word to one or more stems. A stemming dictionary maps a word to its lemma (stem). A stemmer can use a stemming dictionary to improve the precision of a search. For example, the default stemming dictionary for English enables MarkLogic to map the words views, viewed, and viewing back to their common stem, view. To learn more about stemming, see Understanding and Using Stemmed Searches.

Tokenization is the process of partitioning text into a sequence of word, whitespace, and punctuation tokens. A tokenization dictionary identifies runs of text that should be considered words. A tokenizer can use this data to model text and split it into tokens of the appropriate types.

The following list contains some use cases for creating a custom dictionary:

  • For languages that do not tokenize based on whitespace, such as Japanese (ja), Simplified Chinese (zh), and Traditional Chinese (zh_Hant), you can change the tokenizer's behavior with a custom tokenization dictionary.
  • Dictionaries for languages that tokenize based on whitespace and punctuation map inflections of words to their dictionary form, such as viewing mapping to view in English. The same is true of Japanese. You can use a custom dictionary to modify which words map to which stems.
  • You can handle spelling variation and technical vocabulary, for words like aluminum and aluminium. With a dictionary entry that maps both spellings to the same stem, the two spellings are effectively the same for anything in the server based on stemming.
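
The spelling-variation use case can be sketched in plain JavaScript as a set of entries that all share one stem. The helper name variantEntries and the {word, stem} object shape are assumptions made for illustration and are not part of the custom dictionary module; the objects would still need to be expressed in the XML format described in Custom Dictionary Format before installation.

```javascript
'use strict';

// Hypothetical helper (not a MarkLogic API): map every spelling in
// `spellings` to the single stem `stem`, so all variants share a stem.
function variantEntries(stem, spellings) {
  // De-duplicate while preserving order; a Set keeps first occurrences.
  const unique = [...new Set(spellings)];
  return unique.map((word) => ({ word, stem }));
}

// Both spellings end up with the stem "aluminum".
const entries = variantEntries('aluminum', ['aluminum', 'aluminium']);
console.log(entries);
```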

Custom dictionaries are validated when you install them so that errors do not occur every time you use the dictionary. Duplicate entries are not detected; such entries are unnecessary but do not cause errors. Validation does not detect non-Latin characters in a dictionary for a Latin-based language such as English.
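
Because install-time validation does not flag duplicates or non-Latin characters, you might choose to check for them yourself before installing. The following is a minimal sketch in plain JavaScript, assuming the entries are available as {word, stem} objects; findDictionaryIssues is a hypothetical helper, not a function of the custom dictionary module.

```javascript
'use strict';

// Hypothetical pre-install check (not part of the cdict module).
// `entries` is an array of {word, stem} objects.
function findDictionaryIssues(entries) {
  // Matches any character outside the Latin script. Common and Inherited
  // are allowed so digits, punctuation, and combining marks pass.
  const nonLatin =
    /[^\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]/u;
  const seen = new Set();
  const issues = [];
  for (const { word, stem } of entries) {
    const key = `${word}\u0000${stem}`;   // NUL separates word from stem
    if (seen.has(key)) {
      issues.push(`duplicate entry: ${word} -> ${stem}`);
    }
    seen.add(key);
    if (nonLatin.test(word) || nonLatin.test(stem)) {
      issues.push(`non-Latin character in entry: ${word} -> ${stem}`);
    }
  }
  return issues;
}

console.log(findDictionaryIssues([
  { word: 'Furbies', stem: 'Furby' },
  { word: 'Furbies', stem: 'Furby' },            // duplicate
  { word: 'servl\u0435ts', stem: 'servlet' },    // U+0435 is Cyrillic
]));
```

The check treats an entry as a duplicate only when both word and stem repeat; whether a repeated word with a different stem should also be reported is a design choice left open here.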

When you configure a dictionary for a language, you are also associating the dictionary with the lexer (for a tokenization dictionary) or stemmer (for a stemming dictionary) configured for that language. Each lexer or stemmer plugin has its own tokenization or stemming rules, so the modifications to those rules implied by a custom dictionary do not necessarily make sense for a different plugin.

If you change the lexer or stemmer configured for a language, you must reinstall the dictionary to update the lexer/stemmer-to-dictionary association.

You can create privileges to provide fine-grained control over who can manage the custom dictionary associated with a given stemmer or lexer plugin. For more details, see Custom Dictionary Security Considerations.

Custom dictionaries are stored in the data directory, so they survive MarkLogic Server upgrades.

If you change a custom dictionary, you should reindex so that existing content is stemmed and tokenized consistently with the updated dictionary.

Custom Dictionary Format

A custom dictionary can only be expressed as XML. A custom dictionary consists of a <dictionary/> root element with zero or more <entry/> child elements. Use the following structure for constructing a custom dictionary:

<dictionary xmlns="http://marklogic.com/xdmp/custom-dictionary">
  <entry>
    <word>wordToBeStemmed</word>
    <stem>theStem</stem>
    <pos>partOfSpeech</pos>
  </entry>
</dictionary>

The child elements of a dictionary entry have the following meaning:

Element | Description
word | Required. The word to be stemmed or identified as a token. The element value must not be empty.
stem | Required. The stem for the word specified in <word/>. The element value must not be empty. This value is not used in tokenization dictionaries.
pos | Optional. The part of speech classification of the word in <word/>. This is used primarily for languages without space-separated words, such as Chinese and Japanese. The element value must be one of the following values: Adj (adjective), Adv (adverb), Interj (interjection), Nn (noun), NN-Prop (proper noun), Verb (verb). If this element is not present, proper noun (NN-Prop) is assumed.

Stemming and tokenization dictionaries use the same format. For a tokenization dictionary, a dictionary entry effectively tells the tokenizer this is a word token.

Japanese ("ja"), Simplified Chinese ("zh"), and Traditional Chinese ("zh_Hant") use a linguistic tokenizer to divide text into tokens (words and punctuation). A custom dictionary affects the tokenizer for these languages. For Japanese, a custom dictionary also affects the stemmer. For all of these languages, a custom dictionary entry may have an optional cdict:pos element to give the part of speech for that word.
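
As an illustration of the format rules above, the following plain JavaScript sketch serializes entry objects into dictionary XML and rejects entries that break the stated constraints (empty word or stem, unsupported part of speech). The names buildDictionaryXml, escapeXml, and the {word, stem, pos} object shape are assumptions for this sketch; the part-of-speech values are copied from the element table above.

```javascript
'use strict';

const CDICT_NS = 'http://marklogic.com/xdmp/custom-dictionary';
// Part-of-speech values as listed in the element table above.
const POS_VALUES = ['Adj', 'Adv', 'Interj', 'Nn', 'NN-Prop', 'Verb'];

// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: serialize {word, stem, pos?} objects to a
// dictionary XML string. Throws if an entry violates the format rules.
function buildDictionaryXml(entries) {
  const body = entries.map(({ word, stem, pos }) => {
    if (!word || !stem) {
      throw new Error('word and stem are required and must not be empty');
    }
    if (pos !== undefined && !POS_VALUES.includes(pos)) {
      throw new Error(`unsupported pos value: ${pos}`);
    }
    const posXml = pos === undefined ? '' : `<pos>${pos}</pos>`;
    return `<entry><word>${escapeXml(word)}</word>` +
           `<stem>${escapeXml(stem)}</stem>${posXml}</entry>`;
  });
  return `<dictionary xmlns="${CDICT_NS}">${body.join('')}</dictionary>`;
}

console.log(buildDictionaryXml([
  { word: 'Furbies', stem: 'Furby' },
  { word: 'views', stem: 'view', pos: 'Nn' },
]));
```

In MarkLogic, a string like this could be parsed with xdmp.unquote, as the JavaScript install example later in this chapter does, before being passed to cdict.dictionaryWrite.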

Custom Dictionary Function Summary

The custom dictionary interfaces are available to your application through the custom-dictionary XQuery library module. To use the functions in your own code, you must bring the module into scope, as shown below:

XQuery:

import module namespace cdict = 
   "http://marklogic.com/xdmp/custom-dictionary" at 
   "/MarkLogic/custom-dictionary.xqy";

Server-Side JavaScript:

const cdict = require('/MarkLogic/custom-dictionary');

The dictionary library module contains functions for performing the following tasks. For more details on each function, see the MarkLogic XQuery and XSLT Function Reference or JavaScript Reference Guide.

Task | XQuery | JavaScript
Insert or update a custom dictionary | cdict:dictionary-write | cdict.dictionaryWrite
Retrieve a custom dictionary | cdict:dictionary-read | cdict.dictionaryRead
Delete a custom dictionary | cdict:dictionary-delete | cdict.dictionaryDelete
Get a list of licensed languages | cdict:get-languages | cdict.getLanguages

Example: Managing a Custom Dictionary in XQuery

This section walks you through installing, updating, and deleting a custom dictionary using XQuery.

Install the Dictionary

The following example installs a custom stemming dictionary for English. The dictionary contains two entries: one that specifies that the stem of Furbies is Furby, and one that specifies that the stem of servlets is servlet.

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
  at "/MarkLogic/custom-dictionary.xqy";
  
let $dict :=
  <cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
    <cdict:entry>
      <cdict:word>Furbies</cdict:word>
      <cdict:stem>Furby</cdict:stem>
    </cdict:entry>
    <cdict:entry>
      <cdict:word>servlets</cdict:word>
      <cdict:stem>servlet</cdict:stem>
    </cdict:entry>
  </cdict:dictionary>
return cdict:dictionary-write("en", $dict)

Since no tokenization parameter is passed to the function, the dictionary is installed as a stemming-only dictionary by default.

Modify and Update the Dictionary

The following example reads back the dictionary created in Install the Dictionary, modifies it, and updates the installed dictionary. The dictionary is modified by removing the entry for servlets and adding an entry for meeting.

To update a dictionary, you must make a copy and apply your changes to the constructed copy. You cannot use operations such as xdmp:node-replace because you are modifying an in-memory element, not a node in the database.

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
  at "/MarkLogic/custom-dictionary.xqy";

let $current-dict := cdict:dictionary-read("en")
let $new-dict :=
  element {fn:node-name($current-dict)} {
    for $entry in $current-dict//*:entry return
      if ($entry/*:word eq "servlets") then ()
      else element {fn:node-name($entry)} {
        $entry/@*,
        $entry/*
      },
    <cdict:entry xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
      <cdict:word>meeting</cdict:word>
      <cdict:stem>meeting</cdict:stem>
    </cdict:entry>
}
return cdict:dictionary-write("en", $new-dict)

If you read back the updated dictionary with cdict:dictionary-read, you should see output similar to the following:

<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
  <cdict:entry>
    <cdict:word>Furbies</cdict:word>
    <cdict:stem>Furby</cdict:stem>
  </cdict:entry>
  <cdict:entry>
    <cdict:word>meeting</cdict:word>
    <cdict:stem>meeting</cdict:stem>
  </cdict:entry>
</cdict:dictionary>

Delete the Dictionary

The following example deletes the dictionary created in Install the Dictionary.

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
  at "/MarkLogic/custom-dictionary.xqy";

cdict:dictionary-delete("en")

Calling the function again (when there is no custom dictionary installed for English) has no effect.

Example: Managing a Custom Dictionary in JavaScript

This section walks you through installing, updating, and deleting a custom dictionary using Server-Side JavaScript.

Install the Dictionary

The following example installs a custom stemming dictionary for English. The dictionary contains two entries: one that specifies that the stem of Furbies is Furby, and one that specifies that the stem of servlets is servlet.

'use strict';
const cdict = require('/MarkLogic/custom-dictionary');

const dict = fn.head(xdmp.unquote(  
  '<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">' +
    '<cdict:entry>' +
      '<cdict:word>Furbies</cdict:word>' +
      '<cdict:stem>Furby</cdict:stem>' +
    '</cdict:entry>' +
    '<cdict:entry>' +
      '<cdict:word>servlets</cdict:word>' +
      '<cdict:stem>servlet</cdict:stem>' +
    '</cdict:entry>' +
  '</cdict:dictionary>'
)).root;
cdict.dictionaryWrite('en', dict);

Since no tokenization parameter is passed to the function, the dictionary is installed as a stemming-only dictionary by default.

Modify and Update the Dictionary

The following example reads back the dictionary created in Install the Dictionary, modifies it, and updates the installed dictionary. The dictionary is modified by removing the entry for servlets and adding an entry for meeting.

To update a dictionary, you must make a copy and apply your changes to the constructed copy. You cannot use operations such as xdmp.nodeReplace because you are modifying an in-memory element, not a node in the database.

Manipulating XML is much simpler in XQuery than in JavaScript, so you might find it easier to write dictionary data manipulation code using XQuery. The example below uses the NodeBuilder interface to create a modified copy of the dictionary in JavaScript. For an equivalent example in XQuery, see Modify and Update the Dictionary under Example: Managing a Custom Dictionary in XQuery.

'use strict';
const cdict = require('/MarkLogic/custom-dictionary');

const dict = cdict.dictionaryRead('en');
const builder = new NodeBuilder();

// start a new dictionary
builder.startElement(
  'dictionary',
  'http://marklogic.com/xdmp/custom-dictionary');

// Copy all the entry elems except the one for "servlets"
for (let entry of dict.xpath('//*:entry')) {
  if (fn.data(fn.head(entry.xpath('*:word'))) != 'servlets') {
    builder.startElement(entry.localName, entry.namespaceURI);
    const entryChildren = entry.childNodes;
    for (let i = 0; i < entryChildren.length; i++) {
      const child = entryChildren.item(i);
      builder.addElement(
        child.localName, child.textContent, child.namespaceURI);
    }
    builder.endElement();    // entry
  }
}

// Create a new entry for "meeting"
builder.startElement('entry', 'http://marklogic.com/xdmp/custom-dictionary');
builder.addElement('word', 'meeting', 'http://marklogic.com/xdmp/custom-dictionary');
builder.addElement('stem', 'meeting', 'http://marklogic.com/xdmp/custom-dictionary');
builder.endElement();    // entry

builder.endElement();    // dictionary

// Install the updated dictionary
cdict.dictionaryWrite('en', builder.toNode());

If you read back the updated dictionary with cdict.dictionaryRead, you should see output similar to the following:

<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
  <cdict:entry>
    <cdict:word>Furbies</cdict:word>
    <cdict:stem>Furby</cdict:stem>
  </cdict:entry>
  <cdict:entry>
    <cdict:word>meeting</cdict:word>
    <cdict:stem>meeting</cdict:stem>
  </cdict:entry>
</cdict:dictionary>
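
Stripped of the XML handling, the update above is a filter-and-append over the list of entries. The following plain JavaScript sketch shows that logic on ordinary objects; updateEntries and the {word, stem} shape are assumptions for illustration and are not part of the custom dictionary module or the NodeBuilder interface.

```javascript
'use strict';

// Hypothetical helper: return a modified copy of `entries` without
// mutating the input. Entries whose word is in `removeWords` are
// dropped, then copies of `additions` are appended.
function updateEntries(entries, removeWords, additions) {
  const remove = new Set(removeWords);
  const kept = entries
    .filter((entry) => !remove.has(entry.word))
    .map((entry) => ({ ...entry }));   // shallow copy of each kept entry
  return kept.concat(additions.map((entry) => ({ ...entry })));
}

const installed = [
  { word: 'Furbies', stem: 'Furby' },
  { word: 'servlets', stem: 'servlet' },
];
const updated = updateEntries(
  installed, ['servlets'], [{ word: 'meeting', stem: 'meeting' }]);
console.log(updated);
```

Working on a copy mirrors the constraint described earlier: the installed dictionary is replaced as a whole rather than edited in place.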

Delete the Dictionary

The following example deletes the dictionary created in Install the Dictionary.

'use strict';
const cdict = require('/MarkLogic/custom-dictionary');
cdict.dictionaryDelete('en');

Calling the function again (when there is no custom dictionary installed for English) has no effect.

Example: Exercising a Custom Dictionary

You can perform a simple test of a custom tokenization dictionary using the cts:tokenize XQuery function or the cts.tokenize JavaScript function. You can perform a simple test of a custom stemming dictionary using the cts:stem XQuery function or the cts.stem JavaScript function. You can also exercise your dictionary by performing a search against content in a configured language.

For example, suppose you have the following dictionary:

<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
  <cdict:entry>
    <cdict:word>servlets</cdict:word>
    <cdict:stem>servletti</cdict:stem>
  </cdict:entry>
</cdict:dictionary>

If you install this dictionary as a stemming dictionary for, say, French, then you can exercise it using the following code:

XQuery:

xquery version "1.0-ml";
cts:stem("servlets", "fr")

JavaScript:

'use strict';
cts.stem('servlets', 'fr')

The word servlets should stem to servletti.

If you install the same dictionary as a tokenization dictionary for French, then you can exercise it using the following code:

XQuery:

xquery version "1.0-ml";
cts:tokenize("aservletse", "fr")

JavaScript:

'use strict';
cts.tokenize('aservletse', 'fr')

The input should tokenize to three tokens: "a", "servlets", "e".
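
To build intuition for why a dictionary word splits the surrounding text, the following toy sketch in plain JavaScript breaks a run of letters around any dictionary word it finds, preferring the longest word at each position. This is not MarkLogic's tokenizer and makes no claim about how the lexer works internally; splitOnDictionaryWords is a hypothetical name used only for illustration.

```javascript
'use strict';

// Toy illustration only: this is NOT MarkLogic's tokenizer. It scans the
// text left to right and splits around any dictionary word, preferring
// the longest word when several match at the same position.
function splitOnDictionaryWords(text, words) {
  const byLength = [...words].sort((a, b) => b.length - a.length);
  const tokens = [];
  let pending = '';   // characters not yet matched by a dictionary word
  let i = 0;
  while (i < text.length) {
    const match = byLength.find(
      (w) => w.length > 0 && text.startsWith(w, i));
    if (match) {
      if (pending) tokens.push(pending);
      pending = '';
      tokens.push(match);
      i += match.length;
    } else {
      pending += text[i];
      i += 1;
    }
  }
  if (pending) tokens.push(pending);
  return tokens;
}

console.log(splitOnDictionaryWords('aservletse', ['servlets']));
// → [ 'a', 'servlets', 'e' ]
```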
