Search Developer's Guide — Chapter 19

Custom Dictionaries for Tokenizing and Stemming

Custom dictionaries are used to customize the way words are stemmed and tokenized for each language. This chapter describes custom dictionaries and contains the following sections:

Custom Dictionaries in MarkLogic Server
Custom Dictionary Format
Custom Dictionary Function Summary
Example: Managing a Custom Dictionary in XQuery
Example: Managing a Custom Dictionary in JavaScript
Example: Exercising a Custom Dictionary

Custom Dictionaries in MarkLogic Server

One way you can customize stemming and/or tokenization in MarkLogic is by defining a custom stemming or tokenization dictionary for a language. A given language can have at most one custom stemming dictionary and one custom tokenization dictionary. Some languages, such as Japanese and Chinese, use a single dictionary for both stemming and tokenization.

Stemming is the process of reducing a word to one or more stems. A stemming dictionary maps a word to its lemma (stem). A stemmer can use a stemming dictionary to improve the precision of a search. For example, the default stemming dictionary for English enables MarkLogic to map the words views, viewed, and viewing back to their common stem, view. To learn more about stemming, see Understanding and Using Stemmed Searches.

Tokenization is the process of partitioning text into a sequence of word, whitespace, and punctuation tokens. A tokenization dictionary identifies runs of text that should be considered words. A tokenizer can use this data to model text and split it into tokens of the appropriate types.

The following list contains some use cases for creating a custom dictionary:

For languages that do not tokenize based on whitespace, such as Japanese (ja), Simplified Chinese (zh), and Traditional Chinese (zh_Hant), you can change the tokenizer's behavior with a custom tokenization dictionary.
Dictionaries for languages which tokenize based on whitespace and punctuation map inflections of words to their dictionary form, such as viewing mapping to view in English. The same is true of Japanese. You can use a custom dictionary to modify which words map to which stems.
You can handle spelling variation and technical vocabulary, such as aluminum and aluminium. With a dictionary entry that maps one spelling to the other, the two spellings are effectively the same for any server feature based on stemming.
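
The spelling-variation use case can be pictured with a small sketch. This is a toy model in plain JavaScript, not MarkLogic's stemmer: the names customStems, toyStem, and stemsMatch are hypothetical, and the single entry mapping aluminium to aluminum is an assumed example. It only shows why two words that share a stem behave as the same word in a stem-based comparison.

```javascript
'use strict';

// Toy model only: a custom stemming dictionary viewed as a word-to-stem map.
// This is NOT MarkLogic's stemmer; it only illustrates the idea.
const customStems = new Map([
  ['aluminium', 'aluminum'],  // assumed entry: map the variant spelling to one stem
]);

// Return the custom stem if the dictionary has one; otherwise fall back to
// the word itself (a real stemmer would apply its built-in rules here).
function toyStem(word) {
  const key = word.toLowerCase();
  return customStems.get(key) || key;
}

// Two words match in a stemmed comparison when their stems are equal.
function stemsMatch(a, b) {
  return toyStem(a) === toyStem(b);
}

console.log(stemsMatch('aluminum', 'aluminium'));  // true
console.log(stemsMatch('aluminum', 'copper'));     // false
```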

Custom dictionaries are validated when you install them so that errors do not occur every time you use the dictionary. Duplicate entries are not detected; such entries are unnecessary but do not cause errors. Validation does not detect non-Latin characters in a dictionary for a Latin-based language such as English.
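
Because duplicate entries are not detected at install time, you might check for them yourself before installing. The following is a minimal sketch in plain JavaScript, assuming the entries are held as plain objects with word and stem properties before being serialized; the helper name findDuplicateWords is hypothetical and is not part of the custom dictionary library.

```javascript
'use strict';

// Hypothetical helper: report words that appear in more than one entry.
// Each entry is assumed to be a plain object such as { word, stem }.
function findDuplicateWords(entries) {
  const seen = new Set();
  const duplicates = new Set();
  for (const entry of entries) {
    if (seen.has(entry.word)) {
      duplicates.add(entry.word);
    } else {
      seen.add(entry.word);
    }
  }
  return Array.from(duplicates);
}

const entries = [
  { word: 'Furbies', stem: 'Furby' },
  { word: 'servlets', stem: 'servlet' },
  { word: 'Furbies', stem: 'Furby' },
];
console.log(findDuplicateWords(entries));  // [ 'Furbies' ]
```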

When you configure a dictionary for a language, you are also associating the dictionary with the lexer (for a tokenization dictionary) or stemmer (for a stemming dictionary) configured for that language. Each lexer or stemmer plugin has its own tokenization or stemming rules, so the modifications to those rules implied by a custom dictionary do not necessarily make sense for a different plugin.

If you change the lexer or stemmer configured for a language, you must reinstall the dictionary to update the lexer/stemmer-to-dictionary association.

You can create privileges to provide fine-grained control over who can manage the custom dictionary associated with a given stemmer or lexer plugin. For more details, see Custom Dictionary Security Considerations.

Custom dictionaries are stored in the data directory, so they survive MarkLogic server upgrades.

You should reindex if you change a custom dictionary.

Custom Dictionary Format

A custom dictionary can only be expressed as XML. A custom dictionary consists of a <dictionary/> root element with zero or more <entry/> child elements. Use the following structure for constructing a custom dictionary:

<dictionary xmlns="http://marklogic.com/xdmp/custom-dictionary">
  <entry>
    <word>wordToBeStemmed</word>
    <stem>theStem</stem>
    <pos>partOfSpeech</pos>
  </entry>
</dictionary>

The child elements of a dictionary entry have the following meaning:

Element	Description
word	Required. The word to be stemmed or identified as a token. The element value must not be empty.
stem	Required. The stem for the word specified in `<word/>`. The element value must not be empty. This value is not used in tokenization dictionaries.
pos	Optional. The part of speech classification of the word in `<word/>`. This is used primarily for languages without space-separated words, such as Chinese and Japanese. The element value must be one of the following values: `Adj` (adjective), `Adv` (adverb), `Interj` (interjection), `Nn` (noun), `NN-Prop` (proper noun), `Verb` (verb). If this element is not present, proper noun (`NN-Prop`) is assumed.

Stemming and tokenization dictionaries use the same format. For a tokenization dictionary, a dictionary entry effectively tells the tokenizer this is a word token.

Japanese ("ja"), Simplified Chinese ("zh"), and Traditional Chinese ("zh_Hant") use a linguistic tokenizer to divide text into tokens (words and punctuation). A custom dictionary affects the tokenizer for these languages. For Japanese, a custom dictionary also affects the stemmer. For all of these languages, a custom dictionary entry may have an optional cdict:pos element to give the part of speech for that word.
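
The format rules above can be sketched as a small builder that produces the dictionary XML from a list of entries. This is an illustrative sketch in plain JavaScript, not part of the MarkLogic API: buildDictionaryXml, escapeXml, and the { word, stem, pos } object shape are hypothetical, and the list of part-of-speech values is copied as-is from the table above, so confirm it against the function reference before relying on it.

```javascript
'use strict';

const CDICT_NS = 'http://marklogic.com/xdmp/custom-dictionary';
// Part-of-speech values copied from the element table above (an assumption
// that the table is exact; verify before use).
const VALID_POS = ['Adj', 'Adv', 'Interj', 'Nn', 'NN-Prop', 'Verb'];

// Escape the characters that are not allowed in XML element content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Build a <dictionary> string, enforcing the rules from the table:
// word and stem are required and non-empty; pos is optional.
function buildDictionaryXml(entries) {
  const body = entries.map((entry) => {
    if (!entry.word) throw new Error('word is required and must not be empty');
    if (!entry.stem) throw new Error('stem is required and must not be empty');
    if (entry.pos !== undefined && !VALID_POS.includes(entry.pos)) {
      throw new Error('unsupported pos value: ' + entry.pos);
    }
    const pos = entry.pos !== undefined ? '<pos>' + entry.pos + '</pos>' : '';
    return '<entry><word>' + escapeXml(entry.word) + '</word>' +
           '<stem>' + escapeXml(entry.stem) + '</stem>' + pos + '</entry>';
  });
  return '<dictionary xmlns="' + CDICT_NS + '">' + body.join('') + '</dictionary>';
}

console.log(buildDictionaryXml([{ word: 'Furbies', stem: 'Furby' }]));
```

In Server-Side JavaScript, a string built this way could be parsed with xdmp.unquote and passed to cdict.dictionaryWrite, following the pattern shown in Example: Managing a Custom Dictionary in JavaScript.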

Custom Dictionary Function Summary

The custom dictionary interfaces are available to your application through the custom-dictionary XQuery library module. To use the functions in your own code, you must bring the module into scope, as shown below:

Language	Example
XQuery	import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" at "/MarkLogic/custom-dictionary.xqy";
Server-Side JavaScript	const cdict = require('/MarkLogic/custom-dictionary');

The dictionary library module contains functions for performing the following tasks. For more details on each function, see the MarkLogic XQuery and XSLT Function Reference or JavaScript Reference Guide.

Task	XQuery	JavaScript
Insert or update a custom dictionary	cdict:dictionary-write	cdict.dictionaryWrite
Retrieve a custom dictionary	cdict:dictionary-read	cdict.dictionaryRead
Delete a custom dictionary	cdict:dictionary-delete	cdict.dictionaryDelete
Get a list of licensed languages	cdict:get-languages	cdict.getLanguages

Example: Managing a Custom Dictionary in XQuery

This section walks you through installing, updating and deleting a custom dictionary using XQuery. See the following topics for details:

Install the Dictionary
Modify and Update the Dictionary
Delete the Dictionary

Install the Dictionary

The following example installs a custom stemming dictionary for English. The dictionary contains two entries: one that specifies the stem of Furbies is Furby, and one that specifies the stem of servlets is servlet.

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
  at "/MarkLogic/custom-dictionary.xqy";
  
let $dict :=
  <cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
    <cdict:entry>
      <cdict:word>Furbies</cdict:word>
      <cdict:stem>Furby</cdict:stem>
    </cdict:entry>
    <cdict:entry>
      <cdict:word>servlets</cdict:word>
      <cdict:stem>servlet</cdict:stem>
    </cdict:entry>
  </cdict:dictionary>
return cdict:dictionary-write("en", $dict)

Since no tokenization parameter is passed to the function, the dictionary is installed as a stemming-only dictionary by default.

Modify and Update the Dictionary

The following example reads back the dictionary created in Install the Dictionary, modifies it, and updates the installed dictionary. The dictionary is modified by removing the entry for servlets and adding an entry for meeting.

To update a dictionary, you must make a copy and apply your changes to the constructed copy. You cannot use operations such as xdmp:node-replace because you are modifying an in-memory element, not a node in the database.

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
  at "/MarkLogic/custom-dictionary.xqy";

let $current-dict := cdict:dictionary-read("en")
let $new-dict :=
  element {fn:node-name($current-dict)} {
    for $entry in $current-dict//*:entry return
      if ($entry/*:word eq "servlets") then ()
      else element {fn:node-name($entry)} {
        $entry/@*,
        $entry/*
      },
    <cdict:entry xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
      <cdict:word>meeting</cdict:word>
      <cdict:stem>meeting</cdict:stem>
    </cdict:entry>
}
return cdict:dictionary-write("en", $new-dict)

If you read back the updated dictionary with cdict:dictionary-read, you should see output similar to the following:

<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
  <cdict:entry>
    <cdict:word>Furbies</cdict:word>
    <cdict:stem>Furby</cdict:stem>
  </cdict:entry>
  <cdict:entry>
    <cdict:word>meeting</cdict:word>
    <cdict:stem>meeting</cdict:stem>
  </cdict:entry>
</cdict:dictionary>

Delete the Dictionary

The following example deletes the dictionary created in Install the Dictionary.

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
  at "/MarkLogic/custom-dictionary.xqy";

cdict:dictionary-delete("en")

Calling the function again (when there is no custom dictionary installed for English) has no effect.

Example: Managing a Custom Dictionary in JavaScript

This section walks you through installing, updating and deleting a custom dictionary using Server-Side JavaScript. See the following topics for details:

Install the Dictionary
Modify and Update the Dictionary
Delete the Dictionary

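Install the Dictionary

The following example installs a custom stemming dictionary for English. The dictionary contains two entries: one that specifies the stem of Furbies is Furby, and one that specifies the stem of servlets is servlet.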

'use strict';
const cdict = require('/MarkLogic/custom-dictionary');

const dict = fn.head(xdmp.unquote(  
  '<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">' +
    '<cdict:entry>' +
      '<cdict:word>Furbies</cdict:word>' +
      '<cdict:stem>Furby</cdict:stem>' +
    '</cdict:entry>' +
    '<cdict:entry>' +
      '<cdict:word>servlets</cdict:word>' +
      '<cdict:stem>servlet</cdict:stem>' +
    '</cdict:entry>' +
  '</cdict:dictionary>'
)).root;
cdict.dictionaryWrite('en', dict);

Since no tokenization parameter is passed to the function, the dictionary is installed as a stemming-only dictionary by default.

Modify and Update the Dictionary

To update a dictionary, you must make a copy and apply your changes to the constructed copy. You cannot use operations such as xdmp.nodeReplace because you are modifying an in-memory element, not a node in the database.

Manipulating XML is much simpler in XQuery than in JavaScript, so you might find it easier to write dictionary data manipulation code using XQuery. The example below uses the NodeBuilder interface to create a modified copy of the dictionary in JavaScript. For an equivalent example in XQuery, see Modify and Update the Dictionary.

'use strict';
const cdict = require('/MarkLogic/custom-dictionary');

const dict = cdict.dictionaryRead('en');
const builder = new NodeBuilder();

// start a new dictionary
builder.startElement(
  'dictionary',
  'http://marklogic.com/xdmp/custom-dictionary');

// Copy all the entry elems except the one for "servlets"
for (let entry of dict.xpath('//*:entry')) {
  if (fn.data(fn.head(entry.xpath('*:word'))) != 'servlets') {
    builder.startElement(entry.localName, entry.namespaceURI);
    const entryChildren = entry.childNodes;
    for (let i = 0; i < entryChildren.length; i++) {
      const child = entryChildren.item(i);
      builder.addElement(
        child.localName, child.textContent, child.namespaceURI);
    }
    builder.endElement();    // entry
  }
}

// Create a new entry for "meeting"
builder.startElement('entry', 'http://marklogic.com/xdmp/custom-dictionary');
builder.addElement('word', 'meeting', 'http://marklogic.com/xdmp/custom-dictionary');
builder.addElement('stem', 'meeting', 'http://marklogic.com/xdmp/custom-dictionary');
builder.endElement();    // entry

builder.endElement();    // dictionary

// Install the updated dictionary
cdict.dictionaryWrite('en', builder.toNode());

If you read back the updated dictionary with cdict.dictionaryRead, you should see output similar to the following:

<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
  <cdict:entry>
    <cdict:word>Furbies</cdict:word>
    <cdict:stem>Furby</cdict:stem>
  </cdict:entry>
  <cdict:entry>
    <cdict:word>meeting</cdict:word>
    <cdict:stem>meeting</cdict:stem>
  </cdict:entry>
</cdict:dictionary>

Delete the Dictionary

The following example deletes the dictionary created in Install the Dictionary.

'use strict';
const cdict = require('/MarkLogic/custom-dictionary');
cdict.dictionaryDelete('en');

Calling the function again (when there is no custom dictionary installed for English) has no effect.

Example: Exercising a Custom Dictionary

You can perform a simple test of a custom tokenization dictionary using the cts:tokenize XQuery function or the cts.tokenize JavaScript function. You can perform a simple test of a custom stemming dictionary using the cts:stem XQuery function or the cts.stem JavaScript function. You can also exercise your dictionary by performing a search against content in a configured language.

For example, suppose you have the following dictionary:

<cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary">
  <cdict:entry>
    <cdict:word>servlets</cdict:word>
    <cdict:stem>servletti</cdict:stem>
  </cdict:entry>
</cdict:dictionary>

If you install this dictionary as a stemming dictionary for, say, French, then you can exercise it using the following code:

Language	Example
XQuery	xquery version "1.0-ml"; cts:stem("servlets", "fr")
JavaScript	'use strict'; cts.stem('servlets', 'fr')

The word servlets should stem to servletti.

If you install the same dictionary as a tokenization dictionary for French, then you can exercise it using the following code:

Language	Example
XQuery	xquery version "1.0-ml"; cts:tokenize("aservletse", "fr")
JavaScript	'use strict'; cts.tokenize('aservletse', 'fr')

The input should tokenize to three tokens: "a", "servlets", "e".
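
To see why the dictionary entry produces three tokens, consider the following toy model in plain JavaScript. It is not MarkLogic's tokenizer and makes no claim about how the real lexer works; the function toyTokenize is hypothetical and simply splits a run of letters around the first dictionary word it finds, which is enough to reproduce the result described above.

```javascript
'use strict';

// Toy model only, NOT MarkLogic's tokenizer: split a run of letters around
// the first occurrence of any dictionary word.
function toyTokenize(text, dictionaryWords) {
  for (const word of dictionaryWords) {
    const at = text.indexOf(word);
    if (at !== -1) {
      const before = text.slice(0, at);
      const after = text.slice(at + word.length);
      // Drop empty pieces so a match at either end yields no empty token.
      return [before, word, after].filter((piece) => piece.length > 0);
    }
  }
  return [text];  // no dictionary word found: the run stays one token
}

console.log(toyTokenize('aservletse', ['servlets']));  // [ 'a', 'servlets', 'e' ]
console.log(toyTokenize('aservletse', []));            // [ 'aservletse' ]
```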
