Custom dictionaries customize the way words are stemmed and tokenized for each language. This chapter describes custom dictionaries and how to manage them.
If you want to change the default stemming or tokenizing behavior for any MarkLogic supported and licensed language, you can create or modify a custom dictionary. The custom dictionary entries override the built-in stemming data or add new stemming data for MarkLogic to use. For example, assume 'meeting' is mapped to stem 'meet' in the built-in dictionary. But, due to calendar entries, you don't want 'meeting' to match 'meet'. You can prevent this by adding a custom dictionary entry for the word 'meeting' with 'meeting' as a stem.
Custom dictionaries have three uses:
Japanese (ja), Simplified Chinese (zh), and Traditional Chinese (zh_Hant) all use a linguistic tokenizer to divide text into tokens (words and punctuation), since these languages do not have spaces separating words. You can change the tokenizer's behavior by adding entries to a language's custom dictionary.
There is only one custom dictionary for each MarkLogic licensed and supported language. To get the custom dictionary for a language, first get a list of all of the licensed languages for your system. By iterating over this list, you can get any of your custom dictionaries. Custom dictionaries are stored in the data directory, so they survive MarkLogic server upgrades.
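As a minimal sketch of that iteration, the following query loops over the licensed languages and reports how many entries each custom dictionary holds. It uses only the cdict functions described in this chapter. The summary element is a hypothetical reporting format, and the sketch assumes that reading a language with no custom dictionary yields an empty sequence rather than an error.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

(: Visit every licensed language and count its custom dictionary entries :)
for $lang in cdict:get-languages()
let $dict := cdict:dictionary-read($lang)
return
  (: summary is a made-up element used only for this report :)
  <summary lang="{$lang}" entries="{fn:count($dict/cdict:entry)}"/>
```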
The custom dictionary format and API use namespaces, and dictionaries are validated when they are written to avoid runtime errors. However, MarkLogic does not check for duplicate entries, which are unnecessary but do not cause errors. Nor does it check for non-Latin characters in an English custom dictionary, although entries with such characters are invalid and are not processed. Fixing errors in a custom dictionary does not require a restart.
Caution: Reindexing is required if documents are affected by custom dictionary changes. To determine what, if any, documents are affected before changing the dictionary, do a word search (not a stemmed search) for any words that your changes will add or delete. For example, if you are adding 'viewer' to stem to 'view' and deleting the 'viewing' to 'view' stem, you would want to search for both 'viewer' and 'viewing', as these are the words which are affected by changing the dictionary. 'view' itself is not affected. Documents containing those words will need reindexing after the dictionary change.
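The word search described in the caution might look like the sketch below, which lists the URIs of documents containing either affected word. It assumes the query runs against the database whose documents you want to check, and it passes the "unstemmed" option to cts:word-query so that only the literal words match. Depending on your index settings, an unstemmed search may be slow on a large database.

```xquery
xquery version "1.0-ml";

(: Words whose stemming the planned dictionary change adds or removes :)
let $affected-words := ("viewer", "viewing")
(: Match the literal words only, not other words sharing their stem :)
for $doc in cts:search(fn:doc(), cts:word-query($affected-words, "unstemmed"))
return xdmp:node-uri($doc)
```

Any URI returned identifies a document that would need reindexing after the dictionary change.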
Custom dictionaries are stored as XML documents and use a dedicated namespace and a common entry format. Here is an example custom dictionary with a single entry:
    <dictionary xmlns='http://marklogic.com/xdmp/custom-dictionary' xml:lang='en'>
      <entry>
        <word>servlets</word>
        <stem>servlet</stem>
        <pos>Nn</pos>
      </entry>
    </dictionary>
Note that the root <dictionary> element has an xml:lang attribute specifying the dictionary's associated language, in this case English.
A dictionary can have any number of <entry> elements, each of which contains three sub-elements:
<word>: A string containing a word to be stemmed.
<stem>: A string containing the stem for the <word> string.
<pos>: For languages without space-separated words, such as Chinese and Japanese, specifies a part of speech. For other languages, it is ignored. If omitted, Nn-Prop (proper noun) is the default value.
The common <pos> values are:
Value | Part of Speech |
---|---|
Adj | Adjective |
Adv | Adverb |
Interj | Interjection |
Nn | Noun |
Nn-Prop | Proper Noun (default) |
Verb | Verb |
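To tie the entry format back to the earlier 'meeting' example, here is a sketch of a dictionary that maps 'meeting' to itself so that it no longer stems to 'meet'. The query only constructs the element and does not install it. The <pos> element is left out on the assumption that it is ignored for English, as described above.

```xquery
xquery version "1.0-ml";
declare namespace cdict = "http://marklogic.com/xdmp/custom-dictionary";

(: A one-entry English dictionary: 'meeting' is its own stem :)
<cdict:dictionary xml:lang="en">
  <cdict:entry>
    <cdict:word>meeting</cdict:word>
    <cdict:stem>meeting</cdict:stem>
  </cdict:entry>
</cdict:dictionary>
```

Because there is only one custom dictionary per language, installing this element as-is would replace any existing English entries, so in practice the new entry would be merged into the current dictionary first.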
To use the custom-dictionary.xqy module in your own XQuery modules, include the following line in your XQuery prolog:
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";
This imports the module and binds the module namespace to the custom dictionary module prefix cdict.
There are four custom dictionary related functions, as shown below. Note that you cannot edit a custom dictionary in place. Instead, you have to first get the dictionary. Then add to, edit, or delete its entries. Finally, overwrite the old version by writing the new version out.
For more details on these functions, see MarkLogic XQuery and XSLT Function Reference.
cdict:get-languages() as xs:string*
Requires the http://marklogic.com/xdmp/privileges/custom-dictionary-user privilege. Returns the ISO language codes for all of your server's licensed languages. Note that MarkLogic only uses the 2-letter ISO 639-1 codes, including the zh_Hant variant of zh.
cdict:dictionary-read( $lang as xs:string ) as element(cdict:dictionary)
Requires the http://marklogic.com/xdmp/privileges/custom-dictionary-user privilege. $lang is an ISO language code. If $lang matches a licensed language with a custom dictionary, the local host returns that custom dictionary (which is the same across the cluster). The dictionary's xml:lang attribute is also returned, indicating its associated language. If $lang is not a licensed language, it raises an XDMP-LANG error.
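One way to avoid the XDMP-LANG error is to check the language code against the licensed list before reading. The sketch below assumes the code must match a value returned by cdict:get-languages() exactly. The "fr" input and the message text are hypothetical.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

(: Hypothetical input; replace with the language you want to inspect :)
let $lang := "fr"
return
  (: Only call dictionary-read for a code the server reports as licensed :)
  if ($lang = cdict:get-languages())
  then cdict:dictionary-read($lang)
  else fn:concat("No licensed language matches: ", $lang)
```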
cdict:dictionary-write( $lang as xs:string, $dict as element(cdict:dictionary) ) as empty-sequence()
Requires the http://marklogic.com/xdmp/privileges/custom-dictionary-admin privilege. $lang is an ISO language code. $dict is the custom dictionary. If $lang matches a licensed language and $dict validates, the cluster installs $dict and returns host IDs from where the dictionary is saved. If $lang is not a licensed language, it raises an XDMP-LANG error. If validation fails, it raises validation errors. dictionary-write ignores the xml:lang attribute.
cdict:dictionary-delete( $lang as xs:string ) as empty-sequence()
Requires the http://marklogic.com/xdmp/privileges/custom-dictionary-admin privilege. $lang is an ISO language code. If $lang matches a licensed language with a custom dictionary, the dictionary is deleted and it returns the host IDs from which the dictionary was deleted. Raises an XDMP-LANG error if $lang is not a licensed language.
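Since a dictionary cannot be edited in place, removing a single entry means reading the dictionary, rebuilding it without that entry, and writing the result back. The following is a sketch of that pattern, not a definitive implementation. The word "servlets" is only an example, and whether a dictionary left with no entries still validates is an assumption you should verify.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

let $lang := "en"
let $word := "servlets"  (: example word whose entry should be removed :)
let $dict := cdict:dictionary-read($lang)
return
  (: Nothing to do if there is no custom dictionary yet :)
  if (fn:empty($dict)) then ()
  else
    (: Rebuild the dictionary from every entry except the unwanted one :)
    cdict:dictionary-write($lang,
      element cdict:dictionary {
        attribute xml:lang { $lang },
        $dict/cdict:entry[fn:not(cdict:word = $word)]
      })
```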
The following code shows how to find a custom dictionary, get it, add and edit entries, write out the modified dictionary, and finally how to delete it. After each code sample is an example response.
First, get the sequence of supported and licensed languages:
    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";
    cdict:get-languages()
    => ("en", "ja", "zh", "zh_Hant")
Next, get the dictionary contents for a particular language, in this case English (en):
    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";
    cdict:dictionary-read("en")
    =>
    <cdict:dictionary xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary" xml:lang="en">
      <cdict:entry>
        <cdict:word>Furbies</cdict:word>
        <cdict:stem>Furby</cdict:stem>
      </cdict:entry>
      <cdict:entry>
        <cdict:word>servlets</cdict:word>
        <cdict:stem>servlet</cdict:stem>
      </cdict:entry>
    </cdict:dictionary>
Put the contents in a file, for example /var/tmp/cdict-en.xml, then edit the file or modify it with XQuery to add new entries and/or delete or modify existing entries.
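One possible way to put the contents in that file is xdmp:save, sketched below. This assumes an English custom dictionary already exists, that the MarkLogic process can write to /var/tmp on the host evaluating the query, and that the user holds whatever privilege xdmp:save requires on your system.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

(: Serialize the current English custom dictionary to the filesystem :)
xdmp:save("/var/tmp/cdict-en.xml", cdict:dictionary-read("en"))
```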
Next, install your modified dictionary in MarkLogic. The returned value is an empty sequence. Note that you do not have to specify where the dictionary goes, just what language to associate it with. The server knows where to put it. Since there can only be one custom dictionary for a language, this command overwrites any existing custom dictionary associated with the language argument.
    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";
    (: Load the edited dictionary from the filesystem; /* selects its root element :)
    let $dict := xdmp:document-get("/var/tmp/cdict-en.xml")/*
    return cdict:dictionary-write("en", $dict)
    => empty sequence
Finally, if you want to delete your English custom dictionary, you would do something similar to the following, which returns the empty sequence.
    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";
    cdict:dictionary-delete("en")
    => empty sequence

    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";
    cdict:dictionary-read("en")
The previous examples use several queries, each query performing its own transaction, to read, save, and modify a custom dictionary. The following example accomplishes something similar in a single transaction, where the XQuery program reads and then modifies a custom dictionary.
    (: Add an entry to the English custom dictionary :)
    xquery version "1.0-ml";
    import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";
    let $language := "en"
    (: Get the English custom dictionary :)
    let $dictionary := cdict:dictionary-read($language)
    (: Specify the new entry element as XML (not as a string) :)
    let $entry :=
      <cdict:entry>
        <cdict:word>views</cdict:word>
        <cdict:stem>view</cdict:stem>
      </cdict:entry>
    (: First, check if there are already any dictionary entries :)
    return
      if (fn:empty($dictionary))
      (: If no entries, then we have to create a cdict:dictionary element
         and insert our new entry before writing it out :)
      then cdict:dictionary-write($language,
        element cdict:dictionary { attribute xml:lang { $language }, $entry })
      (: If there are entries, build a new dictionary element that copies them
         and appends the new entry. The dictionary returned by dictionary-read
         is an in-memory node, so it cannot be changed with
         xdmp:node-insert-child. :)
      else cdict:dictionary-write($language,
        element cdict:dictionary {
          attribute xml:lang { $language },
          $dictionary/cdict:entry,
          $entry })
    ;
    xquery version "1.0-ml";
    (: Finally, test the new mapping in a separate statement; this should return 'view' :)
    cts:stem('views', 'en')