Search Developer's Guide — Chapter 24

Using the Spelling Correction Functions

MarkLogic Server includes functions that enable applications to provide spelling capabilities. Spelling applications use dictionary documents to detect possible misspellings in words entered by a user and to suggest corrections. A typical application prompts the user about words that might be misspelled. For example, if a user enters a search for the word albetros, an application that uses the spelling correction functions might ask the user whether she means albatross.

This chapter describes how to use the spelling correction functions and contains the following sections:

Overview of Spelling Correction

The spelling correction functions enable you to create applications that check whether words are spelled correctly. They use one or more dictionaries that you load into the database and check words against a dictionary you specify. You control exactly which words are in each dictionary. There are functions to manage the dictionaries, check spelling, and suggest words for misspellings.

Function Reference

The reference information for the spelling module functions is included in the MarkLogic XQuery and XSLT Function Reference and the MarkLogic Server-Side JavaScript Function Reference available through docs.marklogic.com. The spelling functions are divided into the following categories:

The Spelling Built-In Functions

The spelling correction functions are built-in functions and do not require the import module statement in the XQuery prolog. The following are the spelling correction functions:

XQuery Function               Server-Side JavaScript Function
spell:is-correct              spell.isCorrect
spell:suggest                 spell.suggest
spell:suggest-detailed        spell.suggestDetailed
spell:double-metaphone        spell.doubleMetaphone
spell:levenshtein-distance    spell.levenshteinDistance

The spell:double-metaphone / spell.doubleMetaphone and spell:levenshtein-distance / spell.levenshteinDistance functions return the raw values from which spell:suggest / spell.suggest, spell:suggest-detailed / spell.suggestDetailed, and spell:is-correct / spell.isCorrect calculate their values.

The difference between spell:suggest (JavaScript spell.suggest) and spell:suggest-detailed (JavaScript spell.suggestDetailed) is that spell:suggest-detailed provides some of the information used in calculating the suggestions, and it returns a report (an XML representation in XQuery and an array of objects in JavaScript), whereas spell:suggest returns a sequence of suggested words. For most spelling applications, spell:suggest is sufficient, but if you want finer control of the suggestions you provide (for example, if you want to calculate your own order of returning the suggestions), you can use spell:suggest-detailed and then filter on some of the criteria returned in its XML or JSON output.
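To make the raw edit-distance value more concrete, the following is a minimal plain-JavaScript sketch of the classic Levenshtein algorithm. It is an illustration only: the function name editDistance is hypothetical, and nothing here is claimed to match how the built-in spell.levenshteinDistance is implemented or how the suggestion functions weigh the value.

```javascript
// Classic Levenshtein edit distance (illustrative; not MarkLogic's implementation).
// Counts the minimum number of single-character insertions, deletions,
// and substitutions needed to turn `a` into `b`.
function editDistance(a, b) {
  // prev[j] is the distance between the first i-1 characters of `a`
  // and the first j characters of `b`.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i characters of `a` to ""
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep when the characters match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(editDistance("albetros", "albatross")); // 2
```

Here albetros becomes albatross with one substitution (e to a) and one insertion (the final s), so the distance is 2.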

The Spelling Dictionary Management Module Functions

There is an XQuery module for managing dictionary documents. You can use this module in either XQuery or Server-Side JavaScript. The spelling correction dictionary management functions are installed into the following XQuery module file:

  • install_dir/Modules/MarkLogic/spell.xqy

where install_dir is the directory in which MarkLogic Server is installed. The functions in the spelling module use the spell: namespace prefix, which is predefined in the server.

To use the functions in this module in an XQuery program, include the module declaration in the prolog of your XQuery program as follows:

import module namespace spell = "http://marklogic.com/xdmp/spell" 
   at "/MarkLogic/spell.xqy";

To use the functions in this module in a JavaScript program, include a line similar to the following in your Server-Side JavaScript program:

const spell = require("/MarkLogic/spell");

Dictionary Documents

There are two types of dictionary documents you can load into MarkLogic: XML dictionary documents and JSON dictionary documents.

There are sample XML and JSON dictionary documents available at the following GitHub repository:

https://github.com/marklogic/dictionaries

You can use these documents or create your own dictionaries. You can also use the spell:make-dictionary / spell.makeDictionary spelling management function to create a dictionary document, and then use spell:load / spell.load to load the dictionary into the database.

XML Dictionary Document

Any XML dictionary documents loaded into MarkLogic must have the following basic structure:

<dictionary xmlns="http://marklogic.com/xdmp/spell">
    <metadata>
    </metadata>
    <word></word>
    <word></word>
    ......
</dictionary>

The <metadata> element is optional. Use spell:make-dictionary / spell.makeDictionary and spell:load / spell.load to create your own dictionary documents.

JSON Dictionary Document

Any JSON dictionary documents loaded into MarkLogic must have the following basic structure:

{
 "metadata": { ... },
 "words": ["word1", "word2", ... ]
}

The metadata property is optional. Use spell:make-dictionary / spell.makeDictionary and spell:load / spell.load to create your own dictionary documents.
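As a rough illustration of the JSON shape above, the following plain-JavaScript sketch assembles an object with a words array and an optional metadata property. The helper name buildDictionaryObject is hypothetical and is not part of the spelling module; spell.makeDictionary remains the documented way to create dictionary documents.

```javascript
// Hypothetical helper: build a plain object with the JSON dictionary shape
// shown above. "metadata" is optional, so it is only added when supplied.
function buildDictionaryObject(words, metadata) {
  // Drop non-string and empty entries, then remove duplicates.
  // The case of each word is kept exactly as given.
  const unique = [
    ...new Set(words.filter((w) => typeof w === "string" && w.length > 0)),
  ];
  const doc = { words: unique };
  if (metadata !== undefined) {
    doc.metadata = metadata;
  }
  return doc;
}

const sample = buildDictionaryObject(["albatross", "alphabet", "albatross"]);
console.log(JSON.stringify(sample)); // {"words":["albatross","alphabet"]}
```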

Capitalization

The spelling lookup functions (spell:is-correct, spell:suggest, and spell:suggest-detailed in XQuery, spell.isCorrect, spell.suggest, and spell.suggestDetailed in JavaScript) are case-sensitive, so case is important for words in a dictionary. Additionally, there are some special rules to handle the first character in a spelling lookup. The following are the capitalization rules for the spelling correction functions:

  • A capital first letter in a spelling lookup query does not make the spelling incorrect for spell:is-correct / spell.isCorrect. For example, Word will match an entry for word in the dictionary.
  • If a word has the first letter capitalized in the dictionary, then only exact matches will be correct for spell:is-correct / spell.isCorrect. For example, if Word is in the dictionary, then word is incorrect.
  • If a word has other letters capitalized in the dictionary, then only exact matches (or exact matches except for the case of the first letter in the word) will match for spell:is-correct / spell.isCorrect. For example, word will not match an entry for woRd, nor will WOrd, but WoRd will match.
  • The spell:suggest / spell.suggest functions and the spell:suggest-detailed / spell.suggestDetailed functions all observe the capitalization of the first letter only. For example, spell:suggest("THe") will return The, Thee, They, and so on as suggestions, while spell:suggest("tHe") will give the, thee, they, and so on. In other words, if you capitalize the first letter of the argument to the spell:suggest / spell.suggest function, the suggestions will all begin with a capital letter. Otherwise, you will get lowercase suggestions.
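The first three rules in the list above can be modeled in a few lines of plain JavaScript. This is a sketch of the rules as stated in this section, not the server's implementation; the function name isCorrectModel and the in-memory word list are assumptions made for illustration.

```javascript
// Illustrative model of the spell:is-correct capitalization rules listed above.
// `dictionaryWords` is an array of words exactly as stored in a dictionary.
function isCorrectModel(dictionaryWords, query) {
  const entries = new Set(dictionaryWords);
  // An exact, case-sensitive match is always correct.
  if (entries.has(query)) return true;
  if (query.length === 0) return false;
  // A capitalized first letter in the query is forgiven: retry with only
  // the first letter lowercased, leaving every other letter untouched.
  const first = query[0];
  const isCapitalized = first !== first.toLowerCase();
  if (isCapitalized) {
    return entries.has(first.toLowerCase() + query.slice(1));
  }
  return false;
}

console.log(isCorrectModel(["word"], "Word")); // true
console.log(isCorrectModel(["Word"], "word")); // false
console.log(isCorrectModel(["woRd"], "WoRd")); // true
console.log(isCorrectModel(["woRd"], "WOrd")); // false
```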

If you want your applications to ignore case, then you should create a dictionary with all lowercase words and lowercase (using the XQuery fn:lower-case function, for example) the word arguments of all spell:is-correct / spell.isCorrect and spell:suggest / spell.suggest functions before submitting your queries.
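The case-insensitive approach described above can be sketched in plain JavaScript as follows. The helper names are hypothetical, and String.prototype.toLowerCase stands in for the XQuery fn:lower-case function mentioned in the text; the two are assumed, not verified, to agree for the words you care about.

```javascript
// Hypothetical helpers for a case-insensitive setup.

// Lowercase every word before it goes into the dictionary, removing
// duplicates that only differed by case.
function toLowercaseWordList(words) {
  return [...new Set(words.map((w) => w.toLowerCase()))];
}

// Lowercase the user's word before passing it to a spelling lookup.
function normalizeQueryWord(word) {
  return word.toLowerCase();
}

const dictionaryWords = toLowercaseWordList(["Albatross", "ALPHABET", "alphabet"]);
const query = normalizeQueryWord("AlphaBet");
console.log(dictionaryWords.includes(query)); // true
```

In an application, the normalized word is what you would pass to spell.isCorrect or spell.suggest, with a dictionary built from the lowercased word list.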

Managing Dictionary Documents

You can have any number of dictionary documents in a database. You can also add to or modify any dictionary documents that already exist. This section describes how to load and update dictionary documents, and contains the following topics:

Loading Dictionary Documents in XQuery

To use a dictionary in a query, it must be in the database. To load a dictionary document using XQuery, use the spell:load function or the spell:insert function. For example, to load a dictionary document with a URI /mySpell/spell.xml, execute a query similar to the following:

xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" 
  at "/MarkLogic/spell.xqy";

spell:load("c:\dictionaries\spell.xml", "/mySpell/spell.xml")

This XQuery adds all of the <word> elements from the c:\dictionaries\spell.xml file to a dictionary with the URI /mySpell/spell.xml. If the document already exists, then it is overwritten with the new content from the specified file.

Loading Dictionary Documents in JavaScript

To use a dictionary in a query, it must be in the database. To load a dictionary document using JavaScript, use the spell.load function or the spell.insert function. For example, to load a dictionary document with a URI /mySpell/spell.json, execute a program similar to the following:

const spell = require("/MarkLogic/spell");
declareUpdate();
spell.load("c:/dictionaries/spell.json", "/mySpell/spell.json");

This loads the file at the specified path into the dictionary JSON document at the specified URI.

Loading one of the Sample XML Dictionaries

You can download a sample dictionary from the MarkLogic Community site. The community site links to GitHub, which has small, medium, and large versions of the dictionary. Once you download a dictionary XML file, you can load it as a dictionary document using the spell:load function.

Perform the following steps to download and load a sample dictionary:

  1. Go to the Code page of developer.marklogic.com:
    http://developer.marklogic.com/code/#dictionaries
  2. Navigate to the dictionary document section, then click the GitHub link:
    https://github.com/marklogic/dictionaries
  3. In the dictionaries folder, choose the small-dictionary.xml, medium-dictionary.xml, or large-dictionary.xml file (or any other dictionary documents that might be available). The large dictionary has approximately 100,000 words and is about 3 MB to download. Alternatively, you can choose the small-dictionary.json, medium-dictionary.json, or large-dictionary.json file to load a JSON dictionary.
  4. Save <size>-dictionary.xml (or the corresponding JSON document) to a file (for example, c:\dictionaries\spell.xml).
  5. Load the dictionary with a command similar to the following:
    xquery version "1.0-ml";
    import module "http://marklogic.com/xdmp/spell" at
                  "/MarkLogic/spell.xqy";
    
    spell:load("c:\dictionaries\spell.xml", "/mySpell/spell.xml")
This loads the dictionary with a URI of /mySpell/spell.xml. You can now use this URI with the spelling correction module functions.

Updating a Dictionary Document

Use the spell:add-word, spell:remove-word, and spell:insert functions (spell.addWord, spell.removeWord, and spell.insert in JavaScript) to modify existing dictionary documents.

The spell:insert XQuery function or the spell.insert JavaScript function overwrites the dictionary document at the specified URI if one already exists, and creates a new one if it does not.

The transactional unit in MarkLogic Server is a query; therefore, if you are performing multiple updates to the same dictionary document, be sure to perform those updates as part of separate queries. In XQuery, you can place a semicolon between the update statements to start a new query (and therefore a new transaction). If you use a semicolon to start a new query that uses spelling correction functions in XQuery, each query must include the import statement in the prolog to resolve the spelling module.

The following topics are about updating dictionary documents:

Example: Adding a New Word to a Dictionary in XQuery

The following XQuery uses the spell:add-word function to add an entry for albatross to the dictionary with URI /mySpell/spell.xml:

xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" at
              "/MarkLogic/spell.xqy";

spell:add-word("/mySpell/spell.xml", "albatross")

If the /mySpell/spell.xml dictionary has an identical entry, there will be no change to the dictionary. Otherwise, an entry for albatross is added to the dictionary.

Example: Adding a New Word to a Dictionary in JavaScript

The following JavaScript program uses the spell.addWord function to add an entry for albatross to the dictionary with URI /mySpell/spell.json:

const spell = require("/MarkLogic/spell.xqy");
declareUpdate();

spell.addWord("/mySpell/spell.json", "albatross");

If the /mySpell/spell.json dictionary has an identical entry, there will be no change to the dictionary. Otherwise, an entry for albatross is added to the dictionary.

Example: Removing a Word From a Dictionary in XQuery

The following XQuery uses the spell:remove-word function to remove the entry for albatross from the dictionary with URI /mySpell/spell.xml:

xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" at
              "/MarkLogic/spell.xqy";

spell:remove-word("/mySpell/spell.xml", "albatross")

This removes the word albatross from the /mySpell/spell.xml dictionary document.

Example: Removing a Word From a Dictionary in JavaScript

The following JavaScript program uses the spell.removeWord function to remove the entry for albatross from the dictionary with URI /mySpell/spell.json:

const spell = require("/MarkLogic/spell.xqy");
declareUpdate();

spell.removeWord("/mySpell/spell.json", "albatross");

This removes the word albatross from the /mySpell/spell.json dictionary document.
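The add and remove behavior described in these examples can be pictured with a small in-memory model. The sketch below operates on a plain object with the JSON dictionary shape and does not touch a database; the function names are hypothetical, and appending a new word to the end of the array is an assumption rather than documented behavior.

```javascript
// In-memory model of the add/remove semantics described above.
// Works on a plain object shaped like a JSON dictionary: { words: [...] }.

function addWordModel(dict, word) {
  // An identical entry already exists: return the dictionary unchanged.
  if (dict.words.includes(word)) return dict;
  // Otherwise return a copy with the new word appended (position is assumed).
  return { ...dict, words: [...dict.words, word] };
}

function removeWordModel(dict, word) {
  // Return a copy without the given word; other entries are untouched.
  return { ...dict, words: dict.words.filter((w) => w !== word) };
}

let dictionary = { words: ["alphabet"] };
dictionary = addWordModel(dictionary, "albatross");
dictionary = addWordModel(dictionary, "albatross"); // no change the second time
console.log(dictionary.words.length); // 2
dictionary = removeWordModel(dictionary, "albatross");
console.log(dictionary.words.length); // 1
```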

Security Considerations With Dictionary Documents

Dictionary documents are stored in XML or JSON format in the database. Therefore, they can be queried just like any other document. Note the following about security and dictionary documents:

  • By default, dictionary documents are loaded into the following collections:
    • http://marklogic.com/xdmp/documents
    • http://marklogic.com/xdmp/spell
  • Dictionary documents are loaded with the default permissions of the user who loads them. Make sure users who load dictionary documents have appropriate privileges; otherwise, the documents might not have the needed permissions for reading and updating. For more information, see Setting Document Permissions in the Loading Content Into MarkLogic Server Guide.
  • If you want to control access (read and/or write) to dictionary documents beyond the default permissions with which the documents are loaded, perform an xdmp:document-set-permissions (JavaScript xdmp.documentSetPermissions) after a spell:load / spell.load operation.

Testing if a Word is Spelled Correctly

You can use the spell:is-correct XQuery function or the spell.isCorrect JavaScript function to see if a word is spelled correctly (according to the specified dictionary). Consider the following XQuery statement:

spell:is-correct("/mySpell/spell.xml", "alphabet")

This returns true because the word alphabet is spelled correctly. The following is the equivalent in JavaScript:

spell.isCorrect("/mySpell/spell.xml", "alphabet");

Now consider the following XQuery statement:

spell:is-correct("/mySpell/spell.xml", "alfabet")

This returns false because the word alfabet is not spelled correctly. The following is the equivalent in JavaScript:

spell.isCorrect("/mySpell/spell.xml", "alfabet");

Getting Spelling Suggestions for Incorrectly Spelled Words

You can write a query which returns spelling suggestions based on words in the specified dictionary. Consider the following XQuery statement:

spell:suggest("/mySpell/spell.xml", "alfabet")

Or the equivalent JavaScript program:

spell.suggest("/mySpell/spell.xml", "alfabet");

This returns the following results:

alphabet albeit alphabets aloft abet alphabeted affable alphabet's alphabetic offbeat

The results are ranked so that the first word is the one most likely to be the intended spelling. Your application can then ask the user whether one of the suggested words is the word intended.

Now consider the following XQuery statement:

spell:suggest("/mySpell/spell.xml", "alphabet")

Or the equivalent JavaScript program:

spell.suggest("/mySpell/spell.xml", "alphabet");

This returns the empty sequence, indicating that the word is spelled correctly.

The spelling correction functions only provide suggestions for words that are less than 64 characters in length, and the functions only return suggestions that are less than 64 characters.
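Putting the pieces of this section together, an application might wrap the check and the suggestion lookup in one helper. The following is a sketch under stated assumptions: didYouMean is a hypothetical name, the spelling functions are passed in as an object so that the code also runs outside MarkLogic, the result of suggest is assumed to be iterable (or null when empty), and the 64-character guard counts Unicode code points, which may not be how the server counts.

```javascript
// Sketch of a "did you mean" helper. `spellApi` is any object exposing
// isCorrect(uri, word) and suggest(uri, word), such as the spell functions.
const MAX_WORD_LENGTH = 64; // suggestions are only provided below this length

function didYouMean(spellApi, dictionaryUri, word) {
  if (spellApi.isCorrect(dictionaryUri, word)) {
    return { word, correct: true, suggestions: [] };
  }
  // Skip the lookup for words the suggestion functions would not handle.
  // Counting code points here is an assumption.
  if ([...word].length >= MAX_WORD_LENGTH) {
    return { word, correct: false, suggestions: [] };
  }
  const raw = spellApi.suggest(dictionaryUri, word);
  // Copy the result into a plain array; treat null/undefined as "none".
  const suggestions = raw == null ? [] : Array.from(raw);
  return { word, correct: false, suggestions };
}

// Stand-in used only so the sketch runs outside MarkLogic.
const fakeSpell = {
  isCorrect: (uri, w) => w === "alphabet",
  suggest: (uri, w) => ["alphabet", "albeit"],
};

const result = didYouMean(fakeSpell, "/mySpell/spell.xml", "alfabet");
console.log(result.correct);        // false
console.log(result.suggestions[0]); // alphabet
```

Because the suggestions are ranked, showing the first entry as the "did you mean" prompt is a reasonable default.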
