Search Developer's Guide — Chapter 20

Using the Spelling Correction Functions

MarkLogic Server includes functions that enable applications to provide spelling correction capabilities. Spelling applications use dictionary documents to find likely corrections for words that a user may have misspelled. A common example is an application that prompts the user when a search term looks misspelled. For example, if a user enters a search for the word albetros, an application that uses the spelling correction functions might ask whether the user meant albatross.

This chapter describes how to use the spelling correction functions.

Overview of Spelling Correction

The spelling correction functions enable you to create applications that check whether words are spelled correctly. The functions use one or more dictionaries that you load into the database, and they check words against the dictionary you specify. You control which words are in each dictionary. There are functions to manage the dictionaries, check spelling, and suggest words for misspellings.

Function Reference

The reference information for the spelling module functions is included in the MarkLogic XQuery and XSLT Function Reference. The spelling functions are divided into the following categories:

The Spelling Built-In Functions

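The spelling correction functions are built-in functions and do not require an import module statement in the XQuery prolog. The following are the spelling correction functions:

  • spell:is-correct
  • spell:suggest
  • spell:suggest-detailed
  • spell:double-metaphone
  • spell:levenshtein-distance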

The spell:double-metaphone and spell:levenshtein-distance functions return the raw values from which spell:suggest, spell:suggest-detailed, and spell:is-correct calculate their values.
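As a rough illustration, the two raw functions can be called directly. The sample words are arbitrary, and the keys that spell:double-metaphone returns depend on the server's implementation, so no output is shown:

```xquery
xquery version "1.0-ml";

(: Built-in functions: no import statement is needed. :)
(
  (: phonetic keys for a word :)
  spell:double-metaphone("albatross"),
  (: edit distance between a misspelling and a candidate correction :)
  spell:levenshtein-distance("albetros", "albatross")
)
```

Under the standard edit-distance definition, albetros and albatross differ by one substitution and one insertion.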

The difference between spell:suggest and spell:suggest-detailed is that spell:suggest-detailed provides some of the information used in calculating the suggestions, and it returns that information in an XML representation (whereas spell:suggest returns a sequence of suggested words). For most spelling applications, spell:suggest is sufficient. If you want finer control over the suggestions you provide (for example, to calculate your own ordering of the suggestions), you can use spell:suggest-detailed and then filter on some of the criteria returned in its XML output.
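The following sketch shows one way such filtering might look. It assumes that each suggestion in the spell:suggest-detailed output is a spell:word element carrying a levenshtein-distance attribute. Check the function reference for the exact output format of your server version before relying on these names. The dictionary URI and the threshold of 2 are illustrative only.

```xquery
xquery version "1.0-ml";

(: Assumption: suggestions are spell:word elements with a
   levenshtein-distance attribute. :)
for $w in spell:suggest-detailed("/mySpell/spell.xml", "alfabet")//spell:word
(: keep only suggestions within two edits of the input :)
where xs:integer($w/@levenshtein-distance) le 2
(: re-rank by raw edit distance instead of the default order :)
order by xs:integer($w/@levenshtein-distance)
return fn:string($w)
```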

The Spelling Dictionary Management Module Functions

There is an XQuery module to perform management of dictionary documents. The spelling correction dictionary management functions are installed into the following XQuery module file:

  • install_dir/Modules/MarkLogic/spell.xqy

where install_dir is the directory in which MarkLogic Server is installed. The functions in the spelling module use the spell: namespace prefix, which is predefined in the server. To use the functions in this module, include the module declaration in the prolog of your XQuery program as follows:

import module "http://marklogic.com/xdmp/spell" at "/MarkLogic/spell.xqy";

Dictionary Documents

Any dictionary documents loaded into MarkLogic Server must have the following basic structure:

<dictionary xmlns="http://marklogic.com/xdmp/spell">
    <metadata>
    </metadata>
    <word></word>
    <word></word>
    ......
</dictionary>

Sample dictionary documents are available for download (see Loading One of the Sample XML Dictionaries). You can use these documents or create your own dictionaries. You can also use the spell:make-dictionary spelling management function to create a dictionary document, and then use spell:insert to store it in the database (or save it to a file and load it with spell:load).
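A minimal sketch of building a small dictionary in memory and storing it. The word list is made up for illustration, and the sketch uses spell:insert (described later in this chapter) because the dictionary exists only as an in-memory element, not as a file on disk:

```xquery
xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" at "/MarkLogic/spell.xqy";

(: Hypothetical word list; replace with your own vocabulary. :)
let $dict := spell:make-dictionary(("albatross", "alphabet", "alphabetic"))
(: Store the generated dictionary element at the given URI. :)
return spell:insert("/mySpell/spell.xml", $dict)
```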

Capitalization

The spelling lookup functions (spell:is-correct, spell:suggest, and spell:suggest-detailed) are case-sensitive, so case is important for words in a dictionary. Additionally, there are some special rules to handle the first character in a spelling lookup. The following are the capitalization rules for the spelling correction functions:

  • A capital first letter in a spelling lookup query does not make the spelling incorrect for spell:is-correct. For example, Word will match an entry for word in the dictionary.
  • If a word has the first letter capitalized in the dictionary, then only exact matches will be correct for spell:is-correct. For example, if Word is in the dictionary, then word is incorrect.
  • If a word has other letters capitalized in the dictionary, then only exact matches (or exact matches except for the case of the first letter in the word) will match for spell:is-correct. For example, neither word nor WOrd will match an entry for woRd, but WoRd will match.
  • The spell:suggest function (and the spell:suggest-detailed function) observes the capitalization of the first letter only. For example, spell:suggest("THe") will return The, Thee, They, and so on as suggestions, while spell:suggest("tHe") will give the, thee, they, and so on. In other words, if you capitalize the first letter of the argument to the spell:suggest function, the suggestions will all begin with a capital letter. Otherwise, you will get lowercase suggestions.

If you want your applications to ignore case, then you should create a dictionary with all lowercase words and lowercase (using the XQuery fn:lower-case function, for example) the word arguments of all spell:is-correct and spell:suggest functions before submitting your queries.
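The case-insensitive approach might be sketched as follows, assuming /mySpell/spell.xml contains only lowercase words. The input value is a hypothetical user entry.

```xquery
xquery version "1.0-ml";

let $dict  := "/mySpell/spell.xml"  (: assumed to hold lowercase words only :)
let $input := "ALFabet"             (: hypothetical user input :)
(: normalize the input before every lookup :)
let $word  := fn:lower-case($input)
return
  if (spell:is-correct($dict, $word))
  then $word
  else spell:suggest($dict, $word)
```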

Managing Dictionary Documents

You can have any number of dictionary documents in a database. You can also add to or modify any dictionary documents that already exist. This section describes how to load and update dictionary documents, and contains the following topics:

Loading Dictionary Documents

To use a dictionary in a query, use the spell:load function or the spell:insert function to load a document as a dictionary. For example, to load a dictionary document with a URI /mySpell/spell.xml, execute a query similar to the following:

xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" at "/MarkLogic/spell.xqy";

spell:load("c:\dictionaries\spell.xml", "/mySpell/spell.xml")

This XQuery adds all of the <word> elements from the c:\dictionaries\spell.xml file to a dictionary with the URI /mySpell/spell.xml. If the document already exists, then it is overwritten with the new content from the specified file.

Loading One of the Sample XML Dictionaries

You can download a sample dictionary from the MarkLogic Community site. The community site links to GitHub, which has small, medium, and large versions of the dictionary. Once you download a dictionary XML file, you can load it as a dictionary document using the spell:load function.

Perform the following steps to download and load a sample dictionary:

  1. Go to the Code page of developer.marklogic.com:
    http://developer.marklogic.com/code/#dictionaries
  2. Navigate to the dictionary document section, then click the GitHub link:
    https://github.com/marklogic/dictionaries
  3. In the dictionaries folder, choose the small-dictionary.xml, medium-dictionary.xml, or large-dictionary.xml file (or any other dictionary documents that might be available). The large dictionary has approximately 100,000 words and is about 3 MB to download.
  4. Save <size>-dictionary.xml to a file (for example, c:\dictionaries\spell.xml).
  5. Load the dictionary with a command similar to the following:
    xquery version "1.0-ml";
    import module "http://marklogic.com/xdmp/spell" at
                  "/MarkLogic/spell.xqy";
    
    spell:load("c:\dictionaries\spell.xml", "/mySpell/spell.xml")

This loads the dictionary with a URI of /mySpell/spell.xml. You can now use this URI with the spelling correction module functions.

Updating a Dictionary Document

Use the following dictionary functions to modify existing dictionary documents:

  • spell:add-word
  • spell:remove-word
  • spell:insert

The spell:insert function overwrites the dictionary document at the specified URI if one already exists, and creates a new dictionary document if none exists at that URI.

The transactional unit in MarkLogic Server is a query; therefore, if you are performing multiple updates to the same dictionary document, be sure to perform those updates as part of separate queries. You can place a semicolon between the update statements to start a new query (and therefore a new transaction). If you use a semicolon to start a new query that uses the spelling management functions, each query must include the import statement in its prolog to resolve the spelling module.
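For illustration, two updates to the same dictionary could be separated as follows. The words are arbitrary examples. Each statement repeats the version declaration and the import in its own prolog:

```xquery
xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" at "/MarkLogic/spell.xqy";

(: First transaction: add one word. :)
spell:add-word("/mySpell/spell.xml", "albatross")
;
xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" at "/MarkLogic/spell.xqy";

(: Second transaction: a separate update to the same dictionary document. :)
spell:add-word("/mySpell/spell.xml", "alphabet")
```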

Example: Adding a New Word to a Dictionary

The following XQuery uses the spell:add-word function to add an entry for albatross to the dictionary with URI /mySpell/spell.xml:

xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" at
              "/MarkLogic/spell.xqy";

spell:add-word("/mySpell/spell.xml", "albatross")

If the /mySpell/spell.xml dictionary has an identical entry, there will be no change to the dictionary. Otherwise, an entry for albatross is added to the dictionary.

Example: Removing a Word From a Dictionary

The following XQuery uses the spell:remove-word function to remove the entry for albatross from the dictionary with URI /mySpell/spell.xml:

xquery version "1.0-ml";
import module "http://marklogic.com/xdmp/spell" at
              "/MarkLogic/spell.xqy";

spell:remove-word("/mySpell/spell.xml", "albatross")

This removes the word albatross from the /mySpell/spell.xml dictionary document.

Security Considerations With Dictionary Documents

Dictionary documents are stored in XML format in the database. Therefore, they can be queried just like any other document. Note the following about security and dictionary documents:

  • By default, dictionary documents are loaded into the following collections:
    • http://marklogic.com/xdmp/documents
    • http://marklogic.com/xdmp/spell
  • Dictionary documents are loaded with the default permissions of the user who loads them. Make sure users who load dictionary documents have appropriate privileges; otherwise, the documents might not have the permissions needed for reading and updating. For more information, see Setting Document Permissions in the Loading Content Into MarkLogic Server Guide.
  • If you want to control access (read and/or write) to dictionary documents beyond the default permissions with which the documents are loaded, call xdmp:document-set-permissions after a spell:load operation.
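A sketch of tightening permissions on a dictionary that has already been loaded. The role names spell-reader and spell-editor are hypothetical and would have to exist in your security database. Run this as a separate query after the spell:load has committed.

```xquery
xquery version "1.0-ml";

(: Replace the document's permissions with an explicit set.
   Role names are hypothetical placeholders. :)
xdmp:document-set-permissions(
  "/mySpell/spell.xml",
  (
    xdmp:permission("spell-reader", "read"),
    xdmp:permission("spell-editor", "update")
  )
)
```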

Testing if a Word is Spelled Correctly

You can use the spell:is-correct function to test whether a word is spelled correctly (according to the specified dictionary). Consider the following query:

spell:is-correct("/mySpell/spell.xml", "alphabet")

This query returns true because the word alphabet is spelled correctly. Now consider the following query:

spell:is-correct("/mySpell/spell.xml", "alfabet")

This query returns false because the word alfabet is not spelled correctly.

Getting Spelling Suggestions for Incorrectly Spelled Words

You can write a query which returns spelling suggestions based on words in the specified dictionary. Consider the following query:

spell:suggest("/mySpell/spell.xml", "alfabet")

This query returns the following results:

alphabet albeit alphabets aloft abet alphabeted affable alphabet's alphabetic offbeat

The results are ranked, with the word most likely to be the intended spelling listed first. Your application can then ask the user whether one of the suggested words was the word intended.

Now consider the following query:

spell:suggest("/mySpell/spell.xml", "alphabet")

This query returns the empty sequence, indicating that the word is spelled correctly.
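Putting the two lookups together, an application might build a prompt along these lines. This is only a sketch: the input word and the message text are illustrative.

```xquery
xquery version "1.0-ml";

let $dict := "/mySpell/spell.xml"
let $word := "alfabet"  (: hypothetical user input :)
return
  if (spell:is-correct($dict, $word)) then
    ()  (: spelled correctly: nothing to prompt :)
  else
    (: take the top-ranked suggestion, if there is one :)
    let $best := spell:suggest($dict, $word)[1]
    return
      if (fn:exists($best))
      then fn:concat("Did you mean ", $best, "?")
      else ()
```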

The spelling correction functions provide suggestions only for words that are fewer than 64 characters long, and they return only suggestions that are fewer than 64 characters long.
