User-Defined Function API 11.0
marklogic::StemmerUDF Class Reference
abstract

Encapsulation of a User Defined Function for performing stemming of individual words.

#include <MarkLogic.h>

Public Member Functions

virtual void  close ()=0
  Release a StemmerUDF instance.
 
virtual bool  isStale () const =0
  Return true if the StemmerUDF instance can no longer be reused.
 
virtual void  initialize (const char *lang, int argc, const char **argv, Reporter &r)=0
  Initialize a stemmer. This method is called once after the stemmer is constructed.
 
virtual void  reset (const CodePoint *cp, const unsigned sz, Token::PartOfSpeech pos, Reporter &r)=0
  Get stems for the input word.
 
virtual void  start (Reporter &r)=0
  Start iteration of stems. Normal stemming operations may require the iteration to be performed more than once. Implementations should be prepared to do so.
 
virtual bool  next (Reporter &r)=0
  Advance to the next stem. Return true if there is another stem to fetch.
 
virtual const CodePointString *  stem ()=0
  Return the current stem.
 
virtual bool  delegate () const =0
  Delegate to base stemmer. Return true if the base stemmer should be asked to provide stems for the input string given to the last reset call. By default, if the plugin stemmer returns no stems, stems are taken from the base stemmer if there is one; returning false here disables that fallback and forces self-stemming. Returning true here lets the plugin stemmer obtain additional stems from the base stemmer, if there is one.
 

Protected Member Functions

  StemmerUDF (unsigned version=MARKLOGIC_API_VERSION)
  Construct an object compatible with a specific MarkLogic Native Plugin API version.
 

Detailed Description

Encapsulation of a User Defined Function for performing stemming of individual words.

You must implement a subclass of this class.

When you install a subclass of StemmerUDF as a native plugin, MarkLogic servers can apply your algorithm to perform stemming of individual words. To activate your stemming algorithm for a particular language, you will need to apply the appropriate language customization configuration.

Your stemming algorithm will be applied automatically when content in the configured language is loaded into the database or searched. You can also see the effect of your algorithm from XQuery (cts:stem) or JavaScript (cts.stem).

To use your algorithm to stem words in a particular language:

  • Implement a subclass of StemmerUDF.
  • Implement the registration function markLogicPlugin.
  • Package your subclass and registration function into a native plugin.
  • Deploy the plugin to MarkLogic Server, for example by calling plugin:install-from-zip.
  • Configure the language to use that algorithm, for example by calling lang:language-config-write.

A stemmer should provide the preferred stem, if any, as the first item in the result sequence. This is the stem that basic stemming will use. If no stems are returned in the result sequence, an attempt will be made to ask the base stemmer, unless the delegate method returns false. Without delegation, the input word is assumed to stem to itself.

Since the stemming and tokenization used for content must match that for query strings, it is advisable for stemmers to produce alternate stems for a given string rather than relying on the part of speech to be accurate, as this is unlikely to be the case for short query strings. It is better to use the part of speech to order the stems by preference than to prune them outright, because this allows advanced stemming to provide matches even when the part of speech information in query strings is wrong. Alternatively, use UNSPECIFIED_POS in query strings as a signal to return all alternative stems for all possible parts of speech.
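The advice to order by part of speech rather than prune can be sketched as follows. This is a minimal illustration, not code from MarkLogic.h: the Pos enumeration is a hypothetical stand-in for Token::PartOfSpeech, and Candidate and orderStems are names invented for this example. std::u32string stands in for the API's codepoint strings.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for Token::PartOfSpeech; the real enumerators
// are declared in MarkLogic.h and are not reproduced here.
enum class Pos { Unspecified, Noun, Verb };

// Hypothetical record: one candidate stem and the part of speech it belongs to.
struct Candidate {
    std::u32string stem;
    Pos pos;
};

// Order candidates so that stems matching the hinted part of speech come
// first, but keep all the others as alternates instead of discarding them.
// With Pos::Unspecified the original order is kept unchanged.
std::vector<std::u32string> orderStems(std::vector<Candidate> candidates, Pos hint) {
    if (hint != Pos::Unspecified) {
        std::stable_partition(candidates.begin(), candidates.end(),
                              [hint](const Candidate& c) { return c.pos == hint; });
    }
    std::vector<std::u32string> ordered;
    for (const Candidate& c : candidates) {
        // Drop duplicate stems while preserving the preference order.
        if (std::find(ordered.begin(), ordered.end(), c.stem) == ordered.end()) {
            ordered.push_back(c.stem);
        }
    }
    return ordered;
}
```

The first element of the returned vector plays the role of the preferred stem; the remaining elements stay available to advanced stemming even when the hint is wrong.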

To indicate a contraction, use the character '=' at the contraction point. Example: "don't" => "do=not".

To indicate a compound, use the character '#' at the compound break point. Example: "Kinderplatz" => "Kind#Platz".

Upstream processing will automatically index the separate parts appropriately for the stemmed-searches setting. A word must not produce both contraction and compound stems.
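The rule that a word must not produce both contraction and compound stems could be checked inside a stemmer before the stems are handed back. The helper below is a hypothetical sketch (stemsAreConsistent is not part of the API), again using std::u32string as a stand-in for the API's codepoint strings.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns true if the stems produced for one word
// respect the rule that contraction stems ('=') and compound stems ('#')
// must not be mixed.
bool stemsAreConsistent(const std::vector<std::u32string>& stems) {
    bool sawContraction = false;
    bool sawCompound = false;
    for (const std::u32string& s : stems) {
        if (s.find(U'=') != std::u32string::npos) sawContraction = true;
        if (s.find(U'#') != std::u32string::npos) sawCompound = true;
    }
    return !(sawContraction && sawCompound);
}
```

A stemmer might run such a check in debug builds and log a message through its Reporter when it fails; how to react to a violation is left to the implementation.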

For details, see TBD.

Constructor & Destructor Documentation

◆ StemmerUDF()

marklogic::StemmerUDF::StemmerUDF ( unsigned  version = MARKLOGIC_API_VERSION )
protected

Construct an object compatible with a specific MarkLogic Native Plugin API version.

You should not override the default version number.

MarkLogic Server uses the version to enforce plugin consistency across all hosts in a cluster. The API version against which your plugin is compiled must match the API version supported by the MarkLogic Server instance(s) on which your plugin executes.

For more information, see "Registering an Stemmer UDF" in the Application Developer's Guide.

Member Function Documentation

◆ close()

virtual void marklogic::StemmerUDF::close ( )
pure virtual

Release a StemmerUDF instance.

MarkLogic Server calls this method when this object is no longer needed.

◆ initialize()

virtual void marklogic::StemmerUDF::initialize ( const char *  lang,
int  argc,
const char **  argv,
Reporter &  r 
)
pure virtual

Initialize a stemmer. This method is called once after the stemmer is constructed.

Parameters
lang The language code for the language of the text run, e.g. "fr".
argc A count of the number of arguments in argv.
argv An argument array. Populated from the argument list given when the stemmer is initially configured. It is up to the StemmerUDF implementation to parse and interpret these arguments.
r Mechanism for logging errors and other messages.
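Because the interpretation of argv is left to the implementation, an initialize method needs its own argument convention. The sketch below assumes a "key=value" form with bare flags allowed; that convention, and the name parseStemmerArgs, are assumptions made for this example and are not prescribed by the API.

```cpp
#include <map>
#include <string>

// Hypothetical helper for use inside initialize(): turns the argument
// array into a key/value map. Assumes arguments look like "key=value";
// an argument without '=' is stored as a flag with an empty value.
std::map<std::string, std::string> parseStemmerArgs(int argc, const char** argv) {
    std::map<std::string, std::string> options;
    for (int i = 0; i < argc; ++i) {
        if (argv[i] == nullptr) continue;  // defensive: skip missing entries
        const std::string arg(argv[i]);
        const std::string::size_type eq = arg.find('=');
        if (eq == std::string::npos) {
            options[arg] = "";
        } else {
            options[arg.substr(0, eq)] = arg.substr(eq + 1);
        }
    }
    return options;
}
```

For instance, the arguments "dictionary=fr.dic" and "strict" would yield a map with a dictionary entry and an empty-valued strict flag; both option names are made up for illustration.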

◆ isStale()

virtual bool marklogic::StemmerUDF::isStale ( ) const
pure virtual

Return true if the StemmerUDF instance can no longer be reused.

StemmerUDF objects will be kept in a pool and reused when idle. If a stemmer should not be reused, isStale should return true.

◆ next()

virtual bool marklogic::StemmerUDF::next ( Reporter &  r )
pure virtual

Advance to the next stem. Return true if there is another stem to fetch.

Parameters
r Mechanism for logging errors and other messages.

◆ reset()

virtual void marklogic::StemmerUDF::reset ( const CodePoint *  cp,
const unsigned  sz,
Token::PartOfSpeech  pos,
Reporter &  r 
)
pure virtual

Get stems for the input word.

The input is a pointer to an array of Unicode codepoints (cp) and its length (sz). The result is a sequence of alternative stems, if any, with the preferred stem given first. If no stem is returned and no base stemmer supplies one, the string is taken as its own stem.

◆ start()

virtual void marklogic::StemmerUDF::start ( Reporter &  r )
pure virtual

Start iteration of stems. Normal stemming operations may require the iteration to be performed more than once. Implementations should be prepared to do so.

Parameters
r Mechanism for logging errors and other messages.

◆ stem()

virtual const CodePointString * marklogic::StemmerUDF::stem ( )
pure virtual

Return the current stem.

If there is no current stem, a null pointer should be returned. The StemmerUDF is responsible for managing the deallocation of this pointer, if necessary.
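One way to satisfy the restartable iteration and the pointer ownership described for start, next, and stem is to keep the stems in storage owned by the stemmer and hand out pointers into it. The class below is a sketch under stated assumptions: StemString is a stand-in for CodePointString, StemCursor is a hypothetical helper rather than an API type, and the calling order (start, then next before each stem) is assumed rather than taken from the API documentation.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the API's CodePointString, which is declared in
// MarkLogic.h and not reproduced here.
using StemString = std::u32string;

// Hypothetical helper a StemmerUDF subclass could hold as a member and
// forward its reset/start/next/stem calls to.
class StemCursor {
public:
    // For use from reset(): replace the stems computed for the new word.
    void assign(std::vector<StemString> stems) {
        stems_ = std::move(stems);
        start();
    }

    // For use from start(): rewind so the same stems can be iterated again.
    void start() {
        started_ = false;
        index_ = 0;
    }

    // For use from next(): step forward; true if a stem is now available.
    bool next() {
        if (!started_) {
            started_ = true;
        } else if (index_ < stems_.size()) {
            ++index_;
        }
        return index_ < stems_.size();
    }

    // For use from stem(): pointer into storage owned by this cursor, or
    // nullptr when there is no current stem. The pointer stays valid until
    // the next assign() call, so the caller never has to free it.
    const StemString* stem() const {
        return (started_ && index_ < stems_.size()) ? &stems_[index_] : nullptr;
    }

private:
    std::vector<StemString> stems_;
    std::size_t index_ = 0;
    bool started_ = false;
};
```

Keeping the stems in a member vector means deallocation happens when the next word is assigned or the stemmer is destroyed, which is one simple way for the StemmerUDF to manage the lifetime of the returned pointer.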

