User-Defined Function API 11.0
|
Encapsulation of a User Defined Function for performing stemming of individual words.
#include <MarkLogic.h>
Public Member Functions |
|
virtual void | close ()=0 |
Release a StemmerUDF instance. |
|
virtual bool | isStale () const =0 |
Return true if the StemmerUDF instance can no longer be reused. |
|
virtual void | initialize (const char *lang, int argc, const char **argv, Reporter &r)=0 |
Initialize a stemmer. This method is called once after the stemmer is constructed. |
|
virtual void | reset (const CodePoint *cp, const unsigned sz, Token::PartOfSpeech pos, Reporter &r)=0 |
Get stems for the input word. |
|
virtual void | start (Reporter &r)=0 |
Start iteration of stems. Normal stemming operations may require the iteration to be performed more than once. Implementations should be prepared to do so. |
|
virtual bool | next (Reporter &r)=0 |
Advance to the next stem. Return true if there is another stem to fetch. |
|
virtual const CodePointString * | stem ()=0 |
Return the current stem. |
|
virtual bool | delegate () const =0 |
Delegate to base stemmer. Return true if the base stemmer should be asked to provide stems for the input string given to the last reset call. If the plugin stemmer returns no stems, the default is to get stems from the base stemmer, if there is one; returning false here disables that and forces self-stemming. Returning true here allows the plugin stemmer to get additional stems from the base stemmer, if there is one. |
|
Protected Member Functions |
|
StemmerUDF (unsigned version=MARKLOGIC_API_VERSION) | |
Construct an object compatible with a specific MarkLogic Native Plugin API version.
|
Encapsulation of a User Defined Function for performing stemming of individual words.
You must implement a subclass of this class.
When you install a subclass of StemmerUDF as a native plugin, MarkLogic servers can apply your algorithm to perform stemming of individual words. To activate your stemming algorithm for a particular language, you will need to apply the appropriate language customization configuration.
Your stemming algorithm will be applied automatically when content in the configured language is loaded into the database or searched. You can also see the effect of your algorithm from XQuery (cts:stem) or JavaScript (cts.stem).
To use your algorithm to stem words in a particular language:
markLogicPlugin.
A stemmer should provide the preferred stem, if any, as the first item in the result sequence. This is the stem that basic stemming will use. If no stems are returned in the result array, an attempt will be made to ask the base stemmer, unless the delegate method returns false. Without delegation, the input word is assumed to stem to itself. Since the stemming and tokenization used for content must match that for query strings, it is advisable for stemmers to produce alternate stems for a given string rather than relying on the part of speech to be accurate, as this is unlikely to be the case for short query strings. It is better to use the part of speech for ordering of preferred strings rather than for outright pruning, as it allows advanced stemming to provide matches even when part of speech information in query strings is wrong. Alternatively, use UNSPECIFIED_POS in query strings as a signal to return all alternative stems for all possible parts of speech.
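The advice to order by part of speech instead of pruning can be sketched roughly as follows. This is a self-contained illustration, not code from the MarkLogic SDK: `Pos` is a stand-in for Token::PartOfSpeech (only UNSPECIFIED_POS is named above, so the other values are assumptions), and `Candidate` and `orderStems` are hypothetical names. Plain std::string is used in place of code point strings for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Stand-in for Token::PartOfSpeech. Noun and Verb are assumed values
// used only for this illustration.
enum class Pos { Unspecified, Noun, Verb };

// Hypothetical dictionary entry: a stem and the part of speech it applies to.
struct Candidate {
  std::string stem;
  Pos pos;
};

// Move candidates matching the hinted part of speech to the front, but keep
// every other candidate so that a wrong hint cannot hide a valid stem.
std::vector<std::string> orderStems(std::vector<Candidate> candidates, Pos hint) {
  if (hint != Pos::Unspecified) {
    std::stable_partition(candidates.begin(), candidates.end(),
                          [hint](const Candidate& c) { return c.pos == hint; });
  }
  std::vector<std::string> ordered;
  for (const Candidate& c : candidates) {
    // Skip duplicates that arise when one stem is valid for several parts of speech.
    if (std::find(ordered.begin(), ordered.end(), c.stem) == ordered.end()) {
      ordered.push_back(c.stem);
    }
  }
  assert(ordered.size() <= candidates.size());
  return ordered;
}
```

With candidates {"saw", Noun} and {"see", Verb}, a Verb hint puts "see" first while "saw" remains available as an alternative; an unspecified hint leaves the dictionary order untouched.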
To indicate a contraction, use the character '=' at the contraction point. Example: "don't" => "do=not".
To indicate a compound, use the character '#' at the compound break point. Example: "Kinderplatz" => "Kind#Platz".
Upstream processing will automatically index the separate parts appropriately for the stemmed-searches setting. A word must not produce both contraction and compound stems.
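The marker convention can be checked before stems are handed back. The sketch below is only an illustration under stated assumptions: `StemKind`, `classifyStem`, and `stemsAreConsistent` are hypothetical helpers, std::string stands in for code point strings, and the sketch assumes a single stem never carries both markers.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class StemKind { Simple, Contraction, Compound };

// Classify one stem by the marker it contains ('=' or '#').
StemKind classifyStem(const std::string& stem) {
  const bool contraction = stem.find('=') != std::string::npos;
  const bool compound = stem.find('#') != std::string::npos;
  // Assumption of this sketch: one stem carries at most one kind of marker.
  assert(!(contraction && compound));
  if (contraction) return StemKind::Contraction;
  if (compound) return StemKind::Compound;
  return StemKind::Simple;
}

// Return false if the stems for one word mix contraction and compound forms,
// which the text above says a word must not do.
bool stemsAreConsistent(const std::vector<std::string>& stems) {
  bool sawContraction = false;
  bool sawCompound = false;
  for (const std::string& s : stems) {
    const StemKind kind = classifyStem(s);
    sawContraction = sawContraction || kind == StemKind::Contraction;
    sawCompound = sawCompound || kind == StemKind::Compound;
  }
  return !(sawContraction && sawCompound);
}
```

A stemmer could run such a check in debug builds on the stems gathered in reset and report a violation through the Reporter.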
For details, see TBD.
|
StemmerUDF (unsigned version=MARKLOGIC_API_VERSION) | protected |
Construct an object compatible with a specific MarkLogic Native Plugin API version.
You should not override the default version number.
MarkLogic Server uses the version to enforce plugin consistency across all hosts in a cluster. The API version against which your plugin is compiled must match the API version supported by the MarkLogic Server instance(s) on which your plugin executes.
For more information, see "Registering an Stemmer UDF" in the Application Developer's Guide.
|
virtual void close () | pure virtual |
Release a StemmerUDF instance.
MarkLogic Server calls this method when this object is no longer needed.
|
virtual void initialize (const char *lang, int argc, const char **argv, Reporter &r) | pure virtual |
Initialize a stemmer. This method is called once after the stemmer is constructed.
lang | The language code for the language of the text run, e.g. "fr". |
argc | A count of the number of arguments in argv. |
argv | An argument array. Populated from the argument list given when the stemmer is initially configured. It is up to the StemmerUDF implementation to parse and interpret these arguments. |
r | Mechanism for logging errors and other messages. |
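Because the interpretation of argv is left to the implementation, each stemmer needs its own parsing convention. The following is a minimal sketch assuming a hypothetical "name=value" convention; neither that format nor the `parseArgs` helper comes from the MarkLogic API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Parse arguments of the assumed form "name=value" into a map.
// An argument without '=' is stored as a flag with an empty value.
std::map<std::string, std::string> parseArgs(int argc, const char** argv) {
  assert(argc == 0 || argv != nullptr);
  std::map<std::string, std::string> options;
  for (int i = 0; i < argc; ++i) {
    const std::string arg(argv[i]);
    const std::string::size_type eq = arg.find('=');
    if (eq == std::string::npos) {
      options[arg] = "";
    } else {
      options[arg.substr(0, eq)] = arg.substr(eq + 1);
    }
  }
  return options;
}
```

An initialize implementation could call such a helper once and keep the resulting map as a member, reporting unknown option names through the Reporter.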
|
virtual bool isStale () const | pure virtual |
Return true if the StemmerUDF instance can no longer be reused.
StemmerUDF objects will be kept in a pool and reused when idle. If a stemmer should not be reused, isStale should return true.
|
virtual bool next (Reporter &r) | pure virtual |
Advance to the next stem. Return true if there is another stem to fetch.
r | Mechanism for logging errors and other messages. |
|
virtual void reset (const CodePoint *cp, const unsigned sz, Token::PartOfSpeech pos, Reporter &r) | pure virtual |
Get stems for the input word.
The input is a pointer to an array of Unicode codepoints and its length. The result is a sequence of alternative stems, if any, with the preferred stem given first. If no stem is returned, the string is taken as its own stem.
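As an illustration of working with the pointer-plus-length input, here is a deliberately naive sketch that copies the code points and strips a trailing "s". Everything in it is an assumption for demonstration: `CodePoint` is aliased to char32_t as a stand-in for the typedef in MarkLogic.h, `toyStems` is a hypothetical helper, and the rule is a toy, not a real stemming algorithm.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the CodePoint type declared in MarkLogic.h (assumption).
using CodePoint = char32_t;

// Toy rule: words longer than three code points that end in 's' stem to the
// word without the final 's'. Returns no stems otherwise, in which case the
// input would be taken as its own stem.
std::vector<std::u32string> toyStems(const CodePoint* cp, unsigned sz) {
  assert(cp != nullptr || sz == 0);
  // Copy the input so the stems do not depend on the caller's buffer.
  const std::u32string word = sz ? std::u32string(cp, sz) : std::u32string();
  std::vector<std::u32string> stems;
  if (word.size() > 3 && word.back() == U's') {
    stems.push_back(word.substr(0, word.size() - 1));
  }
  return stems;
}
```

A reset implementation could compute such a list once per input word and hold it for the start, next, and stem calls that follow.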
|
virtual void start (Reporter &r) | pure virtual |
Start iteration of stems. Normal stemming operations may require the iteration to be performed more than once. Implementations should be prepared to do so.
r | Mechanism for logging errors and other messages. |
|
virtual const CodePointString * stem () | pure virtual |
Return the current stem.
If there is no current stem, a null pointer should be returned. The StemmerUDF is responsible for managing the deallocation of this pointer, if necessary.
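The start, next, and stem methods together form a restartable cursor over the stems produced by the last reset. The sketch below shows one way to keep that state, under two assumptions that this page does not confirm: that the caller invokes next before reading the first stem, and that std::u32string can stand in for CodePointString. `StemCursor` is a hypothetical helper, not part of the API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper that a StemmerUDF subclass could embed as a member.
class StemCursor {
public:
  // Replace the stored stems (as reset would) and rewind.
  void reset(std::vector<std::u32string> stems) {
    stems_ = std::move(stems);
    start();
  }

  // Rewind so the same stems can be iterated again.
  void start() {
    nextIndex_ = 0;
    current_ = nullptr;
  }

  // Advance; returns true while a stem is available.
  bool next() {
    assert(nextIndex_ <= stems_.size());
    if (nextIndex_ < stems_.size()) {
      current_ = &stems_[nextIndex_++];
      return true;
    }
    current_ = nullptr;
    return false;
  }

  // Current stem, or nullptr if there is none. The pointer refers to storage
  // owned by this object and stays valid until the next reset.
  const std::u32string* stem() const { return current_; }

private:
  std::vector<std::u32string> stems_;
  std::size_t nextIndex_ = 0;
  const std::u32string* current_ = nullptr;
};
```

Because the stems live in a member container, the stemmer keeps ownership of the returned pointer, and calling start again replays the same sequence without recomputing it.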