User-Defined Function API 11.0
Encapsulation of a User-Defined Function for performing tokenization of text runs.
#include <MarkLogic.h>
Public Member Functions

virtual void close()=0
    Release a LexerUDF instance.

virtual bool isStale() const =0
    Return true if the LexerUDF instance can no longer be reused.

virtual void initialize(const char *lang, int argc, const char **argv, Reporter &r)=0
    Initialize a lexer. This method is called once after the lexer is constructed.

virtual void reset(const CodePoint *cp, unsigned sz, Reporter &r)=0
    Initiate a new tokenization episode on a new text run.

virtual void finish(Reporter &r)=0
    Clean up from tokenizing a text run.

virtual bool next(Reporter &r)=0
    Advance to the next token. Return true if there is another token to fetch.

virtual const Token *token() const =0
    Return the current token.

Protected Member Functions

LexerUDF(unsigned version=MARKLOGIC_API_VERSION)
    Construct an object compatible with a specific MarkLogic Native Plugin API version.
Encapsulation of a User-Defined Function for performing tokenization of text runs.
You must implement a subclass of this class.
When you install a subclass of LexerUDF as a native plugin, MarkLogic servers can apply your algorithm to perform tokenization of a text run. To activate your tokenization algorithm for a particular language, you will need to apply the appropriate language customization configuration.
Your tokenization algorithm will be applied automatically when content in the configured language is loaded into the database or searched. You can also see the effect of your algorithm from XQuery (cts:tokenize) or JavaScript (cts.tokenize).
To use your algorithm to tokenize content in a particular language, register your implementation in markLogicPlugin.

A tokenizer is expected to provide a complete partition of the input array, with no gaps, overlaps, or changes to the content. The contents of documents are stored in tokenized form, so failure to abide by this rule may cause document contents to be changed.
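The "complete partition" rule above can be checked mechanically. Below is a minimal sketch of such a check; the `Span` struct is a hypothetical stand-in for the real Token type declared in MarkLogic.h, used here only so the example is self-contained.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the real Token type from MarkLogic.h:
// a token's extent within the input codepoint array.
struct Span {
    unsigned start;  // offset of the first codepoint covered
    unsigned length; // number of codepoints covered
};

// Return true if the token sequence forms a complete partition of an
// input of `sz` codepoints: no gaps, no overlaps, nothing past the end.
bool isCompletePartition(const std::vector<Span>& tokens, unsigned sz) {
    unsigned pos = 0;
    for (const Span& t : tokens) {
        if (t.start != pos) return false; // gap or overlap at `pos`
        pos += t.length;
    }
    return pos == sz; // tokens must end exactly at the input length
}
```

A lexer whose output fails such a check risks silently corrupting stored document content, since documents are stored in tokenized form.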
Tokens may have an optional part of speech, for implementations that need to provide extra information to their stemmer. It is generally preferable to produce multiple alternative stems rather than attempt to be overly precise: tokenization and stemming of query strings needs to be able to produce matches for content, and part of speech information is unlikely to be accurate for short query strings.
For details, see TBD.
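To make the call protocol concrete, here is a minimal sketch of a whitespace lexer that follows the documented sequence: initialize once, then reset/next/token/finish per text run. The CodePoint, Token, and Reporter definitions below are simplified stand-ins, not the real declarations from MarkLogic.h; an actual plugin would subclass LexerUDF from that header instead.

```cpp
#include <cassert>
#include <vector>

// --- Stand-ins for types declared in MarkLogic.h (assumptions, not the real API) ---
typedef unsigned int CodePoint;
struct Token { unsigned start; unsigned length; };
struct Reporter { /* logging omitted in this sketch */ };

// A toy lexer that splits on ASCII spaces. Whitespace runs are emitted
// as tokens too, so the output is a complete partition of the input.
class WhitespaceLexer {
public:
    void initialize(const char* /*lang*/, int /*argc*/, const char** /*argv*/, Reporter&) {}

    void reset(const CodePoint* cp, unsigned sz, Reporter&) {
        cp_ = cp; sz_ = sz; pos_ = 0; hasToken_ = false;
    }

    // Advance to the next token; return true while one is available.
    bool next(Reporter&) {
        if (pos_ >= sz_) { hasToken_ = false; return false; }
        unsigned start = pos_;
        bool space = (cp_[pos_] == ' ');
        while (pos_ < sz_ && (cp_[pos_] == ' ') == space) ++pos_;
        tok_.start = start;
        tok_.length = pos_ - start;
        hasToken_ = true;
        return true;
    }

    // The lexer owns the returned Token; callers must not free it.
    const Token* token() const { return hasToken_ ? &tok_ : nullptr; }

    void finish(Reporter&) { hasToken_ = false; }

private:
    const CodePoint* cp_ = nullptr;
    unsigned sz_ = 0, pos_ = 0;
    Token tok_{};
    bool hasToken_ = false;
};

// Drive one tokenization episode and collect the token spans.
std::vector<Token> runEpisode(WhitespaceLexer& lx, const std::vector<CodePoint>& text) {
    Reporter r;
    lx.reset(text.data(), static_cast<unsigned>(text.size()), r);
    std::vector<Token> out;
    while (lx.next(r)) out.push_back(*lx.token());
    lx.finish(r);
    return out;
}
```

The driver loop in runEpisode mirrors how the server consumes a lexer: one reset per text run, next/token until next returns false, then finish.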
LexerUDF::LexerUDF(unsigned version=MARKLOGIC_API_VERSION) [protected]
Construct an object compatible with a specific MarkLogic Native Plugin API version.
You should not override the default version number.
MarkLogic Server uses the version to enforce plugin consistency across all hosts in a cluster. The API version against which your plugin is compiled must match the API version supported by the MarkLogic Server instance(s) on which your plugin executes.
For more information, see "Registering a Lexer UDF" in the Application Developer's Guide.
virtual void LexerUDF::close() [pure virtual]
Release a LexerUDF instance.
MarkLogic server calls this method when this object is no longer needed.
virtual void LexerUDF::finish(Reporter &r) [pure virtual]
Clean up from tokenizing a text run.
This method is called after the last token has been obtained from this text run.
r: Mechanism for logging errors and other messages.
virtual void LexerUDF::initialize(const char *lang, int argc, const char **argv, Reporter &r) [pure virtual]
Initialize a lexer. This method is called once after the lexer is constructed.
lang: The language code for the language of the text run, e.g. "fr".
argc: A count of the number of arguments in argv.
argv: An argument array, populated from the argument list given when the lexer is initially configured. It is up to the LexerUDF implementation to parse and interpret these arguments.
r: Mechanism for logging errors and other messages.
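Since interpretation of argv is left entirely to the implementation, one common convention is to pass key=value pairs. A minimal sketch of such parsing follows; the option names in the test are invented, and this is only one possible convention, not something the API mandates.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

// Parse "key=value" plugin arguments into a map; bare flags map to "".
// The (argc, argv) pair has the same shape as the arguments passed to
// LexerUDF::initialize.
std::map<std::string, std::string> parseArgs(int argc, const char** argv) {
    std::map<std::string, std::string> opts;
    for (int i = 0; i < argc; ++i) {
        const char* eq = std::strchr(argv[i], '=');
        if (eq)
            opts[std::string(argv[i], eq)] = std::string(eq + 1); // split at '='
        else
            opts[argv[i]] = ""; // bare flag
    }
    return opts;
}
```

An initialize implementation might call such a helper once and keep the resulting options as member state for later episodes.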
virtual bool LexerUDF::isStale() const [pure virtual]
Return true if the LexerUDF instance can no longer be reused.
virtual bool LexerUDF::next(Reporter &r) [pure virtual]
Advance to the next token. Return true if there is another token to fetch.
r: Mechanism for logging errors and other messages.
virtual void LexerUDF::reset(const CodePoint *cp, unsigned sz, Reporter &r) [pure virtual]
Initiate a new tokenization episode on a new text run.
The input is a pointer to an array of Unicode codepoints, together with its length. Token offsets are offsets into this array.
cp: Pointer to the start of the codepoint array.
sz: Length of the codepoint array.
r: Mechanism for logging errors and other messages.
virtual const Token *LexerUDF::token() const [pure virtual]
Return the current token.
If there is no current token, a null pointer should be returned. The LexerUDF is responsible for managing the deallocation of this pointer, if necessary.