User-Defined Function API  9.0
marklogic::LexerUDF Class Reference (abstract)

Encapsulation of a User Defined Function for performing tokenization of text runs.

#include <MarkLogic.h>

Public Member Functions

virtual void  close ()=0
  Release a LexerUDF instance.

virtual bool  isStale () const =0
  Return true if the LexerUDF instance can no longer be reused.

virtual void  initialize (const char *lang, int argc, const char **argv, Reporter &r)=0
  Initialize a lexer. This method is called once after the lexer is constructed.

virtual void  reset (const CodePoint *cp, unsigned sz, Reporter &r)=0
  Initiate a new tokenization episode on a new text run.

virtual void  finish (Reporter &r)=0
  Clean up from tokenizing a text run.

virtual bool  next (Reporter &r)=0
  Advance to the next token. Return true if there is another token to fetch.

virtual const Token *  token () const =0
  Return the current token.
 

Protected Member Functions

  LexerUDF (unsigned version=MARKLOGIC_API_VERSION)
  Construct an object compatible with a specific MarkLogic Native Plugin API version.
 

Detailed Description

Encapsulation of a User Defined Function for performing tokenization of text runs.

You must implement a subclass of this class.

When you install a subclass of LexerUDF as a native plugin, MarkLogic servers can apply your algorithm to perform tokenization of a text run. To activate your tokenization algorithm for a particular language, you will need to apply the appropriate language customization configuration.

Your tokenization algorithm will be applied automatically when content in the configured language is loaded into the database or searched. You can also see the effect of your algorithm from XQuery (cts:tokenize) or JavaScript (cts.tokenize).

To use your algorithm to tokenize content in a particular language:

  • Implement a subclass of LexerUDF.
  • Implement the registration function markLogicPlugin.
  • Package your subclass and registration function into a native plugin.
  • Deploy the plugin to MarkLogic Server, for example by calling plugin:install-from-zip.
  • Configure the language to use that algorithm, for example by calling lang:language-config-write.

A tokenizer is expected to provide a complete partition of the input array, with no gaps, overlaps, or changes to the content. The contents of documents are stored in tokenized form, so failure to abide by this rule may cause document contents to be changed.
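
The partition rule above can be checked mechanically while developing a lexer. The following is a minimal sketch, not part of the MarkLogic API: `Span` and `isCompletePartition` are hypothetical names, and `Span` only stands in for the offset and length a token would carry (the real `marklogic::Token` layout is not shown in this reference). It also assumes that empty tokens are not allowed, which the text above does not state explicitly.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a token's extent in the input array.
// This is NOT the real marklogic::Token type.
struct Span {
  std::size_t offset;  // index of the token's first codepoint
  std::size_t length;  // number of codepoints in the token
};

// Returns true when the spans, taken in order, cover [0, inputLength)
// exactly: no gaps, no overlaps, and nothing past the end of the input.
// Assumption: zero-length tokens are rejected.
bool isCompletePartition(const std::vector<Span>& spans,
                         std::size_t inputLength) {
  std::size_t expected = 0;  // where the next token must start
  for (const Span& s : spans) {
    if (s.offset != expected) return false;               // gap or overlap
    if (s.length == 0) return false;                      // empty token
    if (s.length > inputLength - expected) return false;  // runs past the end
    expected += s.length;
  }
  return expected == inputLength;  // the tail of the input must be covered too
}
```

A check like this could run in a unit test that feeds sample text runs to the lexer and collects the offset and length of every token it produces.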

Tokens may have an optional part of speech, for implementations that need to provide extra information to their stemmer. It is generally preferable to produce multiple alternative stems rather than attempt to be overly precise: tokenization and stemming of query strings need to be able to produce matches for content, and part of speech information is unlikely to be accurate for short query strings.

For details, see TBD.

Constructor & Destructor Documentation

marklogic::LexerUDF::LexerUDF ( unsigned  version = MARKLOGIC_API_VERSION )
protected

Construct an object compatible with a specific MarkLogic Native Plugin API version.

You should not override the default version number.

MarkLogic Server uses the version to enforce plugin consistency across all hosts in a cluster. The API version against which your plugin is compiled must match the API version supported by the MarkLogic Server instance(s) on which your plugin executes.

For more information, see "Registering an Lexer UDF" in the Application Developer's Guide.

Member Function Documentation

virtual void marklogic::LexerUDF::close ( )
pure virtual

Release a LexerUDF instance.

MarkLogic Server calls this method when this object is no longer needed.

virtual void marklogic::LexerUDF::finish ( Reporter & r )
pure virtual

Clean up from tokenizing a text run.

This method is called after the last token is obtained from this text run. After this method is called, the first token should be available.

Parameters
r Mechanism for logging errors and other messages.

virtual void marklogic::LexerUDF::initialize ( const char * lang, int argc, const char ** argv, Reporter & r )
pure virtual

Initialize a lexer. This method is called once after the lexer is constructed.

Parameters
lang The language code for the language of the text run, e.g. "fr".
argc A count of the number of arguments in argv.
argv An argument array. Populated from the argument list given when the lexer is initially configured. It is up to the LexerUDF implementation to parse and interpret these arguments.
r Mechanism for logging errors and other messages.
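
Because the interpretation of argv is left to the implementation, one possible approach is to treat each argument as a `key=value` option. The sketch below is an illustration only: the `key=value` convention, the option names, and the helper `parseLexerArgs` are assumptions, not part of the MarkLogic API. A subclass could call such a helper from its initialize method.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

// Hypothetical helper: parse "key=value" arguments into a map.
// An argument without '=' is stored as a flag with an empty value.
// Only the first '=' splits, so "sep=a=b" yields key "sep", value "a=b".
std::map<std::string, std::string> parseLexerArgs(int argc, const char** argv) {
  assert(argc <= 0 || argv != nullptr);  // a non-empty list needs an array
  std::map<std::string, std::string> options;
  for (int i = 0; i < argc; ++i) {
    if (argv[i] == nullptr) continue;  // skip missing entries defensively
    const char* eq = std::strchr(argv[i], '=');
    if (eq == nullptr) {
      options[argv[i]] = "";  // bare flag
    } else {
      options[std::string(argv[i], eq)] = std::string(eq + 1);
    }
  }
  return options;
}
```

Copying the arguments into owned strings avoids relying on how long the argv storage remains valid after initialize returns, which this reference does not specify.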

virtual bool marklogic::LexerUDF::isStale ( ) const
pure virtual

Return true if the LexerUDF instance can no longer be reused.

LexerUDF objects will be kept in a pool and reused when idle. If a lexer should not be reused, isStale should return true.
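
To illustrate how a pool could consult isStale, here is a minimal sketch. It does not describe MarkLogic's actual pooling code, which is not documented here: `PooledLexer`, `LexerPool`, and the `failed` flag are hypothetical stand-ins, and the sketch assumes the staleness check happens when an instance is returned to the pool.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a LexerUDF subclass. Here an instance reports
// itself as stale after an unrecoverable failure.
struct PooledLexer {
  bool failed = false;
  bool closed = false;
  bool isStale() const { return failed; }
  void close() { closed = true; }
};

// Hypothetical pool: idle instances are reused, stale ones are closed
// and dropped instead of being returned to the idle list.
class LexerPool {
public:
  std::unique_ptr<PooledLexer> acquire() {
    if (idle_.empty()) return std::make_unique<PooledLexer>();
    std::unique_ptr<PooledLexer> lexer = std::move(idle_.back());
    idle_.pop_back();
    return lexer;
  }

  void release(std::unique_ptr<PooledLexer> lexer) {
    assert(lexer != nullptr);
    if (lexer->isStale()) {
      lexer->close();  // never reused; destroyed when this function returns
      return;
    }
    idle_.push_back(std::move(lexer));
  }

  std::size_t idleCount() const { return idle_.size(); }

private:
  std::vector<std::unique_ptr<PooledLexer>> idle_;
};
```

For an implementer, the practical consequence is that isStale should be cheap and should only return true when the instance holds state that makes reuse unsafe.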

virtual bool marklogic::LexerUDF::next ( Reporter & r )
pure virtual

Advance to the next token. Return true if there is another token to fetch.

Parameters
r Mechanism for logging errors and other messages.

virtual void marklogic::LexerUDF::reset ( const CodePoint * cp, unsigned sz, Reporter & r )
pure virtual

Initiate a new tokenization episode on a new text run.

The input is a pointer to an array of Unicode codepoints and the length of that array. Token offsets will be offsets into this array.

Parameters
cp Pointer to start of codepoint array.
sz Length of codepoint array.
r Mechanism for logging errors and other messages.

virtual const Token* marklogic::LexerUDF::token ( ) const
pure virtual

Return the current token.

If there is no current token, a null pointer should be returned. The LexerUDF is responsible for managing the deallocation of this pointer, if necessary.
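
The following sketch shows one way the reset, next, token, and finish calls could fit together, including the ownership rule for the returned pointer. It is self-contained and does not include MarkLogic.h: the `sketch` namespace, its `CodePoint` alias, its `Token` struct, and `WhitespaceLexer` are hypothetical stand-ins, the Reporter parameter is omitted, and the class does not derive from LexerUDF. It also assumes the caller invokes next before reading the first token.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

using CodePoint = std::uint32_t;  // assumption: stand-in for marklogic::CodePoint

struct Token {      // hypothetical layout, not the real marklogic::Token
  unsigned offset;  // index into the array passed to reset()
  unsigned length;  // number of codepoints in the token
  bool isSpace;     // true for a run of whitespace
};

// Splits a text run into alternating whitespace and non-whitespace runs,
// so the tokens cover the input with no gaps or overlaps.
class WhitespaceLexer {
public:
  void reset(const CodePoint* cp, unsigned sz) {
    assert(cp != nullptr || sz == 0);
    cp_ = cp;
    size_ = sz;
    pos_ = 0;
    hasToken_ = false;
  }

  bool next() {
    if (cp_ == nullptr || pos_ >= size_) {
      hasToken_ = false;  // token() now returns a null pointer
      return false;
    }
    const unsigned start = pos_;
    const bool space = isSpace(cp_[pos_]);
    while (pos_ < size_ && isSpace(cp_[pos_]) == space) ++pos_;
    current_ = Token{start, pos_ - start, space};
    hasToken_ = true;
    return true;
  }

  // The token lives inside the lexer, so the caller never frees it.
  // The pointer is only meaningful until the next call to next().
  const Token* token() const { return hasToken_ ? &current_ : nullptr; }

  void finish() {
    cp_ = nullptr;
    size_ = 0;
    pos_ = 0;
    hasToken_ = false;
  }

private:
  static bool isSpace(CodePoint c) {
    return c == 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
  }

  const CodePoint* cp_ = nullptr;
  unsigned size_ = 0;
  unsigned pos_ = 0;
  bool hasToken_ = false;
  Token current_{0, 0, false};
};

}  // namespace sketch
```

Returning a pointer to a member avoids per-token allocation and satisfies the rule that the lexer manages the lifetime of the returned token.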

