User-Defined Function API 11.0
Encapsulation of a User-Defined Function for performing tokenization of text runs.
#include <MarkLogic.h>
Public Member Functions

virtual void close()=0
    Release a LexerUDF instance.

virtual bool isStale() const =0
    Return true if the LexerUDF instance can no longer be reused.

virtual void initialize(const char *lang, int argc, const char **argv, Reporter &r)=0
    Initialize a lexer. This method is called once after the lexer is constructed.

virtual void reset(const CodePoint *cp, unsigned sz, Reporter &r)=0
    Initiate a new tokenization episode on a new text run.

virtual void finish(Reporter &r)=0
    Clean up from tokenizing a text run.

virtual bool next(Reporter &r)=0
    Advance to the next token. Return true if there is another token to fetch.

virtual const Token *token() const =0
    Return the current token.

Protected Member Functions

LexerUDF(unsigned version=MARKLOGIC_API_VERSION)
    Construct an object compatible with a specific MarkLogic Native Plugin API version.
Encapsulation of a User-Defined Function for performing tokenization of text runs.
You must implement a subclass of this class.
When you install a subclass of LexerUDF as a native plugin, MarkLogic servers can apply your algorithm to perform tokenization of a text run. To activate your tokenization algorithm for a particular language, you will need to apply the appropriate language customization configuration.
Your tokenization algorithm will be applied automatically when content in the configured language is loaded into the database or searched. You can also see the effect of your algorithm from XQuery (cts:tokenize) or JavaScript (cts.tokenize).
To use your algorithm to tokenize content in a particular language, register your implementation in markLogicPlugin.

A tokenizer is expected to provide a complete partition of the input array, with no gaps, overlaps, or changes to the content. The contents of documents are stored in tokenized form, so failure to abide by this rule may cause document contents to be changed.
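The "complete partition" rule above can be checked mechanically. Below is a minimal sketch of such a check; the `Span` struct is a hypothetical stand-in for the real Token type declared in MarkLogic.h, used here only so the example is self-contained.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the real Token type from MarkLogic.h:
// a token's extent within the input codepoint array.
struct Span {
    unsigned start;  // offset of the first codepoint covered
    unsigned length; // number of codepoints covered
};

// Return true if the token sequence forms a complete partition of an
// input of `sz` codepoints: no gaps, no overlaps, nothing past the end.
bool isCompletePartition(const std::vector<Span>& tokens, unsigned sz) {
    unsigned pos = 0;
    for (const Span& t : tokens) {
        if (t.start != pos) return false; // gap or overlap at `pos`
        pos += t.length;
    }
    return pos == sz; // tokens must end exactly at the input length
}
```

A lexer whose output fails such a check risks silently corrupting stored document content, since documents are stored in tokenized form.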
Tokens may have an optional part of speech, for implementations that need to provide extra information to their stemmer. It is generally preferable to produce multiple alternative stems rather than attempt to be overly precise: tokenization and stemming of query strings needs to be able to produce matches for content, and part of speech information is unlikely to be accurate for short query strings.
For details, see TBD.
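To make the call protocol concrete, here is a minimal sketch of a whitespace lexer that follows the documented sequence: initialize once, then reset/next/token/finish per text run. The CodePoint, Token, and Reporter definitions below are simplified stand-ins, not the real declarations from MarkLogic.h; an actual plugin would subclass LexerUDF from that header instead.

```cpp
#include <cassert>
#include <vector>

// --- Stand-ins for types declared in MarkLogic.h (assumptions, not the real API) ---
typedef unsigned int CodePoint;
struct Token { unsigned start; unsigned length; };
struct Reporter { /* logging omitted in this sketch */ };

// A toy lexer that splits on ASCII spaces. Whitespace runs are emitted
// as tokens too, so the output is a complete partition of the input.
class WhitespaceLexer {
public:
    void initialize(const char* /*lang*/, int /*argc*/, const char** /*argv*/, Reporter&) {}

    void reset(const CodePoint* cp, unsigned sz, Reporter&) {
        cp_ = cp; sz_ = sz; pos_ = 0; hasToken_ = false;
    }

    // Advance to the next token; return true while one is available.
    bool next(Reporter&) {
        if (pos_ >= sz_) { hasToken_ = false; return false; }
        unsigned start = pos_;
        bool space = (cp_[pos_] == ' ');
        while (pos_ < sz_ && (cp_[pos_] == ' ') == space) ++pos_;
        tok_.start = start;
        tok_.length = pos_ - start;
        hasToken_ = true;
        return true;
    }

    // The lexer owns the returned Token; callers must not free it.
    const Token* token() const { return hasToken_ ? &tok_ : nullptr; }

    void finish(Reporter&) { hasToken_ = false; }

private:
    const CodePoint* cp_ = nullptr;
    unsigned sz_ = 0, pos_ = 0;
    Token tok_{};
    bool hasToken_ = false;
};

// Drive one tokenization episode and collect the token spans.
std::vector<Token> runEpisode(WhitespaceLexer& lx, const std::vector<CodePoint>& text) {
    Reporter r;
    lx.reset(text.data(), static_cast<unsigned>(text.size()), r);
    std::vector<Token> out;
    while (lx.next(r)) out.push_back(*lx.token());
    lx.finish(r);
    return out;
}
```

The driver loop in runEpisode mirrors how the server consumes a lexer: one reset per text run, next/token until next returns false, then finish.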
LexerUDF::LexerUDF(unsigned version=MARKLOGIC_API_VERSION) [protected]
Construct an object compatible with a specific MarkLogic Native Plugin API version.
You should not override the default version number.
MarkLogic Server uses the version to enforce plugin consistency across all hosts in a cluster. The API version against which your plugin is compiled must match the API version supported by the MarkLogic Server instance(s) on which your plugin executes.
For more information, see "Registering a Lexer UDF" in the Application Developer's Guide.
virtual void LexerUDF::close() [pure virtual]
Release a LexerUDF instance.
MarkLogic server calls this method when this object is no longer needed.
virtual void LexerUDF::finish(Reporter &r) [pure virtual]
Clean up from tokenizing a text run.
This method is called after the last token has been obtained from this text run.
r: Mechanism for logging errors and other messages.
virtual void LexerUDF::initialize(const char *lang, int argc, const char **argv, Reporter &r) [pure virtual]
Initialize a lexer. This method is called once after the lexer is constructed.
lang: The language code for the language of the text run, e.g. "fr".
argc: A count of the number of arguments in argv.
argv: An argument array, populated from the argument list given when the lexer is initially configured. It is up to the LexerUDF implementation to parse and interpret these arguments.
r: Mechanism for logging errors and other messages.
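Since interpretation of argv is left entirely to the implementation, one common convention is to pass key=value pairs. A minimal sketch of such parsing follows; the option names in the test are invented, and this is only one possible convention, not something the API mandates.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

// Parse "key=value" plugin arguments into a map; bare flags map to "".
// The (argc, argv) pair has the same shape as the arguments passed to
// LexerUDF::initialize.
std::map<std::string, std::string> parseArgs(int argc, const char** argv) {
    std::map<std::string, std::string> opts;
    for (int i = 0; i < argc; ++i) {
        const char* eq = std::strchr(argv[i], '=');
        if (eq)
            opts[std::string(argv[i], eq)] = std::string(eq + 1); // split at '='
        else
            opts[argv[i]] = ""; // bare flag
    }
    return opts;
}
```

An initialize implementation might call such a helper once and keep the resulting options as member state for later episodes.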
virtual bool LexerUDF::isStale() const [pure virtual]
Return true if the LexerUDF instance can no longer be reused.
virtual bool LexerUDF::next(Reporter &r) [pure virtual]
Advance to the next token. Return true if there is another token to fetch.
r: Mechanism for logging errors and other messages.
virtual void LexerUDF::reset(const CodePoint *cp, unsigned sz, Reporter &r) [pure virtual]
Initiate a new tokenization episode on a new text run.
The input is a pointer to an array of Unicode codepoints, together with its length. Token offsets are offsets into this array.
cp: Pointer to the start of the codepoint array.
sz: Length of the codepoint array.
r: Mechanism for logging errors and other messages.
virtual const Token *LexerUDF::token() const [pure virtual]
Return the current token.
If there is no current token, a null pointer should be returned. The LexerUDF is responsible for managing the deallocation of this pointer, if necessary.