User-Defined Function API 11.0
C++ representation of the MarkLogic token.
#include <MarkLogic.h>
Public Types |
typedef unsigned char | TokenType |
The kind of token being returned. |
typedef unsigned char | PartOfSpeech |
The part of speech of the token being returned. Part of speech is only relevant for word tokens. |
Public Member Functions |
Token (unsigned _begin, unsigned _end, TokenType _type, PartOfSpeech _pos) | |
Token (unsigned _begin, unsigned _end, TokenType _type) | |
Token (const Token &t) | |
Public Attributes |
unsigned | begin |
unsigned | end |
TokenType | type |
PartOfSpeech | pos |
Static Public Attributes |
static const TokenType | SPACE = 's' |
static const TokenType | PUNCT = 'p' |
static const TokenType | WORD = 'w' |
static const TokenType | SPECIAL = 'x' |
static const PartOfSpeech | UNSPECIFIED_POS = '\0' |
static const PartOfSpeech | NOUN_POS = 'n' |
static const PartOfSpeech | VERB_POS = 'v' |
static const PartOfSpeech | ADJECTIVE_POS = 'a' |
static const PartOfSpeech | ADVERB_POS = 'r' |
static const PartOfSpeech | PRONOUN_POS = 'p' |
static const PartOfSpeech | CONJUNCTION_POS = 'c' |
static const PartOfSpeech | DETERMINER_POS = 'd' |
static const PartOfSpeech | MISC_POS = '?' |
C++ representation of the MarkLogic token.
The begin and end offsets index into the codepoint array: begin is the position of the first codepoint in the token, and end is the position of the codepoint following the last codepoint in the token.
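The offset convention described above is a half-open range, which the following minimal sketch illustrates. It does not use the real header: sketch::Token is a hypothetical stand-in that mirrors the documented members, std::u32string stands in for the codepoint array, and tokenLength and tokenText are illustrative helpers that are not part of the API. The sketch also assumes that the three-argument constructor leaves pos as UNSPECIFIED_POS, which the documentation does not state.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical stand-in that mirrors the members documented for
// marklogic::Token; real code would include <MarkLogic.h> instead.
struct Token {
  typedef unsigned char TokenType;
  typedef unsigned char PartOfSpeech;

  static constexpr TokenType WORD = 'w';
  static constexpr TokenType PUNCT = 'p';
  static constexpr PartOfSpeech UNSPECIFIED_POS = '\0';

  unsigned begin;  // offset of the first codepoint in the token
  unsigned end;    // offset of the codepoint following the last one
  TokenType type;
  PartOfSpeech pos;

  // Assumption: the three-argument form leaves pos as UNSPECIFIED_POS.
  Token(unsigned _begin, unsigned _end, TokenType _type,
        PartOfSpeech _pos = UNSPECIFIED_POS)
      : begin(_begin), end(_end), type(_type), pos(_pos) {}
};

// Number of codepoints covered by the half-open range [begin, end).
inline unsigned tokenLength(const Token& t) {
  assert(t.begin <= t.end);
  return t.end - t.begin;
}

// Copies the token's codepoints out of the full codepoint sequence.
inline std::u32string tokenText(const std::u32string& codepoints,
                                const Token& t) {
  assert(t.end <= codepoints.size());
  return codepoints.substr(t.begin, tokenLength(t));
}

}  // namespace sketch
```

With this convention, a token's length is simply end minus begin, and adjacent tokens share a boundary: the end of one token equals the begin of the next.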
typedef unsigned char marklogic::Token::PartOfSpeech |
The part of speech of the token being returned. Part of speech is only relevant for word tokens.
LexerUDF implementations may return tokens with the following parts of speech. Returning UNSPECIFIED_POS for every token reduces the storage overhead and processing time. Implementations are free to return other unsigned char values in the range 0x01 to 0x7F as well as any of the values listed here, but the high bit is reserved.
UNSPECIFIED_POS: part of speech is not reported
NOUN_POS: a noun, such as "laptop"
VERB_POS: a verb, such as "ate"
ADJECTIVE_POS: an adjective, such as "green"
ADVERB_POS: an adverb, such as "rapidly"
PRONOUN_POS: a pronoun, such as "her"
CONJUNCTION_POS: a conjunction, such as "and"
DETERMINER_POS: a determiner, such as "the"
MISC_POS: some other miscellaneous part of speech
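The rule about the reserved high bit can be checked mechanically. The sketch below is illustrative only: the constants repeat documented values in a hypothetical sketch namespace, PREPOSITION_POS is an invented custom value, and isAllowedPartOfSpeech and sanitizePartOfSpeech are hypothetical helpers, not part of the API. Falling back to MISC_POS for an out-of-range value is a choice made for this example, not documented behavior.

```cpp
namespace sketch {

typedef unsigned char PartOfSpeech;

// Values copied from the constants documented for marklogic::Token.
const PartOfSpeech UNSPECIFIED_POS = '\0';
const PartOfSpeech NOUN_POS = 'n';
const PartOfSpeech MISC_POS = '?';

// Hypothetical custom value, not part of the documented API.
const PartOfSpeech PREPOSITION_POS = 'i';

// True when the reserved high bit is clear, that is, the value is
// UNSPECIFIED_POS (0x00) or lies in the range 0x01 to 0x7F.
inline bool isAllowedPartOfSpeech(PartOfSpeech p) {
  return (p & 0x80) == 0;
}

// Sketch-only policy: replace a value that uses the reserved bit
// with MISC_POS rather than returning it from a lexer.
inline PartOfSpeech sanitizePartOfSpeech(PartOfSpeech p) {
  return isAllowedPartOfSpeech(p) ? p : MISC_POS;
}

}  // namespace sketch
```

A lexer that defines its own part-of-speech codes could run a check like this once over its code table instead of on every token.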
typedef unsigned char marklogic::Token::TokenType |
The kind of token being returned.
LexerUDF implementations will return tokens of the following types.
SPACE: a whitespace token, will not be indexed
PUNCT: a punctuation token, will not be indexed
WORD: a separate word in the current language, will be indexed
SPECIAL: some other kind of token, will be indexed
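The mapping from token type to indexing behavior can be sketched as follows. The constants repeat the documented values in a hypothetical sketch namespace; isIndexed and classifyAscii are illustrative helpers that are not part of the API. The classifier handles single ASCII characters only and is far simpler than a real language-aware lexer, and treating any undocumented type value as not indexed is an assumption of this sketch.

```cpp
#include <cctype>

namespace sketch {

typedef unsigned char TokenType;

// Values copied from the constants documented for marklogic::Token.
const TokenType SPACE = 's';
const TokenType PUNCT = 'p';
const TokenType WORD = 'w';
const TokenType SPECIAL = 'x';

// WORD and SPECIAL tokens are indexed; SPACE and PUNCT are not.
// Assumption: any other value is treated as not indexed.
inline bool isIndexed(TokenType type) {
  return type == WORD || type == SPECIAL;
}

// Hypothetical ASCII-only classifier for a single character.
inline TokenType classifyAscii(unsigned char c) {
  if (std::isspace(c)) return SPACE;
  if (std::ispunct(c)) return PUNCT;
  if (std::isalnum(c)) return WORD;
  return SPECIAL;  // anything else, such as control characters
}

}  // namespace sketch
```

In a real LexerUDF the classification would operate on runs of codepoints rather than single characters, but the resulting type values would be used the same way.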