User-Defined Function API  9.0
marklogic::Token Class Reference

C++ representation of the MarkLogic token.

#include <MarkLogic.h>

Public Types

typedef unsigned char  TokenType
  The kind of token being returned.
 
typedef unsigned char  PartOfSpeech
  The part of speech of the token being returned. Part of speech is only relevant for word tokens.
 

Public Member Functions

  Token (unsigned _begin, unsigned _end, TokenType _type, PartOfSpeech _pos)
 
  Token (unsigned _begin, unsigned _end, TokenType _type)
 
  Token (const Token &t)
 

Public Attributes

unsigned  begin
 
unsigned  end
 
TokenType  type
 
PartOfSpeech  pos
 

Static Public Attributes

static const TokenType  SPACE = 's'
 
static const TokenType  PUNCT = 'p'
 
static const TokenType  WORD = 'w'
 
static const TokenType  SPECIAL = 'x'
 
static const PartOfSpeech  UNSPECIFIED_POS = '\0'
 
static const PartOfSpeech  NOUN_POS = 'n'
 
static const PartOfSpeech  VERB_POS = 'v'
 
static const PartOfSpeech  ADJECTIVE_POS = 'a'
 
static const PartOfSpeech  ADVERB_POS = 'r'
 
static const PartOfSpeech  PRONOUN_POS = 'p'
 
static const PartOfSpeech  CONJUNCTION_POS = 'c'
 
static const PartOfSpeech  DETERMINER_POS = 'd'
 
static const PartOfSpeech  MISC_POS = '?'
 

Detailed Description

C++ representation of the MarkLogic token.

The offsets point to the position in the codepoint array of the first codepoint in the token and the codepoint following the last codepoint in the token.
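The offset convention above describes a half-open range: begin is inclusive and end is exclusive, so end minus begin is the token length in codepoints. The following is a minimal sketch of that arithmetic, not the library's own code. The Token struct is a stand-in that mirrors the documented members so the example compiles without MarkLogic.h; the helpers tokenLength and tokenText are hypothetical, std::u32string is assumed as a substitute for the codepoint array, and the three-argument constructor is assumed to leave pos as UNSPECIFIED_POS.

```cpp
#include <cassert>
#include <string>

namespace offsets_sketch {

// Stand-in that mirrors the documented marklogic::Token members so this
// sketch compiles without MarkLogic.h. Real code would use marklogic::Token.
struct Token {
  typedef unsigned char TokenType;
  typedef unsigned char PartOfSpeech;

  static const TokenType WORD = 'w';
  static const PartOfSpeech UNSPECIFIED_POS = '\0';

  unsigned begin;    // index of the first codepoint in the token
  unsigned end;      // index of the codepoint following the last one
  TokenType type;
  PartOfSpeech pos;

  Token(unsigned _begin, unsigned _end, TokenType _type, PartOfSpeech _pos)
      : begin(_begin), end(_end), type(_type), pos(_pos) {}

  // Assumption: the three-argument form leaves the part of speech unspecified.
  Token(unsigned _begin, unsigned _end, TokenType _type)
      : begin(_begin), end(_end), type(_type), pos(UNSPECIFIED_POS) {}
};

// Hypothetical helper: number of codepoints covered by the token.
inline unsigned tokenLength(const Token& t) {
  return t.end - t.begin;
}

// Hypothetical helper: copy the token's codepoints out of the input.
// std::u32string stands in for the codepoint array here.
inline std::u32string tokenText(const std::u32string& codepoints,
                                const Token& t) {
  return codepoints.substr(t.begin, t.end - t.begin);
}

}  // namespace offsets_sketch
```

For the codepoints of "red fox", a word token with begin 4 and end 7 covers "fox" and has a length of 3.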

Member Typedef Documentation

typedef unsigned char marklogic::Token::PartOfSpeech

The part of speech of the token being returned. Part of speech is only relevant for word tokens.

LexerUDF implementations may return tokens with the following parts of speech. Returning UNSPECIFIED_POS for every token reduces storage overhead and processing time. In addition to the values listed here, implementations may return any other unsigned char value in the range 0x01 to 0x7F; the high bit is reserved.

UNSPECIFIED_POS: part of speech is not reported
NOUN_POS: a noun, such as "laptop"
VERB_POS: a verb, such as "ate"
ADJECTIVE_POS: an adjective, such as "green"
ADVERB_POS: an adverb, such as "rapidly"
PRONOUN_POS: a pronoun, such as "her"
CONJUNCTION_POS: a conjunction, such as "and"
DETERMINER_POS: a determiner, such as "the"
MISC_POS: some other miscellaneous part of speech
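The range rule for part-of-speech values can be checked with a single bit test: any value with the high bit clear (0x00 to 0x7F) is either UNSPECIFIED_POS, one of the listed values, or a permitted custom value. The sketch below illustrates this under stated assumptions. It redeclares two of the documented constants locally so it compiles without MarkLogic.h; isAllowedPartOfSpeech and PREPOSITION_POS are hypothetical names that are not part of the API.

```cpp
#include <cassert>

namespace pos_sketch {

typedef unsigned char PartOfSpeech;

// Values copied from the documented constants on marklogic::Token.
const PartOfSpeech UNSPECIFIED_POS = '\0';
const PartOfSpeech NOUN_POS = 'n';

// Hypothetical custom value chosen by an implementation. It lies within
// 0x01..0x7F and does not collide with any of the listed values.
const PartOfSpeech PREPOSITION_POS = 'i';

// Hypothetical helper: true when the reserved high bit is clear.
inline bool isAllowedPartOfSpeech(PartOfSpeech pos) {
  return (pos & 0x80) == 0;
}

}  // namespace pos_sketch
```

A lexer might run such a check in debug builds before handing a token back, so that a value such as 0x80 is caught early rather than returned.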

typedef unsigned char marklogic::Token::TokenType

The kind of token being returned.

LexerUDF implementations will return tokens of the following types.

SPACE: a whitespace token, will not be indexed
PUNCT: a punctuation token, will not be indexed
WORD: a separate word in the current language, will be indexed
SPECIAL: some other kind of token, will be indexed
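The indexing behavior listed for each token type can be expressed as a small predicate. This is a sketch for illustration only: the constants are copied from the documented values, while Span, isIndexed, and countIndexed are hypothetical names. Treating unrecognized type values as not indexed is a choice made here, not something the documentation states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace type_sketch {

typedef unsigned char TokenType;

// Values copied from the documented constants on marklogic::Token.
const TokenType SPACE = 's';
const TokenType PUNCT = 'p';
const TokenType WORD = 'w';
const TokenType SPECIAL = 'x';

// Hypothetical minimal token record: offsets plus a type.
struct Span {
  unsigned begin;
  unsigned end;
  TokenType type;
};

// Per the list above, WORD and SPECIAL tokens are indexed;
// SPACE and PUNCT tokens are not.
inline bool isIndexed(TokenType type) {
  return type == WORD || type == SPECIAL;
}

// Hypothetical helper: count how many tokens in a sequence would be indexed.
inline std::size_t countIndexed(const std::vector<Span>& tokens) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    if (isIndexed(tokens[i].type)) ++n;
  }
  return n;
}

}  // namespace type_sketch
```

For the text "Hi, you" split into a word, a punctuation token, a space, and a second word, countIndexed reports 2.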


The documentation for this class was generated from the following file: