What's New in MarkLogic 11

Default Stemming and Tokenization Libraries Changed for Most Languages

In MarkLogic 9 and later, the default tokenization and stemming libraries changed for all languages, with the exception of English tokenization. Consequently, some tokenization and stemming behavior differs between MarkLogic 8 and MarkLogic 9. In most cases, stemming and tokenization are more precise in MarkLogic 9 and later.

If you upgrade to MarkLogic 9 or later from an earlier version of MarkLogic, your installation will continue to use the legacy stemming and tokenization libraries as the language baseline. Any fresh installation of MarkLogic will use the new libraries. You can change the baseline configuration using admin:cluster-set-language-baseline.

Note

  • Changing the baseline requires a cluster-wide restart and a reindex to avoid stemming and tokenization anomalies.

  • Use of the legacy libraries is deprecated. These libraries will be removed from MarkLogic in a future release.
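
The baseline change described above can be sketched in XQuery with the Admin API. This is a minimal sketch rather than a definitive procedure: it assumes the baseline is named by the string "ml9" (new libraries) or "legacy" (pre-MarkLogic 9 libraries), and that the query runs as a user with administrative privileges. Check the admin:cluster-set-language-baseline reference for your release before relying on it.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Load the current cluster configuration. :)
let $config := admin:get-configuration()
(: Assumption: "ml9" selects the new libraries; "legacy" selects the old ones. :)
let $config := admin:cluster-set-language-baseline($config, "ml9")
(: Persist the change. The cluster-wide restart and reindex from the note still apply. :)
return admin:save-configuration($config)
```

To inspect the current setting before changing it, admin:cluster-get-language-baseline($config) should return the baseline name for the same configuration element.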

Unless you use the legacy language baseline, reindexing is required for content in the following languages:

  • Chinese

  • Danish

  • Dutch, if you want to query with decompounding

  • Finnish

  • German

  • Hungarian

  • Japanese

  • Korean, unless you use decompounding

  • Norwegian (Bokmål and Nynorsk), if you want to query with decompounding

  • Norwegian (generic ‘no’ lang code), though use of generic ‘no’ is not recommended

  • Romanian

  • Russian

  • Swedish, if you want to query with decompounding

  • Tamil

  • Turkish

For other languages (except English), you might be able to avoid incompatibilities depending on the nature of your queries, but reindexing is still strongly recommended.

Tokenization and stemming are significantly different for Japanese. Tokenization is significantly different for Chinese (both simplified and traditional). The impact on other languages is more nuanced, but should lead to better results overall. You might observe some relevance score changes on stemmed searches due to the higher degree of precision. If you require low-level details about the impact on a specific language and you have an active maintenance contract, you can contact MarkLogic Technical Support.

For more details on incompatibilities related to the changes to stemming and tokenization, see MarkLogic Server v9 Tokenization and Stemming from MarkLogic Technical Support.

For more information about the new tokenization and stemming support, see Default Stemming and Tokenization Libraries Changed for Most Languages.