Default Stemming and Tokenization Libraries Changed for Most Languages

In MarkLogic 9 and later, the default tokenization and stemming libraries have been changed for all languages (except English tokenization). Consequently, some tokenization and stemming behavior changed between MarkLogic 8 and MarkLogic 9. In most cases, stemming and tokenization will be more precise in MarkLogic 9 and later.

If you upgrade to MarkLogic 9 or later from an earlier version of MarkLogic, your installation will continue to use the legacy stemming and tokenization libraries as the language baseline. Any fresh installation of MarkLogic will use the new libraries. You can change the baseline configuration using admin:cluster-set-language-baseline.

Note

Changing the baseline requires a cluster-wide restart and a reindex to avoid stemming and tokenization anomalies.
Use of the legacy libraries is deprecated. These libraries will be removed from MarkLogic in a future release.
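
As an illustration of the baseline change described above, the following is a minimal XQuery sketch that switches a cluster to the new libraries. It assumes that the baseline value "ml9" selects the new libraries and "legacy" selects the old ones; confirm the accepted values in the Admin API reference for your release before running it.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Load the current cluster configuration. :)
let $config := admin:get-configuration()
(: Assumption: "ml9" names the new baseline and "legacy" names the old one. :)
let $config := admin:cluster-set-language-baseline($config, "ml9")
(: Persist the change. The restart and reindex described in the Note still apply. :)
return admin:save-configuration($config)
```

To check which baseline a cluster currently uses before changing it, admin:cluster-get-language-baseline can be called on the same configuration element.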

Unless you use the legacy language baseline, reindexing is required for content in the following languages:

Chinese
Danish
Dutch, if you want to query with decompounding
Finnish
German
Hungarian
Japanese
Korean, unless you use decompounding
Norwegian (Bokmål and Nynorsk), if you want to query with decompounding
Norwegian (generic ‘no’ lang code), though use of generic ‘no’ is not recommended
Romanian
Russian
Swedish, if you want to query with decompounding
Tamil
Turkish

For other languages (except English), you might be able to avoid incompatibilities depending on the nature of your queries, but reindexing is still strongly recommended.

Tokenization and stemming are significantly different for Japanese. Tokenization is significantly different for Chinese (both simplified and traditional). The impact on other languages is more nuanced, but should lead to better results overall. You might observe some relevance score changes on stemmed searches due to the higher degree of precision. If you require low-level details about the impact on a specific language and you have an active maintenance contract, you can contact MarkLogic Technical Support.

For more details on incompatibilities related to the changes to stemming and tokenization, see MarkLogic Server v9 Tokenization and Stemming from MarkLogic Technical Support.

For more information about the new tokenization and stemming support, see Default Stemming and Tokenization Libraries Changed for Most Languages.
