What's New in MarkLogic 12

BM25 Relevance Ranking

MarkLogic 12.0 EA1 supports Best Match 25 (BM25) for relevance scoring of search results.

The BM25 method of scoring documents is widely used because it ranks documents effectively by their relevance to a query. With the rise of Generative AI (GenAI) and Retrieval Augmented Generation (RAG) workflows, BM25 is often used to retrieve the relevant documents that are supplied as context to a Large Language Model (LLM). LLM-generated answers tend to improve in quality when well-ranked documents serve as the sources of knowledge.

The MarkLogic Server core text search has always returned results based on relevance. The default MarkLogic Server relevance scoring method, logTF-IDF, implicitly weights document length by counting unique terms. The new BM25 scoring method does this explicitly: a tunable parameter directly increases or decreases the influence of a document's length on the score.

Consider the term frequencies and lengths of the following documents, all of which match cts.wordQuery("trojan"):

URI          Term Frequency   Length (Average: 1396 characters)
/doc1.json   3                1628
/doc2.json   2                976
/doc3.json   2                916
/doc4.json   10               2592
/doc5.json   2                868

Compare the ranking differences between traditional logTF-IDF, which uses term frequency alone, and BM25, which heavily penalizes /doc4.json for being significantly longer than the average document length of 1396:

Ranking   logTF-IDF    BM25
1st       /doc4.json   /doc5.json
2nd       /doc1.json   /doc2.json
3rd       /doc2.json   /doc3.json
4th       /doc3.json   /doc1.json
5th       /doc5.json   /doc4.json

The BM25 scoring method accounts for the fact that a short document with a high term frequency is more likely to be about that term than a long document with the same term frequency, which is more likely to simply be using that term in passing.
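
To make the length effect concrete, the following sketch computes a length-normalization factor for the five documents listed earlier. It is an illustration only: the source does not give MarkLogic's internal formula, so the textbook Okapi BM25 factor `1 - b + b * (length / averageLength)` and the reading of `b` as the BM25 length weight are assumptions, and the names `docs` and `lengthNorm` are hypothetical. In the textbook formula, a factor above 1 reduces the contribution of a document's term frequency and a factor below 1 increases it.

```javascript
// Term frequencies and lengths copied from the example documents.
const docs = [
  { uri: "/doc1.json", tf: 3, length: 1628 },
  { uri: "/doc2.json", tf: 2, length: 976 },
  { uri: "/doc3.json", tf: 2, length: 916 },
  { uri: "/doc4.json", tf: 10, length: 2592 },
  { uri: "/doc5.json", tf: 2, length: 868 }
];

// Average document length across the matched documents.
const averageLength =
  docs.reduce((sum, d) => sum + d.length, 0) / docs.length;

// Textbook BM25 length normalization (an assumption, not necessarily
// MarkLogic's exact formula). b stands in for the BM25 length weight.
function lengthNorm(length, b) {
  return 1 - b + b * (length / averageLength);
}

const b = 0.25;
const factors = docs.map(d => ({
  uri: d.uri,
  factor: lengthNorm(d.length, b)
}));

for (const f of factors) {
  console.log(f.uri, f.factor.toFixed(3));
}
```

With `b` set to 0.25, /doc4.json receives the largest factor (about 1.21) and /doc5.json the smallest (about 0.91). As `b` approaches 0, every factor approaches 1 and length stops mattering in this sketch. The sketch shows only the direction of the length adjustment; it does not attempt to reproduce the server's actual scores or ranking.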

Use one of these code samples to enable the BM25 scoring method for core text search queries:

  • Each specifies BM25 as the score method.

    • Default: logTF-IDF

  • Each sets the optional BM25 length weight parameter to 0.25:

    • Default: 0.33 (applies when no BM25 length weight parameter is specified)

    • Lowest: Just above 0.0 (most similar to logTF-IDF results)

    • Highest: 1.0 (weight document length as heavily as possible)

CTS

cts.search(
  cts.wordQuery("trojan"), 
  [
    "score-bm25",
    "bm25-length-weight=0.25"
  ]
)

Optic

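const op = require('/MarkLogic/optic');
// Arguments: query, columns (null for the defaults), qualifier name, options.
op.fromSearch(
  cts.wordQuery("trojan"),
  null,
  null,
  {
    "scoreMethod": "bm25",
    "bm25LengthWeight": 0.25
  }
)
.result() // execute the plan and return the rows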

Note

op.fromSearchDocs() also takes BM25 scoring parameters.

Search/REST Search API

import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";
search:search("trojan",
  <options
      xmlns="http://marklogic.com/appservices/search">
    <search-option>score-bm25</search-option>
    <search-option>bm25-length-weight=0.25</search-option>
  </options>
)

JSearch

import jsearch from '/MarkLogic/jsearch.mjs';
jsearch.documents()
  .where(cts.wordQuery("trojan"))
  .withOptions(
    {
      search: [
        'score-bm25', 
        'bm25-length-weight=0.25'
      ]
    }
  )
  .result()
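
The CTS and JSearch samples pass the same two option strings, so one way to keep the weight in a single place is a small helper that builds them. This is a sketch rather than part of any MarkLogic API: `bm25Options` is a hypothetical name, and the range check simply encodes the bounds listed earlier (greater than 0.0, at most 1.0).

```javascript
// Hypothetical helper: builds the option strings used in the CTS and
// JSearch samples. Omitting the weight leaves the server default in effect.
function bm25Options(lengthWeight) {
  const options = ["score-bm25"];
  if (lengthWeight !== undefined) {
    const inRange =
      typeof lengthWeight === "number" && lengthWeight > 0 && lengthWeight <= 1;
    if (!inRange) {
      throw new RangeError(
        "BM25 length weight must be greater than 0.0 and at most 1.0"
      );
    }
    options.push(`bm25-length-weight=${lengthWeight}`);
  }
  return options;
}

console.log(bm25Options(0.25));
```

In a server-side script, the returned array could be passed as the options argument, for example cts.search(cts.wordQuery("trojan"), bm25Options(0.25)).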

Using core text search within MarkLogic Server has always been the key to finding the data that you need. BM25 is an extra knob to further tune your search results.