BM25 Relevance Ranking
MarkLogic 12.0 EA1 supports Best Match 25 (BM25) for relevance scoring of search results.
The BM25 method of scoring documents is widely used because of its effectiveness in ranking documents based on relevance to a query. With the rise of Generative AI (GenAI) and Retrieval Augmented Generation (RAG) workflows, BM25 is often used to retrieve relevant documents to supply as context to a Large Language Model (LLM). LLM-generated answers tend to improve in quality when well-ranked documents serve as the sources of knowledge.
The MarkLogic Server core text search has always returned results based on relevance. The default MarkLogic Server relevance scoring method, logTF-IDF, implicitly weights document length by counting unique terms. The new BM25 scoring method explicitly does this by adding a tunable parameter to directly increase or decrease the weight of a document's length on the score.
Consider the following documents' term frequencies and lengths, retrieved by cts.wordQuery("trojan"):
URI | Term Frequency | Length (Average: 1396 characters) |
---|---|---|
/doc1.json | 3 | 1628 |
/doc2.json | 2 | 976 |
/doc3.json | 2 | 916 |
/doc4.json | 10 | 2592 |
/doc5.json | 2 | 868 |
Compare the ranking differences between traditional logTF-IDF, which uses term frequency alone, and BM25, which heavily penalizes /doc4.json for being significantly longer than the average document length of 1396 characters:
Ranking | logTF-IDF | BM25 |
---|---|---|
1st | /doc4.json | /doc5.json |
2nd | /doc1.json | /doc2.json |
3rd | /doc2.json | /doc3.json |
4th | /doc3.json | /doc1.json |
5th | /doc5.json | /doc4.json |
The BM25 scoring method accounts for the fact that a short document with a high term frequency is more likely to be about that term than a long document with the same term frequency, which is more likely to simply be using that term in passing.
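To make the length effect concrete, the sketch below computes the textbook Okapi BM25 length normalization factor for the five documents in the table above. This is an illustration under stated assumptions, not MarkLogic's internal scoring formula, which this section does not give. The parameter `b` and the function names are hypothetical and only play a role analogous to the BM25 length weight. The sketch is not expected to reproduce the BM25 ranking column exactly.

```javascript
// Illustrative sketch only: textbook Okapi BM25 length normalization,
// NOT MarkLogic's internal scoring formula.
const docs = [
  { uri: "/doc1.json", tf: 3, length: 1628 },
  { uri: "/doc2.json", tf: 2, length: 976 },
  { uri: "/doc3.json", tf: 2, length: 916 },
  { uri: "/doc4.json", tf: 10, length: 2592 },
  { uri: "/doc5.json", tf: 2, length: 868 },
];

// Average document length across the result set (1396 for the table above).
const averageLength =
  docs.reduce((sum, doc) => sum + doc.length, 0) / docs.length;

// Textbook factor: 1 - b + b * (length / averageLength).
// Above 1: the document is longer than average and is penalized.
// Below 1: the document is shorter than average and is favored.
// `b` is an assumed stand-in for a length weight between 0 and 1.
function lengthNormalization(length, avgLength, b) {
  return 1 - b + b * (length / avgLength);
}

// Build one row per document for a given value of b.
function normalizationTable(b) {
  return docs.map((doc) => ({
    uri: doc.uri,
    factor: lengthNormalization(doc.length, averageLength, b),
  }));
}

// Print the factors for a light and a heavy length weight.
for (const b of [0.25, 1.0]) {
  console.log(`b = ${b}`);
  for (const row of normalizationTable(b)) {
    console.log(`  ${row.uri}: ${row.factor.toFixed(3)}`);
  }
}
```

In this textbook form, a larger `b` widens the gap between /doc4.json (the longest document) and /doc5.json (the shortest), while `b = 0` gives every document a factor of exactly 1, so length is ignored.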
Use one of these code samples to enable the BM25 scoring method for core text search queries. Each sample:
- Specifies BM25 as the score method. Default: logTF-IDF
- Provides the optional BM25 length weight parameter, set to 0.25:
  - Default: 0.33 (no BM25 length weight parameter)
  - Lowest: Just above 0.0 (most similar to logTF-IDF results)
  - Highest: 1.0 (weight document length as heavily as possible)
CTS
    cts.search(
      cts.wordQuery("trojan"),
      ["score-bm25", "bm25-length-weight=0.25"]
    )
Optic
    const op = require('/MarkLogic/optic');

    op.fromSearch(
      cts.wordQuery("trojan"),
      null,
      null,
      { "scoreMethod": "bm25", "bm25LengthWeight": 0.25 }
    )
Note: op.fromSearchDocs() also takes BM25 scoring parameters.
Search/REST Search API
    import module namespace search = "http://marklogic.com/appservices/search"
        at "/MarkLogic/appservices/search/search.xqy";

    search:search("trojan",
      <options xmlns="http://marklogic.com/appservices/search">
        <search-option>score-bm25</search-option>
        <search-option>bm25-length-weight=0.25</search-option>
      </options>
    )
JSearch
    import jsearch from '/MarkLogic/jsearch.mjs';

    jsearch.documents()
      .where(cts.wordQuery("trojan"))
      .withOptions({ search: ['score-bm25', 'bm25-length-weight=0.25'] })
      .result()
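The CTS, Search, and JSearch samples above pass the same two option strings. A minimal sketch of a helper that assembles them is shown below. The function name `bm25SearchOptions` is hypothetical and not part of any MarkLogic API. Rejecting values outside the documented range (just above 0.0 up to 1.0) is a choice made here for illustration, not a statement about how the server handles such values.

```javascript
// Hypothetical helper: builds the string options used in the samples above.
// Not part of any MarkLogic API.
function bm25SearchOptions(lengthWeight) {
  const options = ["score-bm25"];

  // No weight supplied: omit the parameter so the default applies.
  if (lengthWeight === undefined) {
    return options;
  }

  // Documented range: just above 0.0 (lowest) through 1.0 (highest).
  // NaN fails both comparisons and is rejected as well.
  const inRange =
    typeof lengthWeight === "number" && lengthWeight > 0 && lengthWeight <= 1;
  if (!inRange) {
    throw new RangeError(
      "BM25 length weight must be greater than 0.0 and at most 1.0"
    );
  }

  options.push(`bm25-length-weight=${lengthWeight}`);
  return options;
}

// Logs the two option strings used in the samples above.
console.log(bm25SearchOptions(0.25));
// Logs only the score method, leaving the length weight at its default.
console.log(bm25SearchOptions());
```

Within MarkLogic Server-Side JavaScript, the returned array could then be passed as the options argument, for example `cts.search(cts.wordQuery("trojan"), bm25SearchOptions(0.25))`.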
Using core text search within MarkLogic Server has always been the key to finding the data that you need. BM25 is an extra knob to further tune your search results.