
What's New in MarkLogic 12

Native Vector Support

With the recent developments in Generative AI (GenAI) and Retrieval-Augmented Generation (RAG), vector search has become a valuable tool in a search developer’s toolbox, further improving information retrieval through direct vector search or vector-based reranking. Combining keyword-based scoring methods like BM25 with vector similarity operations to perform a “hybrid search” boosts documents that are both textually and semantically similar to the top of your search results.

In a typical RAG architecture, text embedding models map chunks of text to representative vectors of floating-point numbers. Separate but semantically similar chunks of text map to vectors that are close to one another in the high-dimensional vector space.

These vectors can then be used in operations such as cosine similarity, dot product, or basic vector arithmetic.
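To make these operations concrete, here is a minimal sketch in plain JavaScript (independent of MarkLogic's vec API) of the dot product and cosine similarity between two small vectors:

```javascript
// Dot product of two equal-length numeric vectors
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Euclidean norm (vector magnitude)
function norm(a) {
  return Math.sqrt(dot(a, a));
}

// Cosine similarity: 1 for identical directions, 0 for orthogonal vectors
function cosineSimilarity(a, b) {
  return dot(a, b) / (norm(a) * norm(b));
}

const v1 = [1, 0, 1];
const v2 = [1, 0, 0];
console.log(cosineSimilarity(v1, v1));            // → 1
console.log(cosineSimilarity(v1, v2).toFixed(4)); // → "0.7071"
```

Real embedding vectors have hundreds or thousands of dimensions (1536 for text-embedding-ada-002), but the arithmetic is identical.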

MarkLogic 12.0 EA1 introduces operators on vectors. This is the first step towards the full-scale vector search support planned for MarkLogic 12 GA. While full-scale support requires an implementation of an Approximate Nearest Neighbor (ANN) index, the operations available in MarkLogic 12.0 EA1 offer a preview of the API and allow exploration of hybrid search and vector-based reranking. They include vector creation, basic arithmetic, score helpers, and cosine similarity and Euclidean distance functions that support linear scans for nearest neighbors.

Integration with text embedding models like those provided by OpenAI is beyond the scope of this article; refer to the documentation of your model of choice for details on mapping text to its vector representation. The rest of this article assumes that this integration has already happened and that these vectors are available.

Vectors output by text embedding models are often represented as a JSON array:

JSON (URI: /sample.json)

{
  "envelope": {
    "headers": [
      {
        "textEmbedding" : {
          "lang": "zxx",
          "model": "text-embedding-ada-002",
          "source": "OpenAI",
          "dimension": 1536,
          "vector": [
            0.435647279024124, 0.167360082268715, 0.577132880687714, 0.0405717119574547, -0.345730692148209,
            ...
            0.413799345493317, 0.339704662561417, -0.259793192148209, 0.118780590593815, 0.649678707122803
          ]
        }
      }
    ],
    "instance": {
      "url": "https://simple.wikipedia.org/wiki?curid=8126",
      "text": "The Trojan War was one of the most important ... in the 12th century BC."
    }
  }
}

They can also be represented as a serialized array of numbers in XML:

XML (URI: /sample.xml)

<envelope>
  <headers>
    <text-embedding>
      <model>text-embedding-ada-002</model>
      <source>OpenAI</source>
      <dimension>1536</dimension>
      <vector xml:lang="zxx">[0.435647279024124,0.167360082268715,0.577132880687714,0.0405717119574547,-0.345730692148209,
        ...,0.413799345493317,0.339704662561417,-0.259793192148209,0.118780590593815,0.649678707122803]</vector>
    </text-embedding>
  </headers>
  <instance>
    <url>https://simple.wikipedia.org/wiki?curid=8126</url>
    <text>The Trojan War was one of the most important ... in the 12th century BC.</text>
  </instance>
</envelope>

Ingestion

SQL-aware Retrieval-Augmented Generation (RAG) systems and RAG-enabled Business Intelligence (BI) tools are commonly configured to interact with SQL interfaces. MarkLogic Server supports this through Template Driven Extraction (TDE)-based views, which project document data into table-like structures with columns that can be declared as the scalar type vector.

Assuming the context of the template is /envelope, this is what your vector column would look like:

JSON column declaration within a TDE view

"columns": [{
    "name": "textEmbedding",
    "scalarType": "vector",
    "val": "vec:vector(headers/textEmbedding/array-node('vector'))",
    "dimension": "1536"
  },
  …
]

XML column declaration within a TDE view

<columns>
  <column>
    <name>textEmbedding</name>
    <scalar-type>vector</scalar-type>
    <val>vec:vector(headers/text-embedding/vector)</val>
    <dimension>1536</dimension>
  </column>
 …
</columns>
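For orientation, a column declaration like the JSON one above lives inside a complete TDE template. A minimal sketch is shown below; the schema and view names match the examples/article view queried later in this article, but the exact surrounding fields of your template will vary:

```json
{
  "template": {
    "context": "/envelope",
    "rows": [
      {
        "schemaName": "examples",
        "viewName": "article",
        "columns": [
          {
            "name": "textEmbedding",
            "scalarType": "vector",
            "val": "vec:vector(headers/textEmbedding/array-node('vector'))",
            "dimension": "1536"
          }
        ]
      }
    ]
  }
}
```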

These generated columns can then be accessed through Optic queries.

Reference: "Template Dialect and Data Transformation Functions" in the "Template Driven Extraction (TDE)" chapter of the Application Developer's Guide

Query

The following example query focuses on documents that contain the term trojan. Rows from a view called article are joined on fragment ID, constraining the results to documents that match the search term. queryVector contains the vector generated by sending a chunk of one of these documents to a third-party or internal model. queryVector can be compared with the vector column of each row to produce a vector rating, in this case a cosine similarity. This rating can either be used to sort the results directly or be combined with the cts.score of a document search to compute a hybrid score:

SJS

const op = require('/MarkLogic/optic');

const documentQuery = cts.wordQuery('trojan');
const queryVector = vec.vector([
    -0.05992345495992422, -0.1234123430928, ... , -4.549399422136e-05, -0.012034502243
  ]);

const documents = op.fromSearch(
    documentQuery,
    ['fragmentId', 'score'],
    'docs_view',
    {
      'scoreMethod': 'bm25',
      'bm25LengthWeight': 0.5
    }
  ).joinDoc('doc', op.fragmentIdCol('fragmentId'));
const rows = op.fromView(
    'examples',
    'article',
    null,
    op.fragmentIdCol('$$viewFragment')
  );

const result =
  documents
    .joinInner(
      rows,
      op.on(
        op.fragmentIdCol('fragmentId'),
        op.fragmentIdCol('$$viewFragment')
      )
    )
    .orderBy(op.desc(op.col('score')))
    .limit(30)
    .bind(op.as('queryVector', queryVector))
    .bind(op.as('cosineSim',
      op.vec.cosineSimilarity(
        op.col('textEmbedding'),
        op.col('queryVector')
      )
    ))
    .bind(op.as('hybridScore',
      op.vec.vectorScore(op.col('score'), op.col('cosineSim'), 0.1)
    ))
    .select([
      op.col('doc'),
      op.col('cosineSim'),
      op.col('score'),
      op.col('hybridScore')
    ])
    .orderBy(op.desc(op.col('hybridScore')))
    .limit(20)
    .result();
result;
  • op.fromSearch() retrieves the cts.score.

  • op.joinDoc() joins the document content to pass to the RAG pipeline.

  • op.limit() reduces the number of documents to be returned for vector computation.

  • op.fromView() retrieves the vector column for processing.

  • op.bind(op.as('cosineSim', ... )) binds a new column that is the result of the similarity calculation between each vector value in the examples view and the query vector.

  • op.vec.cosineSimilarity() computes the cosine similarity between the vector in the textEmbedding column and queryVector.

  • op.vec.vectorScore() is a convenience function. It takes cts.score as a base in its first argument and adjusts it according to the vector similarity in its second argument. In this example, the higher the cosine similarity of the vector, the greater the boost to that base cts.score. This pushes the more semantically similar documents higher in the result set.

    This formula is used instead of Reciprocal Rank Fusion (RRF), a common method to fuse search results from different sources into one final score.

    Add op.vec.vectorScore()'s third argument, similarityWeight, to tweak the lift that cosine similarity has on the hybrid score:

    • Default: 0.1 (no similarityWeight argument)

    • Lowest: 0.0 (cts.score remains unchanged)

    • Highest: 1.0 (boosts cts.score significantly as the vector similarity (second argument) approaches 1.0)

  • op.select() renders these columns in its result:

    • doc: The document content.

    • cosineSim: The cosine similarity between each value in the vector column and queryVector.

    • score: The cts.score.

    • hybridScore: The hybrid score.

  • op.orderBy() with op.desc() on op.col() orders the resulting rows from highest to lowest value in the hybridScore column.
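As a point of comparison for op.vec.vectorScore(), Reciprocal Rank Fusion combines ranked result lists using only rank positions, ignoring the underlying scores. Here is a minimal sketch in plain JavaScript (independent of MarkLogic APIs; the constant k = 60 is the conventional RRF choice, not a MarkLogic parameter):

```javascript
// Reciprocal Rank Fusion: fuse ranked result lists into one score per id.
// Each input list is an array of document ids ordered best-first; a document's
// fused score is the sum of 1 / (k + rank) over every list it appears in.
function rrf(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  // Return [id, fusedScore] pairs sorted best-first
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

const keywordRanking = ['doc1', 'doc2', 'doc3']; // e.g. ordered by cts.score
const vectorRanking = ['doc3', 'doc1', 'doc2'];  // e.g. ordered by cosineSim
console.log(rrf([keywordRanking, vectorRanking]));
```

Unlike op.vec.vectorScore(), which adjusts one score by a similarity value, RRF needs both result sets fully ranked before fusing them.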

Here is the XQuery version:

XQuery

xquery version "1.0-ml"; 

import module namespace op = "http://marklogic.com/optic"
     at "/MarkLogic/optic.xqy";
import module namespace opvec = "http://marklogic.com/optic/expression/vec"
     at "/MarkLogic/optic/optic-vec.xqy";

let $document-query := cts:word-query("trojan")
let $query-vector := vec:vector((
    0.435647279024124, 0.167360082268715, 0.577132880687714, 0.0405717119574547, -0.345730692148209,
        ...
    0.413799345493317, 0.339704662561417, -0.259793192148209, 0.118780590593815, 0.649678707122803
))

let $documents := op:from-search(
    $document-query,
    ("fragmentId","score"),
    "docs_view",
    map:map()
      => map:with("scoreMethod", "bm25")
      => map:with("bm25LengthWeight", 0.5)
  )
  => op:join-doc("doc", op:fragment-id-col("fragmentId"))
let $view := op:from-view(
    "examples",
    "article",
    (),
    op:fragment-id-col("$$view-fragment")
  )
return $documents
  => op:join-inner(
    $view, 
    op:on(
      op:fragment-id-col("fragmentId"), 
      op:fragment-id-col("$$view-fragment")
    )
  )
  => op:order-by(op:desc(op:col("score")))
  => op:limit(30)
  => op:bind(op:as("queryVector", $query-vector))
  => op:bind(op:as("cosineSim",
    opvec:cosine-similarity(op:col("textEmbedding"), op:col("queryVector"))
  ))
  => op:bind(
    op:as("hybridScore", 
      opvec:vector-score(op:col("score"), op:col("cosineSim"), 0.1)
    )
  )
  => op:select((
    op:col("doc"),
    op:col("cosineSim"),
    op:col("score"),
    op:col("hybridScore")
  ))
  => op:order-by(op:desc(op:col("hybridScore")))
  => op:limit(20)
  => op:result()

Be sure to explore the other new vector operators.