Getting Started with Optic

Building Semantic Queries

Semantic data, in the form of triples that describe the edges of a graph, is a powerful data model supported by MarkLogic. See the Semantic Graph Developer's Guide for more detail than we provide here.

Briefly, triples allow you to encode interconnected “facts” in a subject-predicate-object form to express a domain of knowledge from which you can infer other “facts.” For example, from these two triples,

          John (subject) Lives In (predicate) London (object) and

          London (subject) Is In (predicate) England (object),

we can infer the “fact” that John Lives In England without having that “fact” explicitly stored anywhere in our database.
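To make the inference step concrete, here is a minimal sketch in plain JavaScript. It is only an illustration of the reasoning, not how MarkLogic performs inference; the array layout and the inferLivesIn helper are assumptions made for this example.

```javascript
// Two stored facts, each written as a [subject, predicate, object] triple.
const facts = [
  ['John', 'Lives In', 'London'],
  ['London', 'Is In', 'England'],
];

// Hypothetical rule: if X "Lives In" Y and Y "Is In" Z, then X "Lives In" Z.
function inferLivesIn(triples) {
  const inferred = [];
  for (const [person, p1, place] of triples) {
    if (p1 !== 'Lives In') continue;
    for (const [inner, p2, outer] of triples) {
      if (p2 === 'Is In' && inner === place) {
        inferred.push([person, 'Lives In', outer]);
      }
    }
  }
  return inferred;
}

// The derived fact is computed on demand; it is never added to `facts`.
const inferredFacts = inferLivesIn(facts);
console.log(inferredFacts);
```

The stored data still holds only two triples; the third fact exists only as the output of the rule.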

We can also use triples to standardize our data, drawing on publicly available vocabularies such as naming conventions or official abbreviations.

Triples are normally queried with a language called SPARQL.

Optic provides two Data Accessor Functions for triples queries:

  • fromTriples() directly accesses the triples so that you do not need SPARQL to make simple triple pattern matches.

  • fromSPARQL() lets you use SPARQL to write the more complex and expressive graph queries needed for searching nested taxonomy structures.

We want to find all our employees in the Northeast. Unfortunately, we only have state data in our employee documents. Fortunately, we do have documents containing semantic triples:

ex:CT           rdfs:isDefinedBy "CT" ;
                a                ex:State ;
                skos:broader     ex:Northeast ;
                skos:prefLabel   "Connecticut" .

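These four triples share the subject ex:CT. In these triples, the subject, the predicates, and the non-literal objects are each identified by an IRI (Internationalized Resource Identifier). The predicates come from predefined vocabularies such as RDFS and SKOS, shown here, as well as others like RDF (the a shorthand stands for rdf:type).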

One of these triple facts is that a given state has an official, two-letter abbreviation—which our employee documents use to identify employee states. Another fact is that a given state is in a particular region—such as our needed region, Northeast. This means that we have the data we need to relate our employees’ state data from one set of documents with their regions from another set of documents.

So, with this triples data, we can find all our employees in the Northeast in two steps:

The first step is to produce a row sequence of official codes for states in the Northeast.

An Optic query like this one returns up to 100 rows for triples matching the given patterns:

// Load the Optic API if it is not already in scope.
const op = require('/MarkLogic/optic');

const ex    = op.prefixer('https://example.com/semantics/geo#');
const rdfs  = op.prefixer('http://www.w3.org/2000/01/rdf-schema#');
const skos  = op.prefixer('http://www.w3.org/2004/02/skos/core#');

const state = op.col('state');

op.fromTriples([
   op.pattern(state, skos('broader'), ex('Northeast')),
   op.pattern(state, rdfs('isDefinedBy'), op.col('code'))
])
.offsetLimit(0, 100)
.result();

This query finds all states whose broader concept is Northeast and then, for each of those states, finds its official state code:

  • We defined three prefixers. Each one builds a full IRI by appending a name to its base IRI:

    • ex uses the base IRI for our triples.

    • rdfs uses the base IRI for the RDFS vocabulary.

    • skos uses the base IRI for the SKOS vocabulary.

  • We defined two columns with col(). Both appear in our result:

    • col() creates a reference to the column named in its argument.

    • We defined state before the query because both patterns use it.

    • We defined code inline, at the one place in the query where it is needed.

  • The Data Accessor Function fromTriples() returns a row for each combination of triples that matches all the patterns given in the pattern() functions:

    • The first pattern() function matches triples with any subject, provided that the predicate is broader and the object is Northeast.

    • The second pattern() function matches triples with any object, provided that the subject is one of the states found by the first pattern() and the predicate is isDefinedBy.

  • The Operator Function offsetLimit() restricts results returned. The first parameter specifies the number of results to skip; the second, the number of results to return. So, (0, 100) returns the first 100 results.

  • The Executor Function result() executes the query and returns the results as a row sequence.
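The matching described in the list above can be modeled in plain JavaScript. This is a rough sketch of the idea only, not the Optic implementation: the prefixer helper is a plain-string stand-in for op.prefixer(), northeastCodes is a hypothetical name, and the TX triples are assumed sample data added to show a non-matching state.

```javascript
// Plain-string stand-in for op.prefixer(): appends a name to a base IRI.
const prefixer = (base) => (name) => base + name;
const ex   = prefixer('https://example.com/semantics/geo#');
const rdfs = prefixer('http://www.w3.org/2000/01/rdf-schema#');
const skos = prefixer('http://www.w3.org/2004/02/skos/core#');

// In-memory triples. The TX entries are assumed data for contrast.
const triples = [
  { s: ex('CT'), p: skos('broader'),     o: ex('Northeast') },
  { s: ex('CT'), p: rdfs('isDefinedBy'), o: 'CT' },
  { s: ex('MA'), p: skos('broader'),     o: ex('Northeast') },
  { s: ex('MA'), p: rdfs('isDefinedBy'), o: 'MA' },
  { s: ex('TX'), p: skos('broader'),     o: ex('South') },
  { s: ex('TX'), p: rdfs('isDefinedBy'), o: 'TX' },
];

// Pattern 1 binds `state`; pattern 2 reuses that binding and adds `code`.
function northeastCodes(data, offset, limit) {
  const rows = [];
  for (const t1 of data) {
    if (t1.p !== skos('broader') || t1.o !== ex('Northeast')) continue;
    for (const t2 of data) {
      if (t2.s === t1.s && t2.p === rdfs('isDefinedBy')) {
        rows.push({ state: t1.s, code: t2.o });
      }
    }
  }
  // Same idea as offsetLimit(offset, limit): skip some rows, then cap the count.
  return rows.slice(offset, offset + limit);
}

const rows = northeastCodes(triples, 0, 100);
console.log(rows);
```

Because both patterns share the state column, a row is produced only when one subject satisfies both patterns, which is why the TX triples contribute nothing.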

Here are rows 1-4 of the 11-row x 2-column result:

{
  "state": "https://example.com/semantics/geo#CT", 
  "code": "CT"
}
{
  "state": "https://example.com/semantics/geo#DE", 
  "code": "DE"
}
{
  "state": "https://example.com/semantics/geo#MA", 
  "code": "MA"
}
{
  "state": "https://example.com/semantics/geo#MD", 
  "code": "MD"
}
  • There is one row for each of the 11 Northeastern US states:

    • Its state column contains the IRI for the triples graph node.

    • Its code column contains the official state code.

  • You could suppress the state column with the select('code') operator function.

  • The rows are in an unspecified order, which could change between executions. You can specify row order with the orderBy() operator function.
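As a sketch of the last two points, the query could be extended with select() and orderBy(). The wrapper function sortedNortheastCodes is a hypothetical name, and passing the Optic module in as the op parameter is a choice made for this example so that the snippet does not depend on how the module is loaded.

```javascript
// `op` is the Optic module, supplied by the caller.
function sortedNortheastCodes(op) {
  const ex   = op.prefixer('https://example.com/semantics/geo#');
  const rdfs = op.prefixer('http://www.w3.org/2000/01/rdf-schema#');
  const skos = op.prefixer('http://www.w3.org/2004/02/skos/core#');
  const state = op.col('state');

  return op.fromTriples([
      op.pattern(state, skos('broader'), ex('Northeast')),
      op.pattern(state, rdfs('isDefinedBy'), op.col('code'))
    ])
    .select(['code'])          // keep only the code column
    .orderBy(op.asc('code'))   // make the row order stable: ascending by code
    .offsetLimit(0, 100);      // page only after sorting
}
```

In MarkLogic, a call such as sortedNortheastCodes(require('/MarkLogic/optic')).result() would then execute the plan. Sorting before offsetLimit() matters: paging an unordered result could return a different set of rows on each run.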

We could have used this fromSPARQL() query to get the same state codes. Note that it selects the code and region columns rather than state and code:

op.fromSPARQL(`
    PREFIX ex: <https://example.com/semantics/geo#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    SELECT ?code ?region FROM <https://example.com/semantics/geo> WHERE {
      ?state skos:broader* ?region .
      ?state rdfs:isDefinedBy ?code .
      FILTER (?region = ex:Northeast)
    }
`)
  .offsetLimit(0, 100)
  .result();

We would have used it instead of fromTriples() if the triples we were interested in were nested in a hierarchy, because SPARQL property paths provide the * operator, which matches zero or more repetitions of a predicate. Used here on skos:broader, it lets the query search all descendants of ex:Northeast, not just its direct children.
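The effect of * on skos:broader can be pictured with a small plain-JavaScript traversal. This is an illustrative sketch, not how MarkLogic evaluates property paths; the narrowerOrSelf helper and the city-level Hartford edge are assumptions added to show a second level of nesting.

```javascript
// skos:broader edges as [narrower, broader] pairs.
// The Hartford and TX entries are assumed sample data.
const broader = [
  ['Hartford', 'CT'],
  ['CT', 'Northeast'],
  ['MA', 'Northeast'],
  ['TX', 'South'],
];

// Everything that reaches `target` through zero or more broader hops,
// mirroring the pattern `?x skos:broader* target`.
function narrowerOrSelf(target, edges) {
  const found = new Set([target]);   // zero hops: the target itself
  const queue = [target];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const [child, parent] of edges) {
      if (parent === current && !found.has(child)) {
        found.add(child);
        queue.push(child);
      }
    }
  }
  return [...found];
}

const reachable = narrowerOrSelf('Northeast', broader);
console.log(reachable);
```

A single-hop match would stop at CT and MA. The traversal also reaches Hartford, two levels down, and includes Northeast itself because * allows zero hops.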

Either way, we have completed the first step toward finding all our employees in the Northeast. Our second step is to join this triples data with our existing employee data in a multi-model query. The next section describes two ways to accomplish this step.