Building Semantic Queries
Semantic data in the form of triples that describe the edges of graphs is a powerful data model supported by MarkLogic that you will want to explore. See the Semantic Graph Developer's Guide for more detailed information than we provide here.
Briefly, triples allow you to encode interconnected “facts” in a subject-predicate-object form to express a domain of knowledge from which you can infer other “facts.” For example, from these two triples,
    John (subject), Lives In (predicate), London (object)
and
    London (subject), Is In (predicate), England (object),
we can infer the “fact” that John Lives In England without having that “fact” explicitly stored anywhere in our database.
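As a rough illustration of that inference, here is a minimal plain-JavaScript sketch. It does not use MarkLogic or any inference engine; the in-memory triples array and the inferLivesIn() helper are hypothetical names made up for this example, and the rule it applies (Lives In followed by Is In) is an assumption chosen to mirror the two triples above.

```javascript
// Hypothetical in-memory triples mirroring the two facts above.
const triples = [
  { subject: 'John', predicate: 'Lives In', object: 'London' },
  { subject: 'London', predicate: 'Is In', object: 'England' },
];

// Assumed rule for this sketch: if X "Lives In" Y and Y "Is In" Z,
// then X "Lives In" Z.
function inferLivesIn(facts) {
  const inferred = [];
  for (const a of facts) {
    if (a.predicate !== 'Lives In') continue;
    for (const b of facts) {
      if (b.predicate === 'Is In' && b.subject === a.object) {
        inferred.push({ subject: a.subject, predicate: 'Lives In', object: b.object });
      }
    }
  }
  return inferred;
}

// The derived fact is computed on demand; it is never added to `triples`.
const inferredFacts = inferLivesIn(triples);
console.log(inferredFacts);
```

The point of the sketch is only that the derived fact is computed from the stored ones rather than stored itself.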
We can also use triples to standardize our data, drawing on publicly available vocabularies such as naming conventions or official abbreviations.
Triples are normally queried with a language called SPARQL.
Optic provides two Data Accessor Functions for triples queries:
fromTriples() directly accesses the triples so that you do not need SPARQL to make simple triple pattern matches.
fromSPARQL() lets you use SPARQL to write the more complex and expressive graph queries needed for searching nested taxonomy structures.
We want to find all our employees in the Northeast. Unfortunately, we only have state data in our employee documents. Fortunately, we do have documents containing semantic triples:
    ex:CT rdfs:isDefinedBy "CT" ;
          a ex:State ;
          skos:broader ex:Northeast ;
          skos:prefLabel "Connecticut" .
Each of these 4 triples has its own IRI (Internationalized Resource Identifier). They use predefined vocabularies such as RDFS and SKOS shown here as well as others like RDF.
One of these triple facts is that a given state has an official, two-letter abbreviation—which our employee documents use to identify employee states. Another fact is that a given state is in a particular region—such as our needed region, Northeast. This means that we have the data we need to relate our employees’ state data from one set of documents with their regions from another set of documents.
So, with this triples data, we can find all our employees in the Northeast in two steps:
The first step is to produce a row sequence of official codes for states in the Northeast.
An Optic query like this one returns up to 100 rows for triples matching the given patterns:
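    // Load the Optic API (skip this line if op is already defined).
    const op = require('/MarkLogic/optic');

    // Prefixers build full IRIs from a base IRI plus a local name.
    const ex = op.prefixer('https://example.com/semantics/geo#');
    const rdfs = op.prefixer('http://www.w3.org/2000/01/rdf-schema#');
    const skos = op.prefixer('http://www.w3.org/2004/02/skos/core#');

    const state = op.col('state');

    op.fromTriples([
        // States whose broader concept is the Northeast region
        op.pattern(state, skos('broader'), ex('Northeast')),
        // The official code of each of those states
        op.pattern(state, rdfs('isDefinedBy'), op.col('code'))
      ])
      .offsetLimit(0, 100)
      .result();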
We used this query to find all states whose broader definition is Northeast, then, for each found state, to find its official state code:
We defined three prefixers:
    ex is the base IRI for our triples.
    rdfs is the base IRI for the RDFS vocabulary.
    skos is the base IRI for the SKOS vocabulary.
We defined two columns with col(). They will both appear in our result:
    col() identifies the column in its argument.
    Before the query, we defined state.
    When it was needed within a query function parameter, we defined code.
The Data Accessor Function fromTriples() returns a row for each match of the patterns specified in the pattern() functions:
    The first pattern() function finds triples with any subject if broader is the predicate and Northeast is the object.
    The second pattern() function finds triples with any object if its subject matches one of the states found by the first pattern() and its predicate is isDefinedBy.
The Operator Function offsetLimit() restricts the results returned. The first parameter specifies the number of results to skip; the second, the number of results to return. So, (0, 100) returns the first 100 results.
The Executor Function result() executes the query and returns the results as a row sequence.
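To make the walkthrough above more concrete, the following plain-JavaScript sketch imitates the same steps over a tiny in-memory array. It is not the Optic API: makePrefixer(), offsetLimit(), and the sample triples (including the ex:TX and ex:South entries) are hypothetical stand-ins invented for illustration, and strings are used where a real prefixer would likely produce IRI values.

```javascript
// Stand-in for a prefixer: returns a function that appends a local name
// to a base IRI. Plain strings are an assumption made for this sketch.
const makePrefixer = (base) => (name) => base + name;

const ex = makePrefixer('https://example.com/semantics/geo#');
const rdfs = makePrefixer('http://www.w3.org/2000/01/rdf-schema#');
const skos = makePrefixer('http://www.w3.org/2004/02/skos/core#');

// Hypothetical sample data, each entry being [subject, predicate, object].
const triples = [
  [ex('CT'), skos('broader'), ex('Northeast')],
  [ex('CT'), rdfs('isDefinedBy'), 'CT'],
  [ex('MA'), skos('broader'), ex('Northeast')],
  [ex('MA'), rdfs('isDefinedBy'), 'MA'],
  [ex('TX'), skos('broader'), ex('South')],
  [ex('TX'), rdfs('isDefinedBy'), 'TX'],
];

// Pattern 1: any subject whose predicate is skos:broader
// and whose object is ex:Northeast.
const states = triples
  .filter(([, p, o]) => p === skos('broader') && o === ex('Northeast'))
  .map(([s]) => s);

// Pattern 2: for each state found above, any object whose predicate
// is rdfs:isDefinedBy. Sharing `state` is what joins the two patterns.
const rows = [];
for (const state of states) {
  for (const [s, p, o] of triples) {
    if (s === state && p === rdfs('isDefinedBy')) {
      rows.push({ state, code: o });
    }
  }
}

// Stand-in for offsetLimit(offset, limit): skip `offset` rows,
// then keep at most `limit` rows.
const offsetLimit = (allRows, offset, limit) => allRows.slice(offset, offset + limit);

const page = offsetLimit(rows, 0, 100);
console.log(page);
```

With this sample data, only the CT and MA rows survive the first pattern, so the TX code never appears in the output.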
Here are rows 1-4 of the 11-row x 2-column result:
    { "state": "https://example.com/semantics/geo#CT", "code": "CT" }
    { "state": "https://example.com/semantics/geo#DE", "code": "DE" }
    { "state": "https://example.com/semantics/geo#MA", "code": "MA" }
    { "state": "https://example.com/semantics/geo#MD", "code": "MD" }
There is one row for each of the 11 Northeastern US states:
    Its state column contains the IRI for the triples graph node.
    Its code column contains the official state code.
You could suppress the state column with the select('code') operator function.
The rows are in an unspecified order, which could change between executions. You can specify row order with the orderBy() operator function.
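The effect of those two operators can be pictured with ordinary array operations. The sketch below is plain JavaScript rather than Optic: it takes the four sample rows shown earlier, deliberately shuffled to stand in for an unspecified order, keeps only the code column, and then sorts by it. The variable names are hypothetical.

```javascript
// The four sample rows from the result, shuffled on purpose to mimic
// an unspecified row order.
const rows = [
  { state: 'https://example.com/semantics/geo#MA', code: 'MA' },
  { state: 'https://example.com/semantics/geo#CT', code: 'CT' },
  { state: 'https://example.com/semantics/geo#MD', code: 'MD' },
  { state: 'https://example.com/semantics/geo#DE', code: 'DE' },
];

// Projection, comparable in spirit to selecting only the code column.
const codesOnly = rows.map(({ code }) => ({ code }));

// Explicit ordering, comparable in spirit to ordering by the code column.
// A copy is sorted so the projected rows keep their original order.
const ordered = [...codesOnly].sort((a, b) => {
  if (a.code < b.code) return -1;
  if (a.code > b.code) return 1;
  return 0;
});

console.log(ordered.map((row) => row.code).join(', '));
// prints "CT, DE, MA, MD"
```

Sorting explicitly is what makes the output stable from one run to the next.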
We could have used this fromSPARQL() query to find the same state codes (it returns code and region columns rather than state and code):
    op.fromSPARQL(`
        PREFIX ex: <https://example.com/semantics/geo#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT ?code ?region
        FROM <https://example.com/semantics/geo>
        WHERE {
          ?state skos:broader* ?region .
          ?state rdfs:isDefinedBy ?code .
          FILTER (?region = ex:Northeast)
        }
      `)
      .offsetLimit(0, 100)
      .result();
We would have used it instead of fromTriples() if the triples we were interested in were nested in a child structure, because SPARQL has the property path operator *, which matches zero or more repetitions of a predicate. Used here on skos:broader, it enables the query to search all descendants, not just children.
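The difference between children and descendants can be sketched in plain JavaScript. The taxonomy below is hypothetical: ex:NewEngland is an invented intermediate region that does not appear in the sample data, and the sketch assumes each node has at most one broader parent. The reaches() helper follows zero or more broader links, which is roughly what the path skos:broader* expresses.

```javascript
// Hypothetical nested taxonomy, child -> parent via skos:broader.
// ex:NewEngland is invented for this sketch.
const broader = new Map([
  ['ex:CT', 'ex:NewEngland'],
  ['ex:NewEngland', 'ex:Northeast'],
  ['ex:NY', 'ex:Northeast'],
  ['ex:TX', 'ex:South'],
]);

// True if `node` reaches `target` by following zero or more broader links.
// The `seen` set guards against cycles in malformed data.
function reaches(node, target) {
  const seen = new Set();
  let current = node;
  while (current !== undefined && !seen.has(current)) {
    if (current === target) return true;
    seen.add(current);
    current = broader.get(current);
  }
  return false;
}

// One hop only, like a plain skos:broader pattern.
const children = [...broader.keys()].filter(
  (node) => broader.get(node) === 'ex:Northeast'
);

// Any number of hops, like skos:broader*.
const descendants = [...broader.keys()].filter((node) =>
  reaches(node, 'ex:Northeast')
);

console.log(children);    // [ 'ex:NewEngland', 'ex:NY' ]
console.log(descendants); // [ 'ex:CT', 'ex:NewEngland', 'ex:NY' ]
```

In this made-up taxonomy, ex:CT is found only by the multi-hop search, because it sits two levels below ex:Northeast.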
Either way, we have completed the first step toward finding all our employees in the Northeast. Our second step is to join this triples data with our existing employee data in a multi-model query. The next section describes two ways to accomplish this step.