Evolution of modeling relationships in MarkLogic

Author: Dave Cassel  |  Category: Software Development

MarkLogic, as a multi-model database, can store data both as documents and as triples. We model entities as documents. Over time, the way we’ve modeled relationships has changed.

In the Beginning

Prior to MarkLogic 7, relationships were modeled by including either a URI or some other identifier as an element or attribute. In most cases, we would also “denormalize” some information from the linked entity into the document that refers to it. In a document describing a Person entity, we might have an element like this:

<author uri="/book/1234.xml">
  <title>All the Birds in the Sky</title>
  <genre>Science Fiction</genre>
</author>

Here we have a relationship between a Person and a Book, where the person is the author of the book. The uri attribute provides the link. We include the title and genre elements because that enables us to search for the Person (the author) based on those pieces of information. If the application needs more information about the book, it can use the uri attribute to load the document.

The only downside to this is that the application needs to know not only that this relationship exists, but exactly where in the document it should look to find it.

For completeness, we can represent that same information using JSON:

{
  "author": {
    "uri": "/book/1234.xml",
    "title": "All the Birds in the Sky",
    "genre": "Science Fiction"
  }
}

These two representations are generally the same (one exception: values in attributes aren’t available for word searches).
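
To make the "application needs to know where to look" point concrete, here is a minimal Python sketch that pulls the link out of both representations. It is only an illustration: the <person> wrapper element and the function names are assumptions made up for this example, and a real application would go on to load the linked document from MarkLogic rather than just return the URI.

```python
import json
import xml.etree.ElementTree as ET

# The XML fragment from above, placed inside a hypothetical <person> wrapper.
PERSON_XML = """
<person>
  <author uri="/book/1234.xml">
    <title>All the Birds in the Sky</title>
    <genre>Science Fiction</genre>
  </author>
</person>
"""

# The JSON form from above, unchanged.
PERSON_JSON = """
{
  "author": {
    "uri": "/book/1234.xml",
    "title": "All the Birds in the Sky",
    "genre": "Science Fiction"
  }
}
"""


def book_link_from_xml(text):
    """Return (uri, title, genre); the location 'author' is hard-coded."""
    root = ET.fromstring(text.strip())
    author = root.find("author")
    return author.get("uri"), author.findtext("title"), author.findtext("genre")


def book_link_from_json(text):
    """Return (uri, title, genre); again the property 'author' is hard-coded."""
    author = json.loads(text)["author"]
    return author["uri"], author["title"], author["genre"]


xml_link = book_link_from_xml(PERSON_XML)
json_link = book_link_from_json(PERSON_JSON)
print(xml_link)
print(json_link)
```

Both functions only work because the location of the link is baked into the code. If the relationship moved elsewhere in the document, or a new kind of relationship were added, the application code would have to change too.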

Relationships with Triples

The semantic capabilities that MarkLogic has supported since version 7 give us a different way to represent relationships. Instead of the URI of the book document being in an attribute or property, we can represent the connection as a triple:

  <sem:triple>
    <sem:subject>/person/7bfbc09d-ef7f-4976-bf16-763b70bf3995.xml</sem:subject>
    <sem:predicate>http://example.org/wrote</sem:predicate>
    <sem:object>/book/1234.xml</sem:object>
  </sem:triple>

In this triple, the subject is the URI of the Person document, the object is the URI of the Book document, and the predicate identifies the relationship. The triple can either be in the Person document or the Book document (an unmanaged triple) or stored with other triples (a managed triple). There are a couple of benefits to this:

  1. Identifying all relationships among entities.
  2. Using an ontology to ask more interesting questions.

For point #1, remember that if a relationship is represented in an attribute, element, or property, the application needs to know where to look in the document to find it. With triples, however, a SPARQL query can identify all the author-book relationships very easily:

select ?author ?book
where {
  ?author <http://example.org/wrote> ?book
}

True, that requires knowing the relationship used to connect books and authors. But we can also ask, how is a Person related to other entities?

select ?relationship ?entity
where {
  { </person/7bfbc09d-ef7f-4976-bf16-763b70bf3995.xml> ?relationship ?entity }
union 
  { ?entity ?relationship </person/7bfbc09d-ef7f-4976-bf16-763b70bf3995.xml> }
}

All we need to know in this case is the entity (Person) URI that we want to inquire about. (The union keyword allows us to look for our entity URI in either the subject or object position of a triple.)

Not only does this query tell us what entities the Person entity is connected to, we’re given the predicates that connect them. One of the really cool things about using RDF triples to store relationships is that we can describe the predicates with triples, right in the database itself. For instance, the triple we showed above uses the <http://example.org/wrote> predicate. We can add a triple like this to the database:

  <sem:triple>
    <sem:subject>http://example.org/wrote</sem:subject>
    <sem:predicate>http://www.w3.org/2000/01/rdf-schema#comment</sem:predicate>
    <sem:object datatype="http://www.w3.org/2001/XMLSchema#string">Connects an author to a book written by the author</sem:object>
  </sem:triple>

With this in the database, we can show an end-user how the entities are connected, just by expanding our query a bit.
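
As a rough illustration of what such an expanded query does, here is a small Python sketch over an in-memory list of triples. This is not how MarkLogic evaluates SPARQL; the function names (match, describe, relationships) are assumptions for this example. It looks for the entity in either the subject or the object position, like the union above, and then looks up the comment attached to each predicate.

```python
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
WROTE = "http://example.org/wrote"
PERSON = "/person/7bfbc09d-ef7f-4976-bf16-763b70bf3995.xml"
BOOK = "/book/1234.xml"

# The two triples shown in this section, as (subject, predicate, object).
triples = [
    (PERSON, WROTE, BOOK),
    (WROTE, RDFS_COMMENT, "Connects an author to a book written by the author"),
]


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None behaves like a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def describe(predicate):
    """Return the comment stored for a predicate, or None if there is none."""
    hits = match(s=predicate, p=RDFS_COMMENT)
    return hits[0][2] if hits else None


def relationships(entity):
    """Union of two patterns: entity as subject, then entity as object."""
    found = [(p, o) for (_, p, o) in match(s=entity)]
    found += [(p, s) for (s, p, _) in match(o=entity)]
    return [(p, other, describe(p)) for (p, other) in found]


result = relationships(PERSON)
for row in result:
    print(row)
```

Calling relationships(BOOK) finds the same connection from the other side, because the entity is matched in the object position as well.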

In addition, if we have a good ontology in our database, we can ask more interesting questions. Given the data shown above, our application can look for authors of Science Fiction books. We might want to ask a broader question and find authors of all types of fiction. If our ontology recognizes that Science Fiction is a type of Fiction, we can use inference to include Science Fiction, Mystery, Historical Fiction, and other sub-types when we look for Fiction authors.
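
The idea behind that kind of inference can be sketched in a few lines of Python. Everything here is made up for illustration: the tiny genre hierarchy, the person URIs, and the helper is_a are assumptions, and a real application would let the database's inference support do this work against the ontology.

```python
# Hypothetical ontology fragment: each genre points to its parent type.
SUBCLASS_OF = {
    "Science Fiction": "Fiction",
    "Mystery": "Fiction",
    "Historical Fiction": "Fiction",
    "Biography": "Nonfiction",
}

# Hypothetical data: (author URI, genre of a book they wrote).
AUTHOR_GENRES = [
    ("/person/a.xml", "Science Fiction"),
    ("/person/b.xml", "Mystery"),
    ("/person/c.xml", "Biography"),
]


def is_a(genre, ancestor):
    """True if genre is ancestor or reaches it by following parent links."""
    seen = set()
    while genre is not None and genre not in seen:
        if genre == ancestor:
            return True
        seen.add(genre)  # guard against accidental cycles in the hierarchy
        genre = SUBCLASS_OF.get(genre)
    return False


fiction_authors = [a for a, g in AUTHOR_GENRES if is_a(g, "Fiction")]
print(fiction_authors)
```

No document ever says that these authors wrote "Fiction". The broader answer comes from combining the data with the hierarchy.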

Using SPARQL, we can also use property paths to follow links. Suppose our data set includes <reports-to> links, recording the manager for each employee in an HR database. We can find one person’s boss with a simple query:

select ?boss
where {
  <http://example.com/person/ann> <reports-to> ?boss
}

Suppose we want to find the chain of managers from Ann all the way to the CEO. With the original style of recording URIs in an element, attribute, or JSON property, we’d need to retrieve Ann’s manager, do another query to find that person’s manager, and so on. We’d need the same process with a relational structure. But with SPARQL, we can add a single character to the query above: “+”.

select ?boss 
where {
  <http://example.com/person/ann> <reports-to>+ ?boss
}

The “+” says to follow that predicate one or more times, so the query returns the managers from Ann all the way to the top. SPARQL provides several “property paths” that enable more interesting queries for exploring a data set.
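
For comparison, here is a sketch of the work that single character saves us, written in Python over an in-memory list of reports-to links. Apart from Ann, the people and the helper name one_or_more are assumptions for this example; it illustrates the "one or more steps" idea and is not a description of how a SPARQL engine is implemented.

```python
from collections import deque

# Hypothetical reports-to links as (employee, manager) pairs.
REPORTS_TO = [
    ("http://example.com/person/ann", "http://example.com/person/bob"),
    ("http://example.com/person/bob", "http://example.com/person/carol"),
    ("http://example.com/person/dan", "http://example.com/person/carol"),
]


def one_or_more(start, edges):
    """Nodes reachable from start by following one or more edges."""
    adjacent = {}
    for source, target in edges:
        adjacent.setdefault(source, []).append(target)

    reached = []
    seen = set()
    queue = deque(adjacent.get(start, []))  # begin one step away from start
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        reached.append(node)
        queue.extend(adjacent.get(node, []))
    return reached


bosses = one_or_more("http://example.com/person/ann", REPORTS_TO)
print(bosses)
```

Without the property path, the application would have to run this loop itself, issuing one query per level of the hierarchy.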

Template Driven Extraction

MarkLogic 9 adds Template Driven Extraction (TDE). Using TDE, we can pull information from documents directly into indexes, accessible from either SQL or SPARQL queries. We can do this without changing the original document structure itself. This approach lets us connect our XML and JSON documents to a full ontology without running transforms.
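
To give a feel for the idea, here is a loose Python analogy. It is emphatically not TDE syntax or a MarkLogic API: the "template" dictionary, its keys, and the extract_triples function are all assumptions invented for this sketch. The point it illustrates is that a declarative rule can produce triples from a document while the document itself stays untouched.

```python
import copy

PERSON_URI = "/person/7bfbc09d-ef7f-4976-bf16-763b70bf3995.xml"

# The JSON Person fragment from earlier in this post.
person_doc = {
    "author": {
        "uri": "/book/1234.xml",
        "title": "All the Birds in the Sky",
        "genre": "Science Fiction",
    }
}

# Hypothetical stand-in for a template: where to look in the document,
# which predicate to emit, and which property supplies the object.
template = {
    "context": "author",
    "predicate": "http://example.org/wrote",
    "object_property": "uri",
}


def extract_triples(doc_uri, doc, rule):
    """Derive (subject, predicate, object) triples without modifying doc."""
    node = doc.get(rule["context"])
    if node is None:
        return []  # the rule does not apply to this document
    return [(doc_uri, rule["predicate"], node[rule["object_property"]])]


before = copy.deepcopy(person_doc)
extracted = extract_triples(PERSON_URI, person_doc, template)
print(extracted)
print(person_doc == before)
```

The extracted triple is the same one we wrote by hand earlier, but here it is derived from the old-style document rather than stored in it.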

Wrapping Up

The triples model allows for discoverable connections among entities. The biggest challenge is putting this representational power to good use by selecting predicates that link into a relevant ontology. (I like to use http://dbpedia.org/fct/ to help find relevant IRIs.) That matters only when your application has an ontology to connect to, but either way, storing connections among entities, along with descriptions of those connections, allows data and its meaning to exist side by side. This advantage, plus the discoverability of relationships, is a clear improvement that will give your applications more powerful search.

 
