XML Data Modeling Suggestions

Author: Dave Cassel  |  Category: Software Development

Before I can talk about data modeling, I need to quickly address something else. At MarkLogic, we often talk about how we can take data as-is and do great stuff with it. If we can do great stuff with data as-is, why do we need to do data modeling? There are a couple of reasons. First, sometimes data doesn’t come to us as XML, but by making it XML we make it more accessible than if we leave it as binary or text. Second, even if it is XML, some adjustments make it easier to use certain MarkLogic features. So in short, we can work with data as-is, but we can do even more by making some adjustments.

There are few hard and fast rules to pass along, but these suggestions should be helpful in many situations. One thing I can be sure of is that there will be a variety of opinions on this topic. I invite additional data modeling suggestions as well as comments on what I offer here.

Use Meaningful XML

The one rule agreed on by most people who work with XML a lot is to use meaningful element names. Here’s a counterexample:

<!-- ugly XML, don't do this -->
<element name="author" value="David Cassel"/>

The element name “element” tells us nothing about the contents of the element. We can see it from the name attribute, but that requires more steps, whether to read it with your eyes or with an XPath statement. A more expressive structure is

<author>David Cassel</author>
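
To make the "more steps" point concrete, here is a minimal sketch in Python using the standard library's ElementTree. The wrapping doc element is my own addition for illustration; it is not part of either snippet above.

```python
import xml.etree.ElementTree as ET

# Both snippets wrapped in a hypothetical <doc> root so they can be queried.
generic = ET.fromstring('<doc><element name="author" value="David Cassel"/></doc>')
meaningful = ET.fromstring('<doc><author>David Cassel</author></doc>')

# Generic structure: filter on the name attribute, then read the value attribute.
generic_author = generic.find("element[@name='author']").get("value")

# Meaningful structure: the element name alone is the path.
meaningful_author = meaningful.findtext("author")

print(generic_author)
print(meaningful_author)
```

Both lookups return the same string, but the second says what it is looking for without a predicate.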

Use Namespaces

One aspect of XML that newcomers to XQuery often find confusing is namespaces.

1   xquery version "1.0-ml";
2   declare namespace blog = "http://davidcassel.net/blog";
3   declare namespace irs = "http://irs.gov";
4
5   <doc xmlns="http://davidcassel.net/blog">
6     <meta>
7       <irs:author xmlns:irs="http://irs.gov">
8         <irs:name>Smith</irs:name>
9       </irs:author>
10    </meta>
11  </doc>/blog:meta/irs:author/irs:name

Looking at the XML above, the top level element says “doc”, which is referred to as the local name, and it has an XML namespace declaration after it. The fully qualified name of an element is called a QName and it consists of the namespace and the local name together. The xmlns declaration used with doc is called a default namespace, which will be the namespace for the doc element and any elements contained within it, except for elements that specify some other namespace. The irs:author element includes a similar declaration, but this one includes the prefix “irs”. This namespace only applies where the “irs” prefix is used. Here, the doc and meta elements are in the default namespace, while author and name specify a different namespace.

Lines 2 and 3 in the code declare namespace prefixes, which give us a concise synonym for a namespace. In the last line, the code selects the irs:name from the block of XML. Note that we need the blog prefix to select the meta element, because meta inherits the default namespace declared on doc. Newcomers to XQuery often miss namespaces when specifying elements, leading to unexpected empty results.
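
The same empty-result surprise shows up outside XQuery. As a rough illustration, this Python sketch parses the document from the listing with the standard library's ElementTree; the ns dictionary plays the role of the prefix declarations, and the variable names are mine.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<doc xmlns="http://davidcassel.net/blog">
  <meta>
    <irs:author xmlns:irs="http://irs.gov">
      <irs:name>Smith</irs:name>
    </irs:author>
  </meta>
</doc>""")

# Prefix-to-namespace bindings, comparable to "declare namespace" in XQuery.
ns = {"blog": "http://davidcassel.net/blog", "irs": "http://irs.gov"}

# ElementTree writes a QName as {namespace}local-name.
print(doc.tag)

# Forgetting the namespace finds nothing, because meta is in the default namespace.
missing = doc.find("meta")

# Naming the namespace at every step finds the element.
name = doc.find("blog:meta/irs:author/irs:name", ns)

print(missing)
print(name.text)
```

The unprefixed lookup quietly comes back empty instead of raising an error, which is why the mistake is easy to overlook.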

I have said namespaces are confusing. Should you use them? Yes. Namespaces give you a way to differentiate names that are otherwise very similar. Sometimes similar names occur together (a “name” element under a person element and under a company element); sometimes you will find them when you bring together different data sets into the same database. Using different namespaces gives you a way to tell same-named elements apart without having to add prefixes to the names. Namespaces often confuse people new to programming against XML, but you get used to them quickly.

Using a namespace based on your company name helps ensure that collisions won’t happen. I generally use namespaces based on this structure:

http://{organization URL}/{project}

For example, http://davidcassel.net/blog, declared on line 2 of the listing above, serves as the main namespace URI for that project. For more variations, add more levels after the project.

Elements and Attributes

A common question is when to represent data in attributes and when to represent it in elements. There is no firm rule, but there are some factors you should use in guiding the choice.

First, XML does not allow an element to have two attributes with the same name. Suppose you want to record the author of a document. You might try

<document author="David Cassel"/>

and that would work great – as long as there is only one author. If two people write a book together, you will have to use elements, as the document element cannot have two author attributes.

<document>
  <author>David Cassel</author>
  <author>Gary Katz</author>
</document>
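
If you want to see the restriction enforced, any conforming XML parser will do. Here is a small sketch with Python's standard ElementTree parser; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# Two attributes with the same name are not well-formed XML.
try:
    ET.fromstring('<document author="David Cassel" author="Gary Katz"/>')
    duplicate_rejected = False
except ET.ParseError as err:
    duplicate_rejected = True
    print("rejected:", err)

# Repeated child elements are fine.
doc = ET.fromstring(
    "<document><author>David Cassel</author><author>Gary Katz</author></document>"
)
authors = [a.text for a in doc.findall("author")]
print(authors)
```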

Second, MarkLogic measures the distance between two words based on the text content of elements, but does not include attribute text.

<document>
  <author role="lead">David Cassel</author>
  <author role="contributor">Gary Katz</author>
</document>

In this case, the word “David” is 2 words away from the word “Gary”. The word “contributor”, even though it lies between “David” and “Gary”, does not increase the distance. This is relevant when running near-queries in MarkLogic.
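
Here is a simplified model of that counting in Python. It is only a sketch: splitting on whitespace is my stand-in for MarkLogic's tokenizer, and word_distance is a hypothetical helper, not a MarkLogic function.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<document>
  <author role="lead">David Cassel</author>
  <author role="contributor">Gary Katz</author>
</document>""")

# itertext() yields element text only; attribute values are never included.
words = " ".join(doc.itertext()).split()

def word_distance(first, second):
    """Difference in word positions, counting element text only."""
    return abs(words.index(second) - words.index(first))

print(words)
print(word_distance("David", "Gary"))
```

The role values never enter the word list, so they cannot push "David" and "Gary" further apart.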

Third, while you can explicitly target attributes in a search, attribute text is not visible to simple word searches. That means that this query will not find the document above:

cts:search(fn:doc(), "contributor")

But this query will find it:

cts:search(fn:doc(), "david")

Based on these three ideas, I tend to use elements for text data and attributes for values that describe the data.

Denormalize

One of the principles taught with respect to relational database systems is to normalize your data – that is, move repeated data to another table and set up a link to it. The intention is to have data in just one place, so that there is only one location to update if the data changes. In database design, a developer will select some place along the normalization spectrum based on practical considerations.

With a document-oriented database, the value of having all the data relevant to that document in one place goes up, since relationships between documents are de-emphasized. Having all of a document’s data in one place makes it available for search all at once. In practice, some normalization may be done, but less than in relational systems.

As an example, consider a blogging system. A blog post document will naturally include a post title, content, and category. Suppose the system allows users to add tags – typically one- or two-word labels that describe content. In a relational system, there would likely be a table that lists the tags, along with a unique primary key, and the post table would include a foreign key reference to that table. The post representation in XML, however, would likely include the text of the tags themselves. This representation means that a search for a word will find a blog post if the word is in the title, the text or any of the tags, without the need for a join. MarkLogic performs this very efficiently using the Universal Index.
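
A rough sketch of that denormalization in Python follows. The table contents, the element names and both helper functions are made up for illustration, and the word match is a naive stand-in for a real search index.

```python
import xml.etree.ElementTree as ET

# Hypothetical normalized rows, as a relational schema might store them.
tags = {1: "xml", 2: "data modeling"}
post = {
    "title": "XML Data Modeling Suggestions",
    "content": "Use meaningful element names.",
    "tag_ids": [1, 2],
}

def build_post_document(post, tags):
    """Copy the tag text into the post document instead of storing keys."""
    root = ET.Element("post")
    ET.SubElement(root, "title").text = post["title"]
    ET.SubElement(root, "content").text = post["content"]
    tags_el = ET.SubElement(root, "tags")
    for tag_id in post["tag_ids"]:
        ET.SubElement(tags_el, "tag").text = tags[tag_id]
    return root

def contains_word(doc, word):
    """Naive word match over all text in one document; no join required."""
    return word.lower() in " ".join(doc.itertext()).lower().split()

doc = build_post_document(post, tags)
print(ET.tostring(doc, encoding="unicode"))
print(contains_word(doc, "modeling"))
```

Because the tag text lives inside the post, one pass over one document answers the question.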

Why is it helpful to have the data stored together? Consider a database cluster. In a document-oriented database, a node in the cluster can likely determine which of its documents match a given query. In the relational world, most sharding techniques would require joins across nodes to find which rows are relevant.

Work with the Indexes

An understanding of MarkLogic’s indexes will help you set up your data such that the indexes will be able to use it. There are a couple ways to do this.

First, structure your values to match XQuery’s type formats. XQuery has a particular format for date and time representations. If you make your data use that format, you can build a range index on it. Then you can run very efficient inequality queries, such as “give me all documents modified after January 1, 2010”. The (small) effort to convert dates and date-times to the XQuery format is almost always worthwhile.
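
As an example of the conversion, this Python sketch turns a date written out in English into the YYYY-MM-DD form that xs:date uses. The input format string and the sample dates are assumptions for illustration, and %B relies on an English locale.

```python
from datetime import datetime

def to_xs_date(text, source_format="%B %d, %Y"):
    """Convert a date string to the xs:date lexical form (YYYY-MM-DD)."""
    return datetime.strptime(text, source_format).date().isoformat()

modified = to_xs_date("March 15, 2011")
cutoff = to_xs_date("January 1, 2010")

print(modified)
print(cutoff)

# Once both values are in this form, "modified after the cutoff" is a plain comparison.
print(modified > cutoff)
```

The conversion is a one-time cost at load time; the payoff is that the values can be compared and indexed as dates rather than treated as free text.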

Second, for elements where you care about values rather than words, it’s better to have multiple elements than to put multiple values into one element. As an example, consider the HTML class attribute. Sometimes you want to give an element more than one class. Multiple class attributes are not allowed, so HTML allows you to put multiple classes in one class attribute, separated by spaces:

<div class="class1 class2"/>

You can do that with your XML data, but doing so affects how you can use an index you set up on the attribute. Instead of simply looking for documents that have a class attribute of “class1”, you would need to set up an attribute word lexicon and target that in your searches. For XML data, I prefer splitting that value into separate elements or attributes:

<tags>
  <tag>red</tag>
  <tag>blue</tag>
</tags>

With this structure, “red” and “blue” are each distinct values. If I build a range index on tag, then MarkLogic can search for each of the values very efficiently.
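
If your source data arrives with space-separated values, the split is easy to do at load time. Here is a minimal Python sketch; the item element, the colors attribute and the split_values helper are all hypothetical names.

```python
import xml.etree.ElementTree as ET

def split_values(element, attribute, container="tags", item="tag"):
    """Replace a space-separated attribute with one child element per value."""
    values = element.attrib.pop(attribute, "").split()
    if values:
        wrapper = ET.SubElement(element, container)
        for value in values:
            ET.SubElement(wrapper, item).text = value
    return element

doc = split_values(ET.fromstring('<item colors="red blue"/>'), "colors")
print(ET.tostring(doc, encoding="unicode"))
```

Each value ends up in its own tag element, matching the structure shown above.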

A third consideration is how MarkLogic lets you build indexes. An element or attribute index is based on the QName of the element, or the QNames of the attribute and its parent element. Suppose you have publishers and authors in your database, and each has a “name” element. At present, MarkLogic does not provide a way to distinguish between publisher names and author names in the index, making it necessary to further distinguish between them at the data level, either through different namespaces or more distinct names.

Update: As of version 6, MarkLogic provides path range indexes, so now we can build an index on, for instance, /book:book/book:author/book:name. There are corresponding search functions to take advantage of these.
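
To illustrate why the path matters, this Python sketch compares a lookup keyed on the element name alone with one keyed on the path. It is an analogy for the two kinds of index, not MarkLogic code; I left out namespaces for brevity, and the publisher name is invented.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<book>
  <publisher><name>Example Press</name></publisher>
  <author><name>David Cassel</name></author>
</book>""")

# Keyed on the element name alone: publisher and author names are mixed together.
by_qname = [n.text for n in doc.iter("name")]

# Keyed on the path: only author names.
by_path = [n.text for n in doc.findall("author/name")]

print(by_qname)
print(by_path)
```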

Comments Welcome

Well, those are the tips I’ve figured out based on working at MarkLogic for the last few years. I’m happy to hear from people who have other suggestions or who disagree with those I’ve offered.

My thanks to Damon for reviewing an earlier version of this text. 
