Building a MarkLogic Data Model

Author: Dave Cassel  |  Category: Software Development

This post is an excerpt from a book I’m working on: MarkLogic for Node.js Developers. This section is part of a chapter on Data Modeling, falling after a comparison to relational database modeling and a discussion of denormalization. The goal is to address the question of what should be a document in MarkLogic. The next section illustrates these points using Samplestack as a case study. Feedback welcome. 

Building a Data Model

There are several factors to consider when deciding what to include in a document.

Document databases can hold many types of documents.

In a document database, documents that represent different types of entities can sit side by side. A book database might have separate documents for books, authors, and publishers, each containing the bulk of information related to that type.

A document is the unit of search.

When doing a search against a MarkLogic database, the typical goal is to identify which documents match a particular query. Understanding what an application’s users will want to search for informs the types of documents you should have.

Include what will be searched for.

When considering a book database, a user might want to search for books by a particular publisher, but will probably not be looking for books published by a company based in a particular region, or founded in a certain year. Such information can be left out of the book document – it will not contribute to search, so there is no benefit to repeating it. Repeating the publisher’s name in book documents makes more sense. Data that is more helpful when searching for publishers will be included in publisher documents.
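
As a rough sketch of this idea, the snippet below builds a book document that repeats only the publisher's name. The field names (title, author, publisher, region, founded) and the buildBookDoc helper are assumptions made up for illustration; they are not part of any MarkLogic API.

```javascript
// Illustrative publisher document; all field names here are assumptions.
const publisherDoc = {
  name: 'Acme Press',
  region: 'Midwest', // helpful when searching for publishers, not for books
  founded: 1952      // helpful when searching for publishers, not for books
};

// Hypothetical helper: copy only the publisher fields that book searches will use.
function buildBookDoc(book, publisher) {
  return {
    title: book.title,
    author: book.author,
    publisher: { name: publisher.name }
  };
}

const bookDoc = buildBookDoc(
  { title: 'Node Patterns', author: 'A. Writer' },
  publisherDoc
);

console.log(JSON.stringify(bookDoc, null, 2));
```

The region and founding year stay in the publisher document only, so a change to either one never touches a book document.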

Don’t repeat what will be updated often.

Pieces of data that will change often should be normalized, that is, stored in one place so that an update touches a single document. By contrast, a publishing company’s name will not change often, and therefore could be denormalized into other documents if searching on it is important.

Dynamically calculate values that will change quickly.

How many books has an author sold? The answer is the sum of the sales of each book the author has published. The essential data is the per-book sales. The total will change frequently; storing the total will lead to frequent updates and a need to work at the application level to ensure the number stays correct. By contrast, the total is easy to calculate at run time and can be computed very quickly using indexes.
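
The calculation itself is simple. The sketch below sums per-book sales in application memory, only to show the shape of the computation; in MarkLogic the sum would normally be computed on the server from indexes rather than by pulling documents into Node.js. The sample data and the totalSalesFor function are hypothetical.

```javascript
// Hypothetical book documents; only the per-book sales figure is stored.
const bookDocs = [
  { title: 'First Book', author: 'A. Writer', sales: 1200 },
  { title: 'Second Book', author: 'A. Writer', sales: 800 },
  { title: 'Other Book', author: 'B. Author', sales: 500 }
];

// Derive the author's total on demand instead of storing it anywhere.
function totalSalesFor(author, books) {
  return books
    .filter((book) => book.author === author)
    .reduce((sum, book) => sum + book.sales, 0);
}

console.log(totalSalesFor('A. Writer', bookDocs)); // 2000
```

Because the total is never stored, there is nothing to keep in sync when a single book's sales figure changes.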

Size documents appropriately.

In MarkLogic, the ideal document size is in the range of 10 kilobytes to 1 megabyte. Larger documents take time to read from disk when they need to be retrieved. Very small documents are less efficient, since there is some overhead introduced for each document.
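
One quick way to see where a candidate document falls is to measure its serialized size before loading it. This is only a rough proxy, since the serialized JSON length is not the same thing as the space the document occupies inside the database. The sizeCategory function and its thresholds (taking 1 megabyte as 1024 × 1024 bytes) are assumptions for illustration.

```javascript
// Thresholds taken from the guideline above; the exact byte values are an assumption.
const MIN_BYTES = 10 * 1024;
const MAX_BYTES = 1024 * 1024;

// Hypothetical helper: classify a document by the size of its serialized JSON.
function sizeCategory(doc) {
  const bytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
  if (bytes < MIN_BYTES) return 'small';
  if (bytes > MAX_BYTES) return 'large';
  return 'ok';
}

console.log(sizeCategory({ note: 'short' }));             // small
console.log(sizeCategory({ body: 'x'.repeat(20000) }));   // ok
```

A result of 'small' or 'large' is a prompt to reconsider the document boundaries, not a hard error.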

Choose JSON or XML, or a mix.

In some ways, the choice between JSON and XML is a matter of preference. For a Node.js developer, JSON is a very natural choice, as JSON and JavaScript are so closely related. This book will focus on JSON.

If there is a starting data set that uses XML, the developer may choose to keep it that way in the database, but transform to JSON in response to requests for data.

There are some differences in what can be represented in JSON versus XML. XML is good for text that will be marked up. For instance, consider a document that will be passed to an entity extraction engine to identify person names, locations, organizations, dates, and other information. In some cases, we just want to know that these things exist within a document, in which case we can store it in JSON. However, if we want to mark up the document inline, so that we can later look for entities near each other, XML handles this well. XML also allows for attributes, which describe elements.

<content><person>David Cassel</person> started working for <company>MarkLogic</company> in <date start="2009-01-01" end="2009-12-31">2009</date>. Before that, he worked for <company>Lockheed Martin</company>.</content>
Figure 6: Example XML data showing markup
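
For comparison with the inline markup in Figure 6, here is a sketch of how the same sentence might look in JSON when we only need to know which entities appear, not where. The property names (text, entities, people, companies, dates) and the mentionsCompany helper are assumptions for illustration.

```javascript
// The same content as Figure 6, with the entities listed beside the text
// rather than marked up inline. Property names are illustrative.
const contentDoc = {
  text: 'David Cassel started working for MarkLogic in 2009. ' +
        'Before that, he worked for Lockheed Martin.',
  entities: {
    people: ['David Cassel'],
    companies: ['MarkLogic', 'Lockheed Martin'],
    dates: [{ value: '2009', start: '2009-01-01', end: '2009-12-31' }]
  }
};

// Hypothetical helper: we can tell that a company is mentioned...
function mentionsCompany(doc, name) {
  return doc.entities.companies.includes(name);
}

// ...but nothing in this structure records where in the text it appears.
console.log(mentionsCompany(contentDoc, 'MarkLogic')); // true
```

This shape answers "does the document mention MarkLogic?" but cannot say whether a person and a company occur near each other, which is where the inline XML markup helps.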

Overall, XML is an expressive format for representing content (text in a hierarchical structure), while JSON is good for data – key/value pairs, arrays, and other data that consists of scalar data at various levels of the document hierarchy.

MarkLogic is schema-agnostic.

Relational databases require a schema to describe data. XML documents stored in MarkLogic may be required to adhere to an XML schema, but this is optional. In most cases, no formal schema is used and documents with multiple, informal schemas exist within a database. This flexibility is what is meant by schema-agnostic. MarkLogic contrasts that with “schemaless” databases, which do not provide the option to require a schema.

There is no widely accepted standard for JSON schemas at this time and MarkLogic does not support requiring a schema for JSON documents.

MarkLogic supports two-stage queries.

Although MarkLogic documents are typically denormalized, sometimes a query requires some data from one type of document to query a different type of document. Data modeling in MarkLogic seeks to minimize this, but when necessary, an application can do a two-stage query. This is effectively a join, and it is avoided where practical for the same reasons joins are problematic for relational databases: a two-stage query is necessarily slower than a single-stage query.
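
The shape of a two-stage query can be sketched with plain arrays standing in for the two document types. This is not MarkLogic client code; in a real application each stage would be a separate query against the database. The sample data and the booksByPublisherRegion function are hypothetical.

```javascript
// In-memory stand-ins for two document types; names and fields are illustrative.
const publishers = [
  { name: 'Acme Press', region: 'Midwest' },
  { name: 'Harbor Books', region: 'Northeast' }
];

const books = [
  { title: 'Node Patterns', publisher: 'Acme Press' },
  { title: 'Sea Stories', publisher: 'Harbor Books' },
  { title: 'Prairie Tales', publisher: 'Acme Press' }
];

function booksByPublisherRegion(region) {
  // Stage 1: query publisher documents for the values we need.
  const names = new Set(
    publishers.filter((p) => p.region === region).map((p) => p.name)
  );
  // Stage 2: use those values to query book documents.
  return books.filter((b) => names.has(b.publisher)).map((b) => b.title);
}

console.log(booksByPublisherRegion('Midwest')); // [ 'Node Patterns', 'Prairie Tales' ]
```

The second stage cannot start until the first has finished, which is why this pattern costs more than a single query and why region was left out of the book documents only on the expectation that users rarely search books that way.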

2 Responses to “Building a MarkLogic Data Model”

  1. Matthew Royal Says:

    Great tips! It’s a good prompt for me to look for NoSQL iterative schema design tips.

    Perhaps NoSQL schema design (as opposed to up-front, Waterfall-style design required for RDBMSes) can borrow from the existing body of agile knowledge: there’s a concept of “Emergent Design,” where your team “starts delivering functionality and lets the design emerge. Development will take a piece of functionality A and implement it using best practices and proper test coverage and then move on to delivering functionality B. Once B is built, or while it is being built, the organization will look at what A and B have in common and refactor out the commonality.”


    What do you think about the refactoring aspect? Too heavy for a document unit? Common sense for document contents?

  2. Sumon Says:

    Hi David,
    This is in relation to your post on
    There you discuss the data model for the Samplestack app. Given the data model, how would you suggest one goes about listing top answers by a contributor ordered by the number of up-votes to the answers?
