Building a MarkLogic Data Model
Author: Dave Cassel | Category: Software Development
This post is an excerpt from a book I’m working on: MarkLogic for Node.js Developers. This section is part of a chapter on Data Modeling, falling after a comparison to relational database modeling and a discussion of denormalization. The goal is to address the question of what should be a document in MarkLogic. The next section illustrates these points using Samplestack as a case study. Feedback welcome.
There are several factors to consider when deciding what to include in a document.
Document databases can hold many types of documents.
In a document database, documents that represent different types of entities can sit side by side. A book database might have separate documents for books, authors, and publishers, each containing the bulk of information related to that type.
A document is the unit of search.
When doing a search against a MarkLogic database, the typical goal is to identify which documents match a particular query. Understanding what an application’s users will want to search for informs the types of documents you should have.
Include what will be searched for.
When considering a book database, a user might want to search for books by a particular publisher, but will probably not be looking for books published by a company based in a particular region, or founded in a certain year. Such information can be left out of the book document – it will not contribute to search, so there is no benefit to repeating it. Repeating the publisher’s name in book documents makes more sense. Data that is more helpful when searching for publishers will be included in publisher documents.
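As a rough sketch of that split, the TypeScript below models a publisher document and a book document as plain objects. The field names (`region`, `founded`, `publisherName`) and the helper `toBookDocument` are hypothetical, chosen only for illustration; nothing here is a MarkLogic API.

```typescript
// Hypothetical document shapes; every field name here is an illustrative assumption.
interface PublisherDoc {
  type: "publisher";
  name: string;
  region: string;  // helps when searching for publishers, not for books
  founded: number; // likewise kept only in the publisher document
}

interface BookDoc {
  type: "book";
  title: string;
  author: string;
  publisherName: string; // repeated here because users search books by publisher
}

// Build a book document, copying only the publisher data that book searches need.
function toBookDocument(title: string, author: string, publisher: PublisherDoc): BookDoc {
  return { type: "book", title, author, publisherName: publisher.name };
}

const examplePublisher: PublisherDoc = {
  type: "publisher",
  name: "Example Press",
  region: "Northeast",
  founded: 1950,
};

const exampleBook = toBookDocument("An Example Book", "A. Writer", examplePublisher);
```

The book document carries the publisher's name and nothing else about the publisher; the region and founding year stay in the publisher document, where they can serve publisher searches.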
Don’t repeat what will be updated often.
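Pieces of data that will change often should be normalized, so that each update touches a single document. Data that rarely changes is a safer candidate for repetition. For example, a publishing company’s name will not change often, and therefore could be denormalized into other documents if searching on it is important.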
Dynamically calculate values that will change quickly.
How many books has an author sold? The answer is the sum of the sales of each book the author has published. The essential data is the per-book sales. The total will change frequently; storing the total will lead to frequent updates and a need to work at the application level to ensure the number stays correct. Conversely, the total is easy to calculate at run-time and can be done very quickly using indexes.
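To make the idea concrete, here is a minimal sketch that stores only per-book sales and derives the author's total on demand. The record shape and the function `totalSalesForAuthor` are hypothetical, and the in-memory array stands in for what MarkLogic would resolve from indexes; this is not how the database itself computes the sum.

```typescript
// Hypothetical per-book sales records; in a real database these would be book documents.
interface BookSales {
  title: string;
  author: string;
  copiesSold: number;
}

// Derive the total at request time instead of storing a value that must be kept in sync.
function totalSalesForAuthor(books: BookSales[], author: string): number {
  return books
    .filter((b) => b.author === author)
    .reduce((sum, b) => sum + b.copiesSold, 0);
}

const salesRecords: BookSales[] = [
  { title: "First Book", author: "A. Writer", copiesSold: 1200 },
  { title: "Second Book", author: "A. Writer", copiesSold: 800 },
  { title: "Other Book", author: "B. Author", copiesSold: 500 },
];

const writerTotal = totalSalesForAuthor(salesRecords, "A. Writer"); // 1200 + 800 = 2000
```

Because the total is never stored, a new sale changes only one book's record and the next request sees the updated sum.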
Size documents appropriately.
In MarkLogic, the ideal document size is in the range of 10 kilobytes to 1 megabyte. Larger documents take time to read from disk when they need to be retrieved. Very small documents are less efficient, since there is some overhead introduced for each document.
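One way to keep an eye on this while designing a model is to measure candidate documents before loading them. The sketch below is an assumption-laden illustration: it treats the UTF-8 length of the serialized JSON as a rough proxy for document size (the size MarkLogic stores on disk will differ), it reads "10 kilobytes" as 10 × 1024 bytes, and the names `utf8ByteLength` and `checkDocumentSize` are hypothetical.

```typescript
// Assumed bounds taken from the guideline above; treat them as guidance, not hard limits.
const MIN_BYTES = 10 * 1024;   // 10 kilobytes
const MAX_BYTES = 1024 * 1024; // 1 megabyte

// Count the bytes a string occupies when encoded as UTF-8.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // a surrogate pair encodes one 4-byte character
      i++;        // skip the low surrogate that completes the pair
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

type SizeVerdict = "too small" | "ok" | "too large";

// Classify a candidate document by the size of its JSON serialization.
function checkDocumentSize(candidate: object): SizeVerdict {
  const size = utf8ByteLength(JSON.stringify(candidate));
  if (size < MIN_BYTES) return "too small";
  if (size > MAX_BYTES) return "too large";
  return "ok";
}
```

A run of "too small" verdicts suggests that several related records might belong in one document, while "too large" suggests splitting a document along the lines users will search.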
Choose JSON or XML, or a mix.
If there is a starting data set that uses XML, the developer may choose to keep it that way in the database, but transform to JSON in response to requests for data.
There are some differences in what can be represented in JSON versus XML. XML is good for text that will be marked up. For instance, consider a document that will be passed to an entity extraction engine to identify person names, locations, organizations, dates, and other information. In some cases, we just want to know that these things exist within a document, in which case we can store it in JSON. However, if we want to mark up the document inline, so that we can later look for entities near each other, XML handles this well. XML also allows for attributes, which describe elements.
<content><person>David Cassel</person> started working for <company>MarkLogic</company> in <date start="2009-01-01" end="2009-12-31">2009</date>. Before that, he worked for <company>Lockheed Martin</company>.</content>
Figure 6: Example XML data showing markup
Overall, XML is an expressive format for representing content (text in a hierarchical structure), while JSON is good for data – key/value pairs, arrays, and other data that consists of scalar data at various levels of the document hierarchy.
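For the case where we only need to know that entities exist, the marked-up text in Figure 6 can be reduced to a JSON-style summary. The sketch below does that with a regular expression, which is an assumption made to keep the example short; it works for this flat sample but is not a real XML parser. The function `extractEntities` and the summary's property names are hypothetical.

```typescript
// The XML from Figure 6, held as a string for the purposes of this sketch.
const markedUp =
  '<content><person>David Cassel</person> started working for <company>MarkLogic</company> in ' +
  '<date start="2009-01-01" end="2009-12-31">2009</date>. Before that, he worked for ' +
  '<company>Lockheed Martin</company>.</content>';

// Collect the text of every element with the given tag name.
// Regex-based and only suitable for simple, non-nested elements like those above.
function extractEntities(xml: string, tag: string): string[] {
  const pattern = new RegExp(`<${tag}(?:\\s[^>]*)?>([^<]*)</${tag}>`, "g");
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    found.push(match[1] ?? "");
  }
  return found;
}

// A JSON-friendly view: which entities appear, without their position in the text.
const entitySummary = {
  people: extractEntities(markedUp, "person"),
  companies: extractEntities(markedUp, "company"),
  dates: extractEntities(markedUp, "date"),
};
```

The summary records that two companies are mentioned, but not where. Questions such as whether a person appears near a company can only be answered from the inline markup, which is the trade-off described above.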
MarkLogic is schema-agnostic.
Relational databases require a schema to describe data. XML documents stored in MarkLogic may be required to adhere to an XML schema, but this is optional. In most cases, no formal schema is used and documents with multiple, informal schemas exist within a database. This flexibility is what is meant by schema-agnostic. MarkLogic contrasts that with “schemaless” databases, which do not provide the option to require a schema.
There is no widely accepted standard for JSON schemas at this time and MarkLogic does not support requiring a schema for JSON documents.
MarkLogic supports two-stage queries.
Although MarkLogic documents are typically denormalized, sometimes a query requires some data from one type of document to query a different type of document. Data modeling in MarkLogic seeks to minimize this, but when necessary, an application can do a two-stage query. This is effectively a join, and it is avoided where practical for the same reason joins are costly in relational databases: a two-stage query is necessarily slower than a single-stage query.
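The shape of a two-stage query can be sketched with in-memory arrays standing in for the database. Suppose a user does want books from publishers in a particular region, and the region lives only in publisher documents. The types, data, and function names below are hypothetical; against a real MarkLogic database each stage would be a separate query, which is where the extra cost comes from.

```typescript
// Hypothetical document shapes for this sketch.
interface Publisher {
  name: string;
  region: string;
}

interface Book {
  title: string;
  publisherName: string;
}

// Stage one: query publisher documents to find the names in a region.
function publisherNamesInRegion(publishers: Publisher[], region: string): string[] {
  return publishers.filter((p) => p.region === region).map((p) => p.name);
}

// Stage two: feed the stage-one results into a query over book documents.
function booksByPublishers(books: Book[], publisherNames: string[]): Book[] {
  return books.filter((b) => publisherNames.indexOf(b.publisherName) !== -1);
}

const publisherDocs: Publisher[] = [
  { name: "Example Press", region: "Northeast" },
  { name: "Sample House", region: "West" },
];

const bookDocs: Book[] = [
  { title: "An Example Book", publisherName: "Example Press" },
  { title: "A Sample Book", publisherName: "Sample House" },
  { title: "Another Example", publisherName: "Example Press" },
];

const northeastNames = publisherNamesInRegion(publisherDocs, "Northeast");
const northeastBooks = booksByPublishers(bookDocs, northeastNames);
```

If this kind of request turned out to be common, the modeling guidance above would suggest repeating the region in book documents so that a single query suffices.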