Querying across fragments
Author: Dave Cassel | Category: Software Development
In MarkLogic Server, we tend to think about the documents that we put into the database. Most of the documentation, however, talks about fragments. I’ve had to learn a bit about fragments recently, so I figured I’d share.
What are fragments?
Fragments are one layer of granularity of data in a MarkLogic Server database. In particular, a fragment is the smallest unit that the database will retrieve, though your query might then ask for a subset of that fragment. In the normal course of doing things, a fragment is a document, and in that case we don’t need to worry about the differences. But MLS lets us set a fragment root or fragment parent (using the Admin page; Configure -> Database -> some database -> Fragment Roots/Fragment Parents). When you set one of them, you are specifying where to break a document so that the subparts will be stored separately from the root. The fragments and the root still belong to the same document, and doing fn:doc($uri) on a fragmented document will still give you back the whole thing, but the challenge comes in how queries are affected.
How does fragmentation affect queries?
Let’s consider a simple document:
<doc xmlns="http://davidcassel.net/blog">
  <foo>1</foo>
  <frag>
    <bar>2</bar>
  </frag>
</doc>
Let’s assume no fragmentation, which is the common case. Now, suppose I run a query that checks for particular values of “foo” and “bar”.
declare namespace blog = "http://davidcassel.net/blog";

cts:search(/blog:doc,
  cts:and-query((
    cts:element-value-query(xs:QName("blog:foo"), "1"),
    cts:element-value-query(xs:QName("blog:bar"), "2")
  ))
)
Simple. Sure enough, I get back my document. Now let’s try fragmenting this document and see what happens. I inserted the document above into my Documents database and then set blog:frag as a fragment root. Now my sample document has two fragments: the frag element and the root (/blog:doc) element.
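I set the fragment root through the Admin page, but the same setting can, as far as I know, also be scripted with the Admin API. The following is a minimal sketch, not the steps I actually ran: it assumes the database is named "Documents" and that the query runs as a user with admin privileges.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: the target database is called "Documents". :)
let $config := admin:get-configuration()
let $db     := xdmp:database("Documents")

(: Describe the fragment root by namespace URI and local name (blog:frag). :)
let $root := admin:database-fragment-root("http://davidcassel.net/blog", "frag")

(: Add the fragment root to the in-memory configuration, then save it. :)
return
  admin:save-configuration(
    admin:database-add-fragment-root($config, $db, $root)
  )
```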
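What happens when we run the same query now that fragmentation is in effect? We get the empty sequence. Why? Because the two elements we are querying are in different fragments. The cts:and-query() needs both of its sub-queries to match within a single fragment, and no single fragment contains both “foo” and “bar” anymore.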
Querying against fragmented documents
There is still a way to query a document like this — to look for pieces in different fragments, but still identify the whole document. We’ll need to revise the query.
declare namespace blog = "http://davidcassel.net/blog";

cts:search(/blog:doc,
  cts:and-query((
    cts:document-fragment-query(
      cts:element-value-query(xs:QName("blog:foo"), "1")),
    cts:document-fragment-query(
      cts:element-value-query(xs:QName("blog:bar"), "2"))
  ))
)
The cts:document-fragment-query() identifies documents that have a fragment that matches the query — thus letting us cross the fragment boundary. When we run this query, we get back our test document.
I heard a talk early in my time at MarkLogic about fragmentation. The super-short summary of the talk was: “don’t.” The slightly longer summary was: “if you must, here’s what you need to know.” The MarkLogic recommendation in most cases is that if you’re thinking about fragmentation, it’s better to move content into separate documents instead. The reason is that when fragments and documents are no longer the same thing, you can start getting surprising query results, as we saw above. I think it comes down to the simple fact that it’s easy to forget that fragmentation is being done. If you retrieve a document using fn:doc($uri), you won’t see any signs of it. You can easily find yourself staring at the screen saying, “why doesn’t this query find this document?” Using fragmentation requires awareness and an understanding of how the feature works.
Sometimes there are good reasons to split content. Perhaps in your data set a document translates to something rather large, made up of a hundred sizable pieces or so. Let’s suppose that in most of your queries, what you are looking to retrieve is one of those pieces. You run your query, MarkLogic Server finds the piece you need, but it needs to read the entire fragment — the whole document — from disk in order to give you the piece you want. Remember we said that a fragment is the smallest unit the database will retrieve? It needs to bring back a lot of data to give you what you want. By breaking the pieces into their own smaller documents (with a reference back to the parent) or by fragmenting on the pieces, the database would be able to retrieve just the part you are looking for.
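To make the "separate documents with a reference back to the parent" option concrete, here is a rough sketch using the sample document from earlier. The URIs and the piece and parent-uri element names are made up for illustration; your own naming and reference scheme may well look different.

```xquery
xquery version "1.0-ml";

declare namespace blog = "http://davidcassel.net/blog";

(: Hypothetical URI of the large parent document. :)
let $parent-uri := "/example/doc.xml"

(: Write each blog:frag element out as its own small document. :)
for $frag at $i in fn:doc($parent-uri)/blog:doc/blog:frag
let $piece-uri := fn:concat("/example/doc/frag-", $i, ".xml")
return
  xdmp:document-insert(
    $piece-uri,
    (: "piece" and "parent-uri" are illustrative wrapper elements. :)
    <piece xmlns="http://davidcassel.net/blog">
      <parent-uri>{$parent-uri}</parent-uri>
      {$frag}
    </piece>
  )
```

Each piece is now its own document, and therefore its own fragment, so a query that matches one piece only has to read that piece. The parent-uri element is what lets you find your way back to the original.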
This matters for updates, too. When you change some content, the fragment that was changed gets written as a replacement fragment. If you have a huge, unfragmented document, that can be a big write for a little change. Splitting subparts into their own documents allows for smaller writes.
Another case where splitting makes a difference is cts:not-query(). These queries look for matches anywhere within the fragment. There are times when that makes it tough to get what you want. Consider a document that is a book, with subparts that are chapters. Let’s say we want to find chapters that do mention cats but don’t mention dogs (okay dog lovers, stay focused). We can write a cts:and-not-query() (the same as a cts:and-query() wrapping a positive query and a cts:not-query()) to look for cats but exclude dogs. However, if we have one big document for each book, then a mention of a dog anywhere in the book will disqualify that entire fragment, meaning the whole document, from being an answer. By splitting out chapters into their own documents, or by fragmenting on them, we can get the not-queries to be more focused, only eliminating those chapters that contain dogs.
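Here is roughly what that cats-but-not-dogs query could look like. The chapter element name is an assumption for the sake of the example, and whether "cats" also matches "cat" depends on the stemming settings of your database.

```xquery
xquery version "1.0-ml";

(: Assumption: books are stored with no-namespace <chapter> elements. :)

(: Positive query first, negative query second. :)
let $cats-not-dogs :=
  cts:and-not-query(
    cts:word-query("cats"),
    cts:word-query("dogs")
  )

(: The equivalent long-hand form, shown only for comparison. :)
let $long-hand :=
  cts:and-query((
    cts:word-query("cats"),
    cts:not-query(cts:word-query("dogs"))
  ))

return cts:search(//chapter, $cats-not-dogs)
```

The query text is the same whether or not the chapters are split out. What changes is the scope of the negative part: with one fragment per book, a single dog rules out every chapter in that book, while with one fragment per chapter only the chapters that mention dogs are excluded.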
Moral of the story
Wheel of Morality, turn, turn, turn. Today’s moral is that splitting large documents is often necessary, and that fragmentation is one way to accomplish that, but that it can be error-prone. If you’re going to use fragmentation, make sure you have a good understanding of how it works and how it will impact your queries.