Data Hub Framework Flows

Author: Dave Cassel  |  Category: Software Development

The Data Hub Framework is a feature recently added to MarkLogic that makes it easier to gather data from a variety of sources and build a common representation across the original formats. I learned some useful things about working with the framework that I thought were worth writing down (partly so that I’ll remember them).

My colleague at MarkLogic, Paxton Hare, started the MarkLogic Data Hub Framework project early in 2016. In the Spring of 2017, I joined him on the project. The requirements for this project were drawn from the developers who were building operational data hubs for customers.

Types of Flows

Paxton once told me that he thought about naming the types of flows differently: instead of “input” and “harmonize” flows, he thought “real-time” and “batch” would better describe what they do. I like the term “streaming” instead of “real-time”, to avoid confusion with real-time computing.

Input flows are run as transforms. Some other process sends data to MarkLogic using MLCP, the REST API, or one of the libraries built on top of the REST API, and an input flow transforms the data along the way. Input flows have no writer, because the flow itself is not responsible for persisting the data. These can be thought of as “streaming” flows, in that the flow is applied to each document between the time an external process sends it and the time MarkLogic persists it. They are called input flows because they mark the first point of entry into the database.
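
As a rough illustration, here is a minimal input-flow content plugin in MarkLogic server-side JavaScript. This is a sketch based on the plugin signature the framework's scaffolding generates (createContent(id, rawContent, options)); check the generated code in your own project for the exact shape. The incoming document arrives as rawContent, and because there is no writer, whatever the flow produces is persisted by the calling process (MLCP or the REST API).

// Sketch of an input-flow content plugin (assumed DHF scaffolding signature).
// rawContent is the document being sent in; the return value becomes the
// instance portion of the envelope. There is no writer plugin for input flows.
function createContent(id, rawContent, options) {
  // pass the incoming document through unchanged, or reshape it here
  return rawContent;
}

module.exports = {
  createContent: createContent
};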

Harmonize flows are a process in themselves. The first step of a harmonize flow is gathering the identifiers it will work on. (These identifiers may be URIs of documents already in MarkLogic, but they could also be values to search for in a MarkLogic database, or they could identify resources to pull in from an external data source.) The flow then transforms and writes each document in turn. This is a batch approach to modifying documents. They were originally called harmonize flows because this was often the stage where documents were copied from the staging database to the final database, harmonizing some properties along the way.
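
By contrast, a harmonize flow has a writer plugin, because the flow itself persists each transformed document. A minimal sketch, assuming the write(id, envelope, options) signature from the framework's scaffolding, might look like this:

// Sketch of a harmonize-flow writer plugin (assumed DHF scaffolding signature).
// The harmonize flow owns persistence, so it inserts each envelope itself,
// here into a collection named after the entity.
function write(id, envelope, options) {
  xdmp.documentInsert(id, envelope, xdmp.defaultPermissions(), options.entity);
}

module.exports = {
  write: write
};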

When the framework was first built, the common pattern of use was that input flows brought data into the staging content database, then harmonize flows turned the raw data in the staging database into commonly structured envelope documents in the final database. This pattern became so ingrained in me that at first I didn’t notice that you don’t need both steps. It’s perfectly reasonable to use an input flow to send data directly to the final database.

Putting Flows to Work

For example, I designed a data hub project to collect data from various MarkLogic web sites (www, developer, docs, training, help). Once the data is gathered into a single database, a search service on that database makes discovery of available material across those sites much easier. (We currently have this across www, developer, docs, and some training material, but it would be beneficial to update the implementation and expand its reach.) For this project, the first task is identifying the common attributes across the source sites. I came up with the following:

  • url (absolute URL, including protocol)
  • category (technical blog post, tutorial, recipe, guide, etc.; useful as a facet)
  • last-updated (a date-time)
  • tags (zero or more tags, with values at the discretion of the content providers)
  • title (a string suitable for display with search results)

For each data source, we can construct an input flow that builds an envelope, with the original content stored in an attachments XML element or JSON property and the above properties expressed under an instance element or property. (Those element/property names are chosen to be consistent with Entity Services.) This data can be written directly to the final content database; there is no need for an input flow to insert data followed by a separate harmonize flow to construct the envelopes.
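
To make that concrete, a harmonized document from one of the sites might look roughly like the following (the values are illustrative, not taken from a real document):

{
  "envelope": {
    "instance": {
      "info": { "title": "WebContent", "version": "0.0.1" },
      "url": "https://developer.marklogic.com/blog/some-post",
      "category": "technical blog post",
      "last-updated": "2017-06-01T00:00:00Z",
      "tags": ["data-hub", "search"],
      "title": "Some Blog Post"
    },
    "attachments": "... the original source content ..."
  }
}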

Why Harmonize?

So if an input flow can write harmonized data to the final database, why do we need harmonize flows? They are helpful when the process of building envelopes is less straightforward. As an example, when the Documentation team publishes new content, the guides are part of a large zip file. An input flow can bring that content into MarkLogic, but since it has no writer, it can’t break the content up into appropriately sized chunks. A harmonize flow can break it up into separate documents and populate the envelope properties (URL, category, and so on).
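
A writer plugin for such a flow might, as a sketch, read the raw guide that the input flow loaded and insert one document per chapter. The write(id, envelope, options) signature is assumed from the framework's scaffolding, and the chapter element and URI scheme are made up for illustration; the real documentation content is structured differently.

// Sketch of a harmonize writer that splits a large guide into one document
// per chapter (assumed write(id, envelope, options) signature; the "chapter"
// element and URI scheme are illustrative).
function write(id, envelope, options) {
  const guide = cts.doc(id);  // the raw guide loaded by the input flow
  let n = 0;
  for (const chapter of guide.xpath('//chapter')) {
    n++;
    xdmp.documentInsert(id + '/chapter-' + n + '.xml', chapter,
      xdmp.defaultPermissions(), options.entity);
  }
}

module.exports = {
  write: write
};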

If your data needs only a self-contained set of changes on its way into MarkLogic, and will remain static once there, MarkLogic recommends using an input flow to send it directly to the final content database. Consider sending data to the staging database and then using a harmonize flow to bring it to the final database if you have either of the following situations:

  • the original content has a significantly different form from what you want to make available in the final database (for instance, documents need to be split up)
  • similar to the above, if your final content database documents will be constructed from multiple input documents (as is often the case when relational data is ingested table by table), send those documents to the staging database, then use a harmonize process to assemble the final entity documents (sketched below)
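
For that second case, a harmonize content plugin can pull the related staging documents together into one entity. Here is a sketch, assuming the createContent(id, options) signature; the "orders" collection and "customer-id" property are made-up names for illustration.

// Sketch of a harmonize content plugin that assembles one entity from several
// staging documents (e.g. rows ingested table by table). The "orders"
// collection and "customer-id" property are illustrative.
function createContent(id, options) {
  const customer = cts.doc(id).toObject();
  const orders = [];
  for (const order of cts.search(cts.andQuery([
      cts.collectionQuery('orders'),
      cts.jsonPropertyValueQuery('customer-id', customer.id)]))) {
    orders.push(order.toObject());
  }
  customer.orders = orders;
  return customer;
}

module.exports = {
  createContent: createContent
};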

Iterating

A harmonize flow doesn’t have to move content from one database to another; it can also be used to update content in place.

For the web sites hub described above, suppose we decide to harmonize an additional property, such as author. We can write a harmonize flow that both reads from and writes to the final content database. Because we store the original content in the attachments element of the document envelope, the flow can extract the new property from that original content, add it to the instance, then overwrite the existing document. We’d need to write this flow for each of the input sources, since the property would likely be found in a different place in each source, but that requires very little coding. We’d also need to update the various input flows so that new documents come in with the author property.
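
A sketch of such an in-place content plugin follows, assuming the createContent(id, options) signature; extractAuthor() is a hypothetical stand-in for whatever source-specific logic finds the author in the original content, and the version bump anticipates the next step below.

// Sketch of an in-place harmonize content plugin (assumed createContent(id,
// options) signature).
function createContent(id, options) {
  const doc = cts.doc(id).toObject();
  const instance = doc.envelope.instance;

  // pull the new property out of the original content kept in attachments
  instance.author = extractAuthor(doc.envelope.attachments);

  // bump the model version so the collector stops selecting this document
  instance.info.version = '0.0.2';

  return instance;
}

// hypothetical, source-specific helper; each input source would need its own
function extractAuthor(attachments) {
  return attachments.author || 'unknown';
}

module.exports = {
  createContent: createContent
};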

Once we’ve updated the input flows, how do we know which documents need the harmonize update? Here’s a part of the Entity Services document model, inside the instance:

"info": {
  "title": "WebContent", 
  "version": "0.0.1"
}

When we add author to the model, we increment the model’s version number. The harmonize flow’s collector plugin can then query for documents that still carry the old model version number, so only those documents get updated.
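
Here is a sketch of such a collector, assuming the collect(options) signature, JSON envelopes, and that the harmonized documents sit in a collection named after the entity.

// Sketch of a collector plugin that selects only documents still on the old
// model version (assumed collect(options) signature; the collection name and
// version value are illustrative).
const OLD_VERSION = '0.0.1';

function collect(options) {
  return cts.uris(null, null,
    cts.andQuery([
      cts.collectionQuery(options.entity),
      cts.jsonPropertyValueQuery('version', OLD_VERSION)
    ]));
}

module.exports = {
  collect: collect
};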

Wrapping Up

The Data Hub Framework goes a long way toward simplifying the process of building an operational data hub. With a better understanding of when to use each type of flow, your architecture will work even better.
