Welcome to MarkLogic

This is part of a draft of MarkLogic 8 for Node.js Developers. Incomplete sections are [marked with brackets].

MarkLogic provides developers with a powerful set of tools for solving complex problems that stem from the volume, velocity, and variety of Big Data. This book will introduce the concepts of MarkLogic and illustrate them using a substantial application.

Why MarkLogic?

MarkLogic was founded in 2001 to solve problems related to what we’ve come to know as Big Data. This term is used to describe the three-way challenge of data with high volume, variety, and velocity.

High volume refers to large quantities of data, often at the terabyte or petabyte scale. MarkLogic addresses this problem partly by scaling out, allowing the use of commodity hardware to expand capacity, but also by effective use of indexes and map/reduce approaches to provide fast responses even as the volume of content grows. The document nature of MarkLogic storage helps achieve this ability to scale.

Data with a lot of variety poses a substantial problem for technologies that require a schema design before data can be ingested. MarkLogic’s schema-agnostic approach lets data with different schemas sit side by side in the same database, so developers can focus on how to use the data in an application rather than on how to represent it.

The velocity of change is a similar type of problem. Designing a schema for a relational database often requires a significant amount of work. Designers try to anticipate change, but when changes do happen, accommodating them takes a lot of effort. Furthermore, a change in a relational schema will touch every row in the affected tables. With MarkLogic’s document orientation, work resulting from schema changes focuses on the application itself. In many cases, a schema change only affects a subset of the documents, and only these will need to be updated.

These differences, and others, are covered in more detail in the Data Modeling chapter.

MarkLogic has been working with customers in Media, Publishing, Public Sector, Financial Services, and other industries, starting with the version 1 release in 2003.

Concepts

MarkLogic is an enterprise-class NoSQL information store and search engine. There’s a lot contained in that sentence — let’s break it down into pieces.

Enterprise Class

MarkLogic supports ACID transactions, government-grade security, high availability, and disaster recovery. These are all features you’d expect from a database that large organizations trust with their critical data. Appendix A addresses how MarkLogic supports ACID transactions. The Security chapter discusses the role-based security approach used within MarkLogic.

NoSQL

“NoSQL” databases emerged in response to the need for new ways to manage data, as many projects struggled to meet their needs with traditional relational databases. The term refers to SQL, the query language used to interact with relational databases. Note that some NoSQL databases, including MarkLogic, do support some degree of interaction through SQL, leading some to expand “NoSQL” to “Not-Only SQL”.

Information Store

MarkLogic provides a document store and a triple store, giving it tremendous flexibility in the types of data it can handle. We’ll explore what each of these means to you as a developer in the coming chapters.

Search Engine

In addition to storing data, MarkLogic provides a powerful set of features to search that data. More than 30 types of indexes power this capability, leading to very fast search results. The Search chapter describes this topic in detail.

MarkLogic Application Architecture

Applications are typically built with three tiers: the database, application, and presentation tiers. Each of these layers has its role, though the lines between them can be blurry.

Database Tier

MarkLogic provides the database tier, taking responsibility for storing and retrieving data with consistency and durability. Interaction between the application tier and a MarkLogic database goes through the MarkLogic REST API, Java Client API, Node.js Client API, or through custom-built endpoints. Using one of the provided APIs means broad out-of-the-box capabilities, in addition to a mechanism to extend that API using Server-side JavaScript or XQuery. In this book, we’ll focus on the Node.js Client API.

Middle Tier

The middle tier defines the interface that will be used by the presentation tier, controlling access to the database’s API. Business logic is often implemented here, though MarkLogic’s support of complex processing in the database means it is sometimes helpful to move the code close to the data. The middle tier is also a good place for code that interacts with third-party systems, such as using social networks for logging in or resizing uploaded images.

In this book, the middle tier will be implemented with Node.js.

Presentation Tier

The presentation tier is what the end-user actually sees. This may take the form of a web page viewed in a browser, a mobile app, or a desktop application. The presentation tier will send messages to the middle tier based on the user’s actions.

Working With MarkLogic

MarkLogic offers a variety of ways to interact with the database. Each of these goes through an application server, which is also included in MarkLogic. In this book, we will work with HTTP application servers, but there are also XDBC, ODBC, and WebDAV application servers. For more information about these, see the MarkLogic Administrator’s Guide.

Query Console

One of the applications that MarkLogic ships with is called Query Console. This provides a way to run ad hoc queries using Server-side JavaScript, XQuery, SPARQL, or SQL.

JavaScript and XQuery are used to query and update a database and to transform data. SPARQL is used for semantic queries and is covered in the Semantics chapter. The SQL view interface is primarily for connecting Business Intelligence tools and is not covered in this book. See MarkLogic’s SQL Data Modeling Guide for more information.

If you have installed and started up MarkLogic, Query Console should be running. Point your browser to http://localhost:8000/qconsole/. You will see something like Figure 1.

Figure 1: Query Console

Query Console provides buffers, such as “Query 1”, where you can type JavaScript expressions, click the Run button, and see the results in the lower section. Workspaces are listed on the right side of the screen. Each Workspace consists of a set of buffers. Figure 1 shows Query 1 in the default Workspace.

Developers can use Query Console to experiment with code, figuring out the right way to express a query or other task. Query Console can also be used to make small updates to a database.

[More about QC. More information in the QC Guide.]

Node.js Client API

For the technology stack used in this book, JavaScript is the language of choice throughout the tiers. MarkLogic uses the term “Server-side JavaScript” to refer to JavaScript running on the V8 engine embedded within MarkLogic, so I’ll use that term the same way. Node.js also uses JavaScript code on the server, but in a different tier. The language is the same, though there are some important usage differences.

Node.js is optimized for I/O-heavy applications. MarkLogic is a natural companion for Node, as much of the analytical processing and data transformations can be handled in the database itself.

The programming model used in Node is asynchronous. The application makes a request, then works on something else while waiting for the request to complete. The MarkLogic Node.js Client API supports callbacks, Promises, and streaming as asynchronous approaches.
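To make the asynchronous model concrete, here is a minimal sketch in plain Node.js. It does not use the MarkLogic Node.js Client API: fakeRequest and fakeRequestAsPromise are hypothetical stand-ins for a slow I/O call, and the URIs are made up for illustration. The sketch shows the same request handled first with a callback and then with a Promise.

```javascript
// Hypothetical stand-in for a slow I/O request (not part of any MarkLogic API).
// It "responds" after a short delay, the way a database call would.
function fakeRequest(uri, callback) {
  setTimeout(function () {
    callback(null, { uri: uri, content: { title: 'Example' } });
  }, 10);
}

// Callback style: pass a function that runs when the response arrives.
fakeRequest('/example/callback.json', function (error, doc) {
  if (error) {
    console.error(error);
    return;
  }
  console.log('callback received', doc.uri);
});

// Promise style: wrap the same request so callers can chain then/catch.
function fakeRequestAsPromise(uri) {
  return new Promise(function (resolve, reject) {
    fakeRequest(uri, function (error, doc) {
      if (error) {
        reject(error);
      } else {
        resolve(doc);
      }
    });
  });
}

const pending = fakeRequestAsPromise('/example/promise.json')
  .then(function (doc) {
    console.log('promise resolved', doc.uri);
    return doc;
  });

// This line runs before either response arrives: the requests do not block.
console.log('requests sent; doing other work');
```

The final console.log prints first, because both requests are still waiting on their timers when the script reaches it.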

MarkLogic uses JavaScript to extend the built-in capabilities, but uses a synchronous interface to do so. While Node.js is single-threaded, MarkLogic application servers use multiple threads to handle multiple client requests. MarkLogic also uses lazy evaluation to increase parallelism in its processing.

Documents are stored in MarkLogic as JSON, XML, text, or binary; the choice among these options is discussed in the chapter on Data Modeling. Applications commonly use more than one. Using JSON documents gives the advantage of not needing to transform them, but there are advantages to XML as well, particularly for HTML or text content.

Listing 1: Example of inserting a JSON document and reading it back

Listing 1 demonstrates the Node Client API, saving a document to the database and reading it back. Line 1 loads the Node module, using a typical Node “require” statement. Lines 3-8 establish a connection to the database, using the built-in App-Services application server on port 8000 and the admin user[1]. Each application server is configured to use a specific content database. By default, the App-Services application server points to the Documents database, so that is where the document will be loaded.

Line 10 specifies the document’s URI, which uniquely identifies the document within the database. Lines 12-18 write the document into the database. The write function takes a document descriptor that specifies the URI and the content of the new document. In this case, the document is itself a simple JSON object, with the title and author of this book.

The Node Client API offers choices for handling responses. Lines 18-25 demonstrate the Promise pattern. Other choices are Callback, Object Mode Streaming, and Chunked Mode Streaming. These options are discussed in the Key Concepts and Conventions section of the Node.js Application Programmer’s Guide published by MarkLogic. This book will focus on the Promise pattern.
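The following is a minimal sketch of the steps described above. It is not the original Listing 1, so the line numbers cited in the text do not apply to it. It assumes the marklogic npm module's createDatabaseClient, documents.write, documents.read, and result calls; the function name writeThenRead, the URI /books/example.json, and the "admin" password are hypothetical choices made for illustration. The database client is passed in as a parameter so that the function can be tried with a stand-in object when no MarkLogic instance is available.

```javascript
// Write one JSON document, then read it back, using the Promise pattern.
// `db` is a database client such as the one returned by
// marklogic.createDatabaseClient(); passing it in is a choice made for
// this sketch, not something the Client API requires.
function writeThenRead(db, uri, content) {
  return db.documents.write({
    uri: uri,
    contentType: 'application/json',
    content: content
  }).result()
    .then(function (response) {
      // The write response lists the URIs of the documents written.
      console.log('wrote', response.documents[0].uri);
      return db.documents.read(uri).result();
    })
    .then(function (documents) {
      // read() resolves to an array of document descriptors.
      return documents[0].content;
    });
}

// Assumed connection settings: the App-Services application server on
// port 8000 and an admin user whose password is "admin" (laptop use only).
const connection = {
  host: 'localhost',
  port: 8000,
  user: 'admin',
  password: 'admin',
  authType: 'DIGEST'
};

// Usage against a running MarkLogic instance (after `npm install marklogic`):
//
//   const marklogic = require('marklogic');
//   const db = marklogic.createDatabaseClient(connection);
//   writeThenRead(db, '/books/example.json', {
//     title: 'MarkLogic 8 for Node.js Developers'
//   }).then(function (content) {
//     console.log(content.title);
//   }).catch(function (error) {
//     console.error(error);
//   });
```

Chaining the read onto the write's Promise ensures the read is not sent until the write has completed.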

[Discuss error handling (maybe not here). For each method, describe and show how to catch errors and what type of errors get caught.]

Samplestack

The goal of this book is to show you how to build MarkLogic applications. You will learn both by reading about concepts and seeing them put into practice in Samplestack, an implementation of the MarkLogic Reference Architecture.

Samplestack is based on the popular question-and-answer website Stack Overflow. Stack Overflow provides data downloads, which were used to seed the Samplestack data set. Samplestack modifies the original application in a few ways, in order to illustrate MarkLogic concepts.

Setup

To follow along, you can set up Samplestack on your own computer.

[Describe how to install and run Samplestack.]

Features

The previous section covered how to install and run Samplestack.

[add Samplestack screenshot]

Figure 2: Samplestack’s initial view

After starting up Samplestack, point your browser to http://localhost:3000 and you’ll see the initial view. Samplestack is a question-and-answer site. Logged-in users can ask questions, answer them, and comment or vote on questions and answers. When the asker of a question sees an answer that satisfies his or her need, the asker can accept that answer, causing it to be displayed above other answers. Guest users can see questions that have accepted answers and can search by terms, tags, date, or user. Getting votes and having answers accepted influences a user’s reputation.

Each feature in Samplestack was selected to illustrate some aspect of MarkLogic.

  • Text and facet search: MarkLogic indexes the text from all documents, allowing fast searches for words and phrases. Samplestack also provides facets on dates and tags, allowing the user to explore content.
  • User records and Question documents: the content of Samplestack’s database consists of two types of documents. The chapter on Loading and Modeling Data discusses the thought process for modeling data this way.
  • Users and Roles: only logged-in users may use features that change the content of the database. Guest users see only questions that have accepted answers. The Security chapter shows how this works.
  • Voting: A vote not only affects the answer to which it is applied; it also changes the reputation of the person who wrote the answer. A vote triggers a multi-document update performed in a single transaction to ensure data integrity.
  • Related tags: MarkLogic is a semantic triple store, in addition to being a document store. This feature lets users browse by related tags to find questions that might be of interest.
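As a rough illustration of why voting touches more than one document, the sketch below computes both changes that a single vote implies. The document shapes (answers, votes, reputation), the URIs, and the function applyVote are hypothetical; Samplestack's actual documents and code may differ. The sketch covers only the computation of the two updated documents, not how they are saved within one transaction.

```javascript
// Compute the two document updates implied by one vote.
// Each argument document is { uri: string, content: object }.
// The field names used here are assumptions made for illustration.
function applyVote(questionDoc, answererDoc, answerId, delta) {
  // Copy the question, adjusting the vote count on the chosen answer only.
  const updatedQuestion = Object.assign({}, questionDoc.content, {
    answers: questionDoc.content.answers.map(function (answer) {
      return answer.id === answerId
        ? Object.assign({}, answer, { votes: answer.votes + delta })
        : answer;
    })
  });

  // Copy the answer's author, adjusting the reputation by the same amount.
  const updatedAnswerer = Object.assign({}, answererDoc.content, {
    reputation: answererDoc.content.reputation + delta
  });

  // Return both changes together so the caller can save them as a unit.
  return [
    { uri: questionDoc.uri, content: updatedQuestion },
    { uri: answererDoc.uri, content: updatedAnswerer }
  ];
}

const changes = applyVote(
  { uri: '/questions/q1.json', content: { answers: [{ id: 'a1', votes: 2 }] } },
  { uri: '/users/u1.json', content: { reputation: 10 } },
  'a1',
  1
);
console.log(changes[0].content.answers[0].votes); // 3
console.log(changes[1].content.reputation);       // 11
```

If only one of the two documents were saved, the vote count and the reputation would disagree, which is why the update needs to be transactional.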

The rest of this book will use Samplestack features to illustrate important concepts you will use in building your own applications.

Additional Resources

  • MarkLogic University on-demand video: “Introduction to MarkLogic”. This 24-minute video introduces MarkLogic at a high level.
  • MarkLogic University instructor-led training: “MarkLogic Fundamentals”. This one-day course goes deeper to introduce MarkLogic’s use cases and capabilities.
  • Samplestack GitHub repository: On GitHub, you can request new features, report bugs, and explore the source code.
  • MarkLogic University on-demand video: “Samplestack Demo”. Get a preview of what this application does.

[1] The admin password is established when your MarkLogic instance is first configured. Using “admin” for the password is okay for your laptop, but you’ll want something more secure for other servers.