Software Development – David Cassel

Software Development in the Age of AI

Dave Cassel — Thu, 22 Jan 2026 17:52:53 +0000

What does it mean to be a software developer at a time when AI can write code? I’ve been experimenting with AI tools to answer that question for myself. I’ve found what others have concluded: LLM-powered AI tools can be a great productivity boost in the right hands, but they do not fully replace a skilled developer.

Let’s cover a critical point early. If you take one thing away from this post, make it this:

The person using the tool remains responsible for the quality of the work.

I was taught the old adage “a poor craftsman blames his tools.” That saying applies here. But what does this mean in practice? How do I, as a human developer, work with AI to get work done quickly and effectively while keeping quality up?

Tools

AI tools can be used in a variety of ways. For this experiment, I wanted to explore the boundaries, so I had AI writing code for me, not just being a sounding board. For my exploration, I used three tools.

Manus (owned by Meta as of December 2025)
Microsoft Copilot
Anthropic’s Claude

I used this approach with a couple of projects:

Building a simple CRUD web site from scratch
Making progress on a dormant project that was human-built

Roles

Specify

Part of software development is translating business requirements into technical specifications. Vague requirements lead to guesswork about what needs to be done. Your AI will make choices that might work for you, but it’s very likely you’ll end up revising the prompt and asking it to do something different. This delays the work and wastes capacity.

This is true with human developers as well, of course. With a vaguely worded ticket, a developer might come back to the product owner (or whoever is providing requirements) and ask clarifying questions, or might make some assumptions. Some of these assumptions may not even be conscious ones. As developers working with AI, we need to ensure we are building the right thing.

A good prompt makes a big difference. Give the tool some context about what you’re trying to accomplish. Indicate software or library choices that have already been made to constrain the approach. Work in steps: ask the AI to review the instructions, ask clarifying questions, then propose an approach. Review the approach and provide any course correction necessary before unleashing it to write code.

We can think of this aspect as the latest type of abstraction. Ultimately, our work leads to high or low voltage on wires. We abstract to binary, to assembly code, to low-level and high-level languages. We use libraries and external services. Each of these allows us to focus more on what we’re trying to build instead of the details of how to build it. Providing detailed descriptions of what we want AI to accomplish provides another layer of abstraction.

Review

We can ask our AI tool for different levels of help. We could ask for a couple of approaches that we then implement. We could ask it to produce a skeleton structure with comments of what needs to be filled in. With an agentic AI, we can go as far as “Create a draft PR that….” I used this last approach for both of my experimental projects. I want a draft PR (which GitHub will block from merging until it has been converted to a regular PR) because I need to review what it has come up with before the work is really a candidate for inclusion in the project.

Regardless of what level of help we request, our next step is to evaluate what the AI tool has come up with. Does the approach make sense? Is it scalable? Is it maintainable? At least with today’s technology, AI tools make mistakes (take a close look at the picture at the top of this post). It is the job of knowledgeable professionals who use the tools to catch the mistakes before they take root.

These mistakes can take many forms. One simple example: I had Manus create a page that let me browse items and filter by a few criteria. Its approach was to return all data to the browser and then filter client side. That could be okay for very small data sets, but is not scalable. I had it revise the code so that filtering would happen at the database level, reducing the amount of data that gets sent to the browser. The software developer’s role here is to 1) recognize the error and 2) either directly fix the error or provide direction on how to fix the error.
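To make the difference concrete, here is a minimal Python sketch using the standard library's sqlite3 module. The items table, its columns, and the function names are assumptions made up for illustration; this is not the code Manus produced. The first function mirrors the "return everything, filter afterward" approach, and the second pushes the same criteria into the query so that only matching rows leave the database.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory catalog (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, name TEXT, category TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO items (name, category, price) VALUES (?, ?, ?)",
        [
            ("widget", "tools", 9.5),
            ("gadget", "tools", 24.0),
            ("novel", "books", 12.0),
            ("atlas", "books", 40.0),
        ],
    )
    return conn


def filter_after_fetch(conn, category, max_price):
    """Anti-pattern: pull every row, then filter in application code."""
    rows = conn.execute(
        "SELECT id, name, category, price FROM items ORDER BY id"
    ).fetchall()
    return [r for r in rows if r[2] == category and r[3] <= max_price]


def filter_in_database(conn, category, max_price):
    """Preferred: let the database apply the criteria via a WHERE clause."""
    return conn.execute(
        "SELECT id, name, category, price FROM items "
        "WHERE category = ? AND price <= ? ORDER BY id",
        (category, max_price),  # bound parameters, never string-built SQL
    ).fetchall()
```

Both functions return the same rows, which is exactly why this kind of mistake is easy to miss in a demo with a handful of records. The cost only shows up as the table grows and every request ships the whole data set.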

Of course, an error can be much bigger in scope than this simple example. If the AI tool is making choices about software components, it takes knowledge and experience to consider whether those choices will be valid over time. Getting locked into an approach that causes problems a year down the road is no fun. In contrast to the Specify stage, this is where we ensure that we are building the thing right.

With a fully human developer team, this process is part of peer reviews.

QA

Assuming that the approach taken is valid, there’s still the key question of whether it works correctly. Does it meet all the acceptance criteria? Does it do the right thing with expected inputs? Does it respond gracefully to unexpected inputs? I would argue that the QA step (which is necessary regardless of how code is created) becomes even more important than ever. For a human developer using an AI tool, that developer needs to take on at least the initial QA responsibility.

Fix What it Can’t

I’d expect anyone who has tried this approach has found cases where the AI simply failed at the task. It produces the wrong thing, or code that is invalid, or just doesn’t work. I’ve gotten into loops:

me: here’s the problem
AI: okay, try this (solution A)
me: that didn’t work, here’s what happened
AI: ah yes, of course, try this (solution B)
me: that didn’t work, here’s what happened
AI: ah yes, of course, try this (solution C)
me: that didn’t work, here’s what happened
AI: ah yes, of course, try this (solution A again!)
me: … we already tried that?

Part of the human developer’s role is to be able to make progress from this point. Sometimes the approach is to provide a better prompt. I’ve also gotten good results having a different AI system tackle the same problem, starting from where the first one got stuck. I’ve had Claude give me good answers when Manus can’t figure something out. Sometimes we just have to apply the knowledge and skill that makes us highly valued professionals.

Further Considerations

There are a few other things that developers (and their managers) should think about.

Security

Since the rise of LLMs, this has been an important consideration. Where are your inputs going? Will they become part of future model training? For situations where you need to protect sensitive data or intellectual property, knowing the answers to these questions is important. For example, Manus is based in Singapore. Some types of data can’t be shared with Manus due to privacy restrictions. The developer needs to be very aware of such restrictions. In some cases, the solution is for a business to have its own in-house LLMs. That introduces a maintenance burden (and may not have all the capabilities of some cloud-based tools), but gives confidence that the business can control its data.

Subscription Costs

Do you know how pricing works for your AI tools? Understanding pricing structures and capabilities is important for getting the best results. If I need a throw-away script to accomplish a simple task, Copilot can often handle that. Likewise, Copilot can look at a section of code and help you understand what it does. However, it’s much less effective at bigger-picture problems. Manus has done much better with this.

An important element is what these tools are and how much access I have to them. Microsoft Copilot is part of our Microsoft 365 subscription. I have not upgraded, but this gives me broad access to the GPT 5.2 model. With Manus, I have a subscription, so I get 300 tokens per day and a few thousand more per month. This is an agentic AI, so I can do things like give it a GitHub token, tell it to review an issue, ask clarifying questions, then create a PR to address the issue. I’m using the free level of Claude, but have given it access to a few directories on my laptop so that it can see (and after asking, write) files. This level gets me the Sonnet 4.5 model.

The options, capabilities, and pricing available at your organization are likely different. These tools are changing quickly; by the time you read this the models for each of mine may be very different. Understanding the costs and capabilities improves the efficiency you get from those tools.

Conclusion

You may be wondering how far I got using Manus and friends. I was successfully able to build a reasonably simple CRUD application using a database that I’m rusty on but familiar with. For the UI, Manus introduced a library that I later learned has come into common use (I write some UI code, but I tend to focus on the back end).

For getting the existing project unstuck, Manus created a PR that solved an issue that had been outstanding for months (largely due to prioritization).

My takeaway is that a knowledgeable developer can realize productivity gains from AI. As with most things, you’ll get out of it what you put into it. A developer with sloppy habits will generate sloppy code much faster. There are cases where that can be okay — if the job is low-priority, low-impact projects where errors won’t cause problems. To build production-ready systems that people will rely on, technical leaders need to emphasize taking responsibility and pride in work done well. Focus on avoiding “slopware engineering”!

Data Agility Starts with Smart Technology Choices: How Rigid Schemas Hold You Back

Dave Cassel — Mon, 28 Jul 2025 17:53:00 +0000

In today’s data-driven world, agility isn’t just a competitive advantage—it’s a necessity. Organizations are increasingly building data hubs to break down data silos from across the enterprise, enabling better analytics, faster decision-making, and more responsive operations. But as any data leader knows, integrating data from diverse sources is no easy task.

Data sources often express similar information differently, with varying structures, vocabularies, and labels. Figuring out a common representation of all that data is a major project in itself, before you can even start to take advantage of it. The technology choices you make around how data is stored and modeled can either enable your team to move quickly or bog them down in complexity.

This post explores why rigid schemas are a poor fit for modern data hubs, and how a more flexible, layered approach to data modeling can unlock true data agility.

The Challenge of Diverse Data Sources

When building a data hub, one of the most common challenges is dealing with the diversity of incoming data. Different data sources are accessed in different ways, and they each have their own way of expressing information. Field names differ, data types vary, and even the meaning of similar-looking data can shift depending on context.

This diversity isn’t just a technical inconvenience—it’s an obstacle to creating a single, unified view of your business. Before you can analyze or make use of the data, you need to reconcile these differences. For many organizations, that means looking in different systems for subsets of data—trying to find all the puzzle pieces needed to assemble the complete data picture. Choosing to build a unified data hub means mapping fields, aligning semantics, and often transforming the data into a common format that your systems and teams can work with.

This harmonization process is essential, but it’s also time-consuming and resource-intensive. And it becomes even more complex when the underlying data sources evolve. New fields appear, formats change, or entirely new systems are added to the mix. If your data platform isn’t built to accommodate this kind of change, your team will constantly be playing catch-up.

The Pitfall of Rigid Schemas

Traditional databases with a single model often rely on rigid schemas—fixed structures that define exactly what data can be stored and how it must be formatted. While this approach works well for systems with stable, predictable data structures, it becomes a major obstacle when building a data hub that needs to integrate and evolve.

Rigid schemas force teams to define the entire structure of the data up front. That means anticipating the type, cardinality, and relationships of every field before you can even begin ingesting data. This delays time-to-value and creates a bottleneck for innovation.

But in reality, not all data is equally valuable at the beginning of a project. A more agile approach is to identify a subset of the data that delivers immediate business value. Find the key fields that support analytics, reporting, or operational decisions, then harmonize that core data so you can move quickly and start delivering insights that drive your business.

At the same time, it’s critical to preserve the rest of the data in a way that’s associated with the correct records and accessible for future use. Storing this supplementary information as JSON alongside the harmonized data provides a flexible, schema-light way to retain context without locking yourself into a rigid structure. As your understanding of the data grows or new use cases emerge, you can progressively enrich your schema without reengineering your entire pipeline.

This iterative approach of using a core schema and extensible raw data enables agility, adaptability, and long-term scalability.
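As a rough sketch of that pattern, the Python below uses the standard library's sqlite3 and json modules. The customer table, the field names, and the helper functions are all hypothetical, and a real data hub would more likely sit on a platform with native JSON support. The idea is the same either way: a few harmonized columns, the full source record kept as JSON next to them, and a later step that promotes a field out of the raw data once it becomes valuable.

```python
import json
import sqlite3


def create_hub():
    """Create a table with a small core schema plus a raw JSON column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE customer (
               id    INTEGER PRIMARY KEY,
               name  TEXT NOT NULL,   -- harmonized core field
               email TEXT,            -- harmonized core field
               raw   TEXT NOT NULL    -- original source record, as JSON
           )"""
    )
    return conn


def ingest(conn, record, name_field, email_field):
    """Map the source's own field names to the core schema; keep the rest."""
    conn.execute(
        "INSERT INTO customer (name, email, raw) VALUES (?, ?, ?)",
        (record[name_field], record.get(email_field), json.dumps(record)),
    )


def promote_field(conn, column, source_keys):
    """Later enrichment: add a column and backfill it from the raw JSON."""
    if not column.isidentifier():  # column names cannot be bound as parameters
        raise ValueError(f"unsafe column name: {column!r}")
    conn.execute(f"ALTER TABLE customer ADD COLUMN {column} TEXT")
    for row_id, raw in conn.execute("SELECT id, raw FROM customer").fetchall():
        record = json.loads(raw)
        # Take the first source key present; different sources may name it differently.
        value = next((record[k] for k in source_keys if k in record), None)
        conn.execute(
            f"UPDATE customer SET {column} = ? WHERE id = ?", (value, row_id)
        )
```

Two sources that label the same information differently can be ingested with, for example, ingest(conn, {"full_name": "Ada Lovelace", "mail": "ada@example.com"}, "full_name", "mail") and ingest(conn, {"customerName": "Alan Turing", "region": "EMEA"}, "customerName", "emailAddress"). When a region report is requested months later, promote_field(conn, "region", ["region"]) fills the new column from records already on hand, without re-ingesting anything.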

A More Agile Approach to Data Modeling

To support agility, your data platform needs to accommodate both structure and flexibility. That means moving away from an “all-or-nothing” schema mindset and embracing an iterative approach to data modeling.

Start by defining a core schema that captures the most valuable and commonly used data elements. This schema should be designed to support a small number of initial use cases without trying to account for every possible variation or future need.

Then, instead of discarding or ignoring the rest of the data, store it in a flexible format that can be associated with the same records. JSON works well for this purpose. This allows you to retain the full context of the original data without forcing it into a rigid mold. When new requirements arise, or when deeper insights are needed, you can revisit that raw data and selectively incorporate additional fields into your harmonized model.

This approach offers several key benefits:

Faster onboarding of new data sources: You don’t need to fully model every source before you can start using it.
Incremental schema evolution: You can adapt your model over time as business needs change.
Preservation of context: You retain access to the original data, which can be critical for auditing, troubleshooting, or future enrichment.

By combining a well-structured core with an easily accessible record of the upstream data, you create a data foundation that’s both stable and adaptable—ideal for the dynamic needs of modern enterprises.

Technology Selection Criteria for Data Agility

Choosing the right technology is critical to enabling data agility. Performance and scalability are important, but your platform also needs to support the evolving needs of your data and your business. The goal is to move quickly, adapt easily, and preserve flexibility as your requirements change.

Here are some key capabilities to look for when evaluating data management technologies:

Support for semi-structured data: Look for a platform that includes native support for formats like JSON. This allows you to retain raw data alongside structured fields, preserving context without forcing premature modeling decisions.
Extensibility and adaptability: Your platform should make it easy to add new fields, support new data types, and evolve your schema without disrupting existing pipelines or applications.
Integration-friendly architecture: The goal of a data hub is to have one source providing data for many services. Enabling connectivity is a key component of the project. Choose technologies that offer robust APIs and connectors while supporting common data exchange formats to simplify integration with upstream and downstream systems.
Scalability: Even while you focus your data modeling needs on the short- to medium-term, planning for the future is essential. Your technology should scale not only in terms of data volume, but also in complexity. Over time, you may add use cases, enrich your data models, and add more sources.

By prioritizing these capabilities, you’re not just choosing a tool—you’re laying the foundation for a data strategy that can grow and adapt with your organization.

Real-World Implications

The technical decision of how you approach data modeling may seem like an implementation detail, but it is central to your team’s ability to move faster, respond to change, and deliver value quickly.

For example, when onboarding a new data source, a rigid schema might require weeks of upfront modeling and integration work. With a more flexible approach, you can ingest the data immediately, harmonize the most important fields, and start generating insights while preserving the rest for future use.

This agility also supports innovation. As new opportunities emerge, your team can quickly adapt the data model to support their needs without reengineering your entire pipeline. That means faster time-to-insight, lower integration costs, and a more responsive data strategy overall.

In short, data agility empowers your organization to treat data as a living asset—one that evolves with your business, rather than holding it back.

Conclusion

With any software project, there are some aspects that are worth getting right in the beginning. Adding security or accessibility to an almost-done project is much more complex than including them in your approach from the beginning. Likewise, a solid foundation to your data modeling strategy gives you the flexibility to move quickly, adapt to change, and evolve over time.

Rigid schemas and inflexible platforms slow progress. They force premature decisions, delay time-to-value, and make it harder to respond to new opportunities. By contrast, a layered, iterative approach to data modeling—harmonizing what matters now and preserving the rest for later—gives your team the tools they need to shorten time-to-value and adapt as your needs grow.

Choosing the right technology is a critical part of that journey. Look for platforms that support semi-structured data, enable schema evolution, and make integration seamless. With the right foundation, your data hub becomes more than a repository—it becomes a catalyst for innovation.

Winning Hearts and Minds: Strategies for Team Buy-In on Data Projects

Dave Cassel — Fri, 22 Nov 2024 15:19:00 +0000

When it comes to Big Data initiatives, selecting the right tools, platforms, and architectures is important. Often overlooked is the importance of getting your team on board with the project and the approach. Gaining buy-in across business units is critical to the success of any data project. Without it, even the best technologies can fail to deliver meaningful results.

I have seen the impact of buy-in—or a lack of buy-in—and how that affects project outcomes. In this post, we’ll identify the key stakeholders you need to engage, provide actionable strategies to build consensus and enthusiasm for your data project, and look at the impact of this important step.

Why Team Buy-In Is Critical

Data projects often affect multiple parts of an organization, impacting business decision-making and day-to-day workflows. While the technical and financial aspects of a project are vital, its ultimate success depends on how well it integrates into the organization’s culture and processes.

Here’s why buy-in matters:

Smooth Implementation: When stakeholders are pulling in the same direction, it’s easier to address roadblocks and adapt to challenges.
Faster Adoption: Teams that feel involved are more likely to embrace the new tools or processes.
Better Insights: Input from end users ensures the project aligns with real-world needs, leading to more actionable insights.

Without buy-in, projects may face resistance, limited adoption, or even outright failure, wasting both time and resources.

Key Stakeholders in a Data Project

To achieve comprehensive buy-in, you need to engage all relevant stakeholders. These typically fall into four main groups:

Business Leaders

Why They Matter: They set organizational priorities and approve budgets. Their support can make or break the project.
What They Care About: ROI, alignment with business strategy, and competitive advantage.
What I’ve Seen: My team and I worked on a project that lacked buy-in from upper management. When the project ran into some challenges, an executive started making significant changes. These led to higher overall costs and delays in completing the project. A different client had solid buy-in from all stakeholders. When the original project champion left the company, the broad base of support allowed the project to move forward without a hitch.

Finance Teams

Why They Matter: They control the purse strings and evaluate the financial feasibility of the project.
What They Care About: Cost-effectiveness, long-term value, and measurable results.
What I’ve Seen: Small projects might be able to happen under the radar, but impactful projects will have some costs. These may include software licenses, infrastructure, and outside help. Working with finance teams to align the approach with the organization’s financial plan helps get them on board. Surprises about costs, on the other hand, raise red flags.

Technology Teams (IT and Data Specialists)

Why They Matter: They are responsible for implementing and maintaining the project.
What They Care About: Compatibility with existing systems, scalability, and technical feasibility.
What I’ve Seen: There are many approaches for data projects with new options becoming available all the time. Mastering new skills takes time. If the team sees that the approach being taken will be productive and they are given the time to come up to speed with new tools, they will be less likely to push back and want to revert to what they’ve done before. They need to feel confident in their ability not only to build the project, but to maintain it for the long term.

End Users

Why They Matter: They interact directly with the tools and processes, and their participation determines adoption success.
What They Care About: Usability, how it impacts their workflow, and training/support availability.
What I’ve Seen: I worked on a project that had leadership, budget, and the technical team aligned. We built a solid system with lots of data and interesting reports. Unfortunately, this was a “build it and they will come” project — but they didn’t. Better engagement early on could have produced better alignment with user needs.

Strategies to Build Team Buy-In

To create excitement and alignment around your data project, focus on involving stakeholders at every stage. The strategies below apply across these groups:

Involve Stakeholders Early

What to Do: Begin discussions with stakeholders before making key decisions. This includes business leaders, tech teams, and end users.
Why It Works: Early involvement helps identify concerns, align goals, and create a sense of ownership. If part of the benefit of the project is to reduce manual work, consider what that means for the people who are doing that work today. Ideally it means they can focus on higher-value, more rewarding efforts. Make sure that is communicated clearly.

Develop a Clear Business Case

What to Do: Create a compelling narrative that connects the project to organizational goals. Use metrics to emphasize the potential ROI and benefits.
Why It Works: A strong business case appeals to both business leaders and finance teams by tying the initiative to measurable success. When a project is tied to a compelling business need, it’s easier for all parties to grasp the value of seeing it through.

Create Prototypes or Pilot Projects

What to Do: Implement a small-scale version of the project or a specific feature for testing. Involve end users in the process.
Why It Works: Prototypes demonstrate value and practicality, giving stakeholders a chance to see the project in action and provide feedback. This also gives an opportunity for course correction if needs have changed. Involving end users creates anticipation as people see that the project will help them with their jobs.

Focus on Communication and Transparency

What to Do: Regularly update the various stakeholders on progress, challenges, and milestones. Use clear, non-technical language to ensure everyone understands.
Why It Works: Transparent communication builds trust and keeps everyone aligned toward the project’s goals.

Provide Training and Support

What to Do: Offer tailored training sessions and resources to help end users understand the new tools and processes. Provide ongoing support to address concerns. Solicit feedback from end users — and act on it where practical.
Why It Works: Empowering end users with knowledge reduces resistance and encourages adoption. Improving based on user feedback conveys responsiveness and a desire for the user community to be successful.

Celebrate Successes

What to Do: Acknowledge milestones and celebrate small wins throughout the project. Recognize team members who contribute to success.
Why It Works: Celebrating progress boosts morale and reinforces the value of the project. Hitting milestones helps business leaders and the finance team see that their investment is paying off.

Overcoming Common Challenges

Despite best efforts, resistance can arise. Here’s how to address common issues:

Stakeholder Apathy: Provide regular updates — or better yet: demos — to stakeholders to show concrete progress. Emphasize the personal and organizational benefits of the project. Discuss the impact of the project not moving forward. Sometimes the right answer is that the cost exceeds the benefits. Being realistic about this early in the process builds trust.
Technical Concerns: Run proof-of-concept sub-projects to address the concern. This might include ability to perform at scale, pace of development, security, or other concerns. Be laser focused on addressing the concern. Ensure you have the necessary expertise to give the approach a fair evaluation.
End User Resistance: Conduct workshops and listen to concerns. Highlight how the project will make their work easier and more efficient. Consider whether the resistance comes from not addressing an important need (figure this out early!), usability, performance, or job concerns.

The Bottom Line

Winning hearts and minds is an essential part of any Big Data initiative. These projects are often at the heart of an organization’s processes with potential for big impact. By involving stakeholders early, communicating effectively, and demonstrating the project’s value, you can build the support needed to ensure success.

Remember, technology may be the backbone of your project, but people are the heart. Investing time and effort into gaining team buy-in will pay dividends in smoother implementation, faster adoption, and better results.

Working with JSON in XQuery

Dave Cassel — Fri, 07 Sep 2018 18:28:11 +0000

MarkLogic supports two native languages, XQuery and JavaScript. XQuery is a very natural way to work with XML, just as JavaScript is a very natural way to work with JSON. However, sometimes it’s useful to cross over and work with JSON using XQuery or vice versa. This post has some tips on using XQuery to work with JSON in memory (to update JSON in the database, use the xdmp:node-* functions, such as xdmp:node-replace()).

JSON Nodes

The first thing to know is that JSON that we get from the database comes in an immutable form. Specifically, that’s a JSON Node. We can directly construct JSON Nodes using constructors, like this:

object-node {
  "a": 1,
  "b": 2
}

If all you want to do is retrieve a node from the database, or a part of one, working with nodes is great. One of the cool things is that we can apply XPath to nodes:

let $obj := object-node {
  "a": 1,
  "b": 2
}
return $obj/a

This makes it easy to select a portion of a JSON document; we might do that when we pull original JSON content out of an envelope document, for instance. We can also use this to pull data that is deep in a node structure, using a path like “/envelope/instance/TopProperty/lowerProperty”.

Changing

If we want to edit a JSON structure, we need a different approach. There are two ways we can go about it: recursive descent over JSON nodes, or by converting to maps. I’ll give recursive descent its own post later; for now, I’ll talk about maps.

A map:map is a mutable key-value structure. It can be hierarchical, because the values can themselves be maps. This should sound familiar: I could give that same description for a JavaScript object. In fact, when we want to represent JSON nodes in a mutable way, the way to do it is to convert them to the map:map structure. Then we can use map:put() to change an existing value or add a new one, and map:delete() to delete an existing value. When we’re done, we can convert back to JSON to update the database or send on to a client.

let $obj := object-node {
  "a": 1,
  "b": 2
}
let $map := xdmp:from-json($obj)
let $_ := map:put($map, "c", 3)
return xdmp:to-json($map)

It’s worth noting here that what we get back from xdmp:from-json() isn’t just a standard map:map, but a json:object. These behave very much the same, except that 1) its default serialization is as JSON, rather than the XML that map:map uses, and 2) it maintains the key order.

Note that while maps are great for manipulating data, we can’t apply an XPath like we can with a JSON node.

What Am I Looking At?

Query Console cheerfully presents JSON data such that it looks like JSON, regardless of whether you’re looking at JSON nodes or json:objects. This can make it hard to know what exactly you’re looking at. Knowing the format of your JSON data tells you how to interact with it. You can identify an item of each representation using instance of tests.

declare function local:report($item)
{
  "object node: " || $item instance of object-node() ||
  "; json:object: " || $item instance of element(json:object) ||
  "; map:entry: " || $item instance of map:map
};

let $node := object-node { "foo": "bar" }
let $obj := json:object() 
let $_ := map:put($obj, "foo", "bar")
let $json-obj := <x>{$obj}</x>/node() (: wrap in an element to get the XML serialization; the wrapper name is arbitrary :)
let $map := map:new(map:entry("foo", "bar"))
let $to-json := xdmp:to-json($obj)/node() (: xdmp:to-json returns a document node :)
let $to-json-map := xdmp:to-json($map)/node()
let $from-json := xdmp:from-json($node)
return (
  "node:        " || local:report($node),
  "json-obj:    " || local:report($json-obj),
  "map:         " || local:report($map),
  "obj:         " || local:report($obj),
  "to-json:     " || local:report($to-json),
  "to-json-map: " || local:report($to-json-map),
  "from-json:   " || local:report($from-json),
  "fn:data:     " || local:report(fn:data($node))
)

Results:

node:        object node: true; json:object: false; map:entry: false
json-obj:    object node: false; json:object: true; map:entry: false
map:         object node: false; json:object: false; map:entry: true
obj:         object node: false; json:object: false; map:entry: true
to-json:     object node: true; json:object: false; map:entry: false
to-json-map: object node: true; json:object: false; map:entry: false
from-json:   object node: false; json:object: false; map:entry: true
fn:data:     object node: false; json:object: false; map:entry: true

Constructing

There are two ways of building JSON data. We can use the JSON node constructors directly, or we can build up maps and pass the result to xdmp:to-json(). Here’s the constructor version:

object-node {
  "a": 1,
  "b": 2,
  "c": array-node {
    object-node { "fname": "Harrison", "lname": "Ford" },
    object-node { "fname": "Mark", "lname": "Hamill" },
    object-node { "fname": "Carrie", "lname": "Fisher" },
    object-node { "fname": "Natalie", "lname": "Portman" }
  }
}

And here’s the method using maps:

xdmp:to-json(
  json:object() =>
    map:with("a", 1) => 
    map:with("b", 2) =>
    map:with("c", 
      json:array() =>
        json:array-with((
          json:object() => map:with("fname", "Harrison") => map:with("lname", "Ford"),
          json:object() => map:with("fname", "Mark") => map:with("lname", "Hamill"),
          json:object() => map:with("fname", "Carrie") => map:with("lname", "Fisher"),
          json:object() => map:with("fname", "Natalie") => map:with("lname", "Portman")
        ))
      )
)

Personally, I find the direct constructor approach more natural. An important consideration is what you’re going to do with the structures once you have them. If you plan to modify them, go with the maps, do whatever modification you need, and then pass the finished product to xdmp:to-json — don’t convert back and forth. Likewise, if you’re constructing JSON to return to a client and won’t be persisting it in the database, stick with maps; you’ll find it runs faster.

Wrap Up

A little understanding goes a long way. I’ve been working with JSON data using XQuery as part of the Smart Mastering project; a better understanding of the data structures involved will let me write better code.

Data Hub Framework Flows

Dave Cassel — Tue, 14 Aug 2018 04:06:02 +0000

The Data Hub Framework is a feature recently added to MarkLogic that makes it easier to gather data from a variety of sources and build a common representation across the original formats. I learned some useful things about working with the framework that I thought were worth writing down (partly so that I’ll remember them).

My colleague at MarkLogic, Paxton Hare, started the MarkLogic Data Hub Framework project early in 2016. In the Spring of 2017, I joined him on the project. The requirements for this project were drawn from the developers who were building operational data hubs for customers.

Types of Flows

Paxton once told me that he thought about naming the types of flows differently: instead of “input” and “harmonize” flows, he thought “real-time” and “batch” would better describe what they do. I like the term “streaming” instead of “real-time”, to avoid confusion with real-time computing.

Input flows are run as transforms. Some other process sends data to MarkLogic, using MLCP, the REST API, or one of the libraries built on top of the REST API, and an input flow transforms the data along the way. Input flows have no writer, because the flow itself is not responsible for persisting the data. These can be thought of as “streaming” flows, in that the flow is applied to a document between an external process sending a document and MarkLogic persisting it. These are called input flows because they mark the first point of entry into the database.

Harmonize flows are a process in themselves. The first step of a harmonize flow is identifying which identifiers it will work on. (These identifiers may be URIs of documents already in MarkLogic, but they could also be values to search for in a MarkLogic database or they could identify resources to pull in from an external data source.) The flow then transforms and writes each document in turn. This represents a batch approach to modifying documents. They were originally called harmonize flows because this was often the stage where documents were copied from the staging database to final, harmonizing some properties along the way.

When first built, the common pattern of use was that input flows were used to bring data into the staging content database, then harmonize flows were used to turn raw data in the staging database into commonly-structured envelope documents in the final database. This pattern became so ingrained in me originally that I didn’t notice that you didn’t need both steps. It’s perfectly reasonable to use an input flow to send data directly to the final database.

Putting Flows to Work

For example, I designed a data hub project to collect data from various MarkLogic web sites (www, developer, docs, training, help). Once gathered into a single database, a search service on that database would make discovery of available material across those sites much easier. (We currently have this across www, developer, docs, and some training material, but it would be beneficial to update the implementation and expand the reach.) For this project, the first key is identifying what the common attributes are across the source sites. I came up with the following:

url (absolute URL, including protocol)
category (technical blog post, tutorial, recipe, guide, etc.; useful as facet)
last-updated (a date-time)
tags (zero or more tags, with values at the discretion of the content providers)
title (a string suitable for display with search results)

Each data source can construct an input flow to build an envelope, with the original content stored in an attachments XML element or JSON property, and the above properties expressed under an instance element or property. (Those element/property names are chosen to be consistent with Entity Services.) This data can be written directly to the final content database. No need to have an input flow to insert data, followed by a separate harmonize flow to construct the envelopes.
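As a rough illustration of the envelope shape described above, here is a minimal JavaScript sketch. It is not the Data Hub Framework’s own API: the function name buildEnvelope, the outer envelope wrapper, the sample source document, and its field names are assumptions made for this example. Only the instance and attachments property names and the five common attributes come from the text.

```javascript
// Hypothetical helper: wraps one source document in an envelope.
// `instance` holds the harmonized properties; `attachments` keeps the original.
function buildEnvelope(source, mapping) {
  return {
    envelope: {
      instance: {
        url: mapping.url(source),
        category: mapping.category(source),
        "last-updated": mapping.lastUpdated(source),
        tags: mapping.tags(source),
        title: mapping.title(source)
      },
      attachments: source
    }
  };
}

// Assumed shape of a blog post coming from one of the source sites.
const blogPost = {
  permalink: "https://developer.example.com/blog/some-post",
  heading: "Some Post",
  modified: "2018-08-01T12:00:00Z",
  labels: ["xquery", "json"]
};

// Each source supplies its own mapping, since the values live in different places.
const blogMapping = {
  url: (doc) => doc.permalink,
  category: () => "technical blog post",
  lastUpdated: (doc) => doc.modified,
  tags: (doc) => doc.labels || [],
  title: (doc) => doc.heading
};

const envelopeDoc = buildEnvelope(blogPost, blogMapping);
console.log(JSON.stringify(envelopeDoc, null, 2));
```

Keeping the per-source differences in a small mapping object means the envelope structure is defined once, while each site only describes where its own values live.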

Why Harmonize?

So if an input flow can write the harmonized data to the final database, why do we need harmonize flows? These are helpful when the process of building envelopes is less straightforward. As an example, when the Documentation team publishes new content, the guides are part of a large zip file. An input flow can bring the content into MarkLogic, but since it doesn’t have a writer, it won’t be able to break it up into appropriately sized chunks. A harmonize flow can break it up into separate documents and populate the envelope properties (URL, category, and so on).

If your data needs to go into MarkLogic with only a self-contained set of changes, and will remain static once there, MarkLogic recommends using an input flow to send it directly to the final content database. Consider sending data to the staging database and then using a harmonize flow to bring it to the final database if you have any of the following situations:

the original content has a significantly different form from what you want to make available in the final database (for instance, documents need to be split up)
similar to the above, if your final content database documents will be constructed from multiple input documents (as is often the case when relational data is ingested by table), send those to the staging database, then use a harmonize process to assemble the final entity documents

Iterating

A harmonize flow doesn’t have to move content from one database to another; it can also be used to update content in place.

For the web sites hub described above, suppose that we decided to harmonize an additional property, such as author. We can write a harmonize flow that both reads from and writes to the final content database. Because we store the original content in the attachments element of the document envelope, the flow can extract the content from the original source, add the new property, then overwrite the existing document. We’d need to write this flow for each of the input sources, assuming that the property would be found in different places in the various sources, but this would require very little coding. We’d also need to update the various input flows so that new documents would come in with the author property.

Once we’ve updated the input flows, how do we know which documents need the harmonize update? Here’s a part of the Entity Services document model, inside the instance:

"info": {
  "title": "WebContent",
  "version": "0.0.1"
}

When we add the author to the model, we increment the version number of the model. The harmonize flow’s collector plugin can then query against the old model version number.
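The selection logic can be sketched in plain JavaScript. This is not a Data Hub Framework collector plugin, which would run a query inside MarkLogic. It filters an in-memory array instead. The function name, the sample URIs, and the assumption that info sits directly under envelope.instance are mine.

```javascript
// Hypothetical stand-in for a collector: return the URIs of documents
// whose instance was built with a version other than the current model version.
function collectOutdated(docs, currentVersion) {
  return docs
    .filter((d) => d.content.envelope.instance.info.version !== currentVersion)
    .map((d) => d.uri);
}

// Assumed sample documents in the final database.
const finalDocs = [
  {
    uri: "/web/1.json",
    content: { envelope: { instance: { info: { title: "WebContent", version: "0.0.1" } } } }
  },
  {
    uri: "/web/2.json",
    content: { envelope: { instance: { info: { title: "WebContent", version: "0.0.2" } } } }
  }
];

const outdatedUris = collectOutdated(finalDocs, "0.0.2");
console.log(outdatedUris); // [ '/web/1.json' ]
```

Documents written by the updated input flows already carry the new version, so only the older ones are handed to the harmonize flow.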

Wrapping Up

The Data Hub Framework goes a long way to simplify the process of building an operational data hub. With a little better understanding of when to use each type of flow, your architecture will work even better.

Evolution of modeling relationships in MarkLogic

Dave Cassel — Fri, 03 Aug 2018 13:35:03 +0000

MarkLogic, as a multi-model database, can store data both as documents and as triples. We model entities as documents. Over time, the way we’ve modeled relationships has changed.

In the Beginning

Prior to MarkLogic 7, relationships were modeled by including either a URI or some other identifier as an element or attribute. In most cases, we’ll also “denormalize” some information from the linked entity into the document that refers to it. In a document describing a Person entity, we might have an element like this:


<author uri="/book/1234.xml">
  <title>All the Birds in the Sky</title>
  <genre>Science Fiction</genre>
</author>

Here we have a relationship between a Person and a Book, where the person is the author of the book. The uri attribute provides the link. We include the title and genre elements because that enables us to search for the Person (the author) based on those pieces of information. If the application needs more information about the book, it can use the uri attribute to load the document.

The only downside to this is that the application needs to know not only that this relationship exists, but exactly where in the document it should look to find it.

For completeness, we can represent that same information using JSON:

{
  "author": {
    "uri": "/book/1234.xml",
    "title": "All the Birds in the Sky",
    "genre": "Science Fiction"
  }
}

These two representations are generally the same (one exception: values in attributes aren’t available for word searches).

Relationships with Triples

The semantic capabilities that MarkLogic has supported since version 7 give us a different way to represent relationships. Instead of the URI of the book document being in an attribute or property, we can represent the connection as a triple:

<sem:triple>
  <sem:subject>/person/7bfbc09d-ef7f-4976-bf16-763b70bf3995.xml</sem:subject>
  <sem:predicate>http://example.org/wrote</sem:predicate>
  <sem:object>/book/1234.xml</sem:object>
</sem:triple>

In this triple, the subject is the URI of the Person document, the object is the URI of the book document, and the predicate identifies the relationship. The triple can either be in the Person document or the Book document (an unmanaged triple) or stored with other triples (a managed triple). There are a couple benefits to this.

1. Identifying all relationships among entities.
2. Using an ontology to ask more interesting questions.

For point #1, remember that if a relationship is represented in an attribute, element, or property, the application needs to know where to look in the document to find it. With triples, however, a SPARQL query can identify all the author-book relationships very easily:

select ?author ?book
where {
  ?author <http://example.org/wrote> ?book
}

True, that requires knowing the relationship used to connect books and authors. But we can also ask, how is a Person related to other entities?

select ?relationship ?entity
where {
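  { </person/7bfbc09d-ef7f-4976-bf16-763b70bf3995.xml> ?relationship ?entity }
union 
  { ?entity ?relationship </person/7bfbc09d-ef7f-4976-bf16-763b70bf3995.xml> }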
}

All we need to know in this case is the entity (Person) URI that we want to inquire about. (The union keyword allows us to look for our entity URI in either the subject or object position of a triple.)

Not only does this query tell us what entities the Person entity is connected to, we’re given the predicates that connect them. One of the really cool things about using RDF triples to store relationships is that we can describe the predicates with triples, right in the database itself. For instance, the triple we showed above uses the <http://example.org/wrote> predicate. We can add a triple like this to the database:

<sem:triple>
  <sem:subject>http://example.org/wrote</sem:subject>
  <sem:predicate>rdfs:comment</sem:predicate>
  <sem:object>Connects an author to a book written by the author</sem:object>
</sem:triple>

With this in the database, we can show an end-user how the entities are connected, just by expanding our query a bit.

In addition, if we have a good ontology in our database, we can ask more interesting questions. Given the data shown above, our application can look for authors of Science Fiction books. We might want to ask a broader question and find authors of all types of fiction. If our ontology recognizes that Science Fiction is a type of Fiction, we can use inference to include Science Fiction, Mystery, Historical Fiction, and other sub-types when we look for Fiction authors.

Using SPARQL, we can also use property paths to follow links. Suppose our data set includes links, recording the manager for each employee in an HR database. We can find one person’s boss with a simple query:

select ?boss
where {
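  # <http://example.org/ann> and <http://example.org/manager> are placeholder IRIs
  <http://example.org/ann> <http://example.org/manager> ?boss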
}

Suppose we want to find the chain of managers from Ann all the way to the CEO. With the original style of recording URIs in an element, attribute, or JSON property, we’d need to retrieve Ann’s manager, do another query to find that person’s manager, and so on. We’d need the same process with a relational structure. But with SPARQL, we can add a single character to the query above: “+”.

select ?boss 
where {
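  # placeholder IRIs; the "+" applies to the manager predicate
  <http://example.org/ann> <http://example.org/manager>+ ?boss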
}

The “+” says to follow that predicate one or more times, and the query returns a list of managers from Ann all the way to the top. SPARQL provides several “property paths” that enable more interesting queries for exploring a data set.
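To make the contrast concrete, here is a small JavaScript sketch of the manual approach: one lookup per level, repeated until there is no manager left. That loop is what the “+” property path expresses in a single query. The names managerOf and managerChain and the sample people other than Ann are invented for illustration.

```javascript
// Assumed sample data: each key is an employee, each value is that person's manager.
const managerOf = new Map([
  ["Ann", "Bob"],
  ["Bob", "Carla"],
  ["Carla", "Dana"] // Dana is the CEO and has no manager entry
]);

// Follow the manager link one or more times, collecting each boss along the way.
function managerChain(employee, links) {
  const chain = [];
  const seen = new Set([employee]); // guard against cycles in bad data
  let current = links.get(employee);
  while (current !== undefined && !seen.has(current)) {
    chain.push(current);
    seen.add(current);
    current = links.get(current);
  }
  return chain;
}

console.log(managerChain("Ann", managerOf)); // [ 'Bob', 'Carla', 'Dana' ]
```

With documents or relational rows, each pass through the loop would be a separate query. The property path moves that iteration into the database.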

Template Driven Extraction

MarkLogic 9 adds Template Driven Extraction (TDE). Using TDE, we can pull information from documents directly into indexes, accessible from either SQL or SPARQL queries. We can do this without changing the original document structure itself. This approach lets us connect our XML and JSON documents to a full ontology without running transforms.

Wrapping Up

The triples model allows for discoverable connections among entities. The biggest challenge is to make use of this representational power in a useful way, by selecting predicates that link into a relevant ontology. (I like to use http://dbpedia.org/fct/ to help find relevant IRIs.) This only applies when your application has an ontology to connect to, but with that or without it, storing connections among entities, along with descriptions of those connections, allows data and its meaning to exist side-by-side. This advantage, plus the discoverability of relationships, is a clear improvement that will give your applications more powerful search.

Building a MarkLogic Data Model

Dave Cassel — Thu, 06 Aug 2015 16:53:25 +0000

This post is an excerpt from a book I’m working on: MarkLogic for Node.js Developers. This section is part of a chapter on Data Modeling, falling after a comparison to relational database modeling and a discussion of denormalization. The goal is to address the question of what should be a document in MarkLogic. The next section illustrates these points using Samplestack as a case study. Feedback welcome.

Building a Data Model

There are several factors to consider when deciding what to include in a document.

Document databases can hold many types of documents.

In a document database, documents that represent different types of entities can sit side by side. A book database might have separate documents for books, authors, and publishers, each containing the bulk of information related to that type.

A document is the unit of search.

When doing a search against a MarkLogic database, the typical goal is to identify which documents match a particular query. Understanding what an application’s users will want to search for informs the types of documents you should have.

Include what will be searched for.

When considering a book database, a user might want to search for books by a particular publisher, but will probably not be looking for books published by a company based in a particular region, or founded in a certain year. Such information can be left out of the book document: it will not contribute to search, so there is no benefit to repeating it. Repeating the publisher’s name in book documents makes more sense. Data that is more helpful when searching for publishers will be included in publisher documents.

Don’t repeat what will be updated often.

Pieces of data that will change often should be normalized. By contrast, a publishing company’s name will not change often, and therefore could be denormalized into other documents if searching on it is important.

Dynamically calculate values that will change quickly.

How many books has an author sold? The answer is the sum of the sales of each book the author has published. The essential data is the per-book sales. The total will change frequently; storing the total will lead to frequent updates and a need to work at the application level to ensure the number stays correct. Conversely, the total is easy to calculate at run-time and can be done very quickly using indexes.
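As a small illustration, the total can be derived from the per-book figures whenever it is needed. The sketch below uses plain JavaScript over an in-memory array; in MarkLogic the sum would come from indexes instead. The field names and numbers are made up.

```javascript
// Assumed sample data: per-book sales are the only stored figures.
const bookSales = [
  { title: "Book A", author: "Author One", sales: 1200 },
  { title: "Book B", author: "Author One", sales: 800 },
  { title: "Book C", author: "Author Two", sales: 500 }
];

// Compute an author's total at read time rather than storing it.
function totalSales(author, bookList) {
  return bookList
    .filter((b) => b.author === author)
    .reduce((sum, b) => sum + b.sales, 0);
}

console.log(totalSales("Author One", bookSales)); // 2000
```

Because nothing stores the total, a new sale only touches the one book record, and the sum can never drift out of date.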

Size documents appropriately.

In MarkLogic, the ideal document size is in the range of 10 kilobytes to 1 megabyte. Larger documents take time to read from disk when they need to be retrieved. Very small documents are less efficient, since there is some overhead introduced for each document.

Choose JSON or XML, or a mix.

In some ways, the choice between JSON and XML is a matter of preference. For a Node.js developer, JSON is a very natural choice, as JSON and JavaScript are so closely related. This book will focus on JSON.

If there is a starting data set that uses XML, the developer may choose to keep it that way in the database, but transform to JSON in response to requests for data.

There are some differences in what can be represented in JSON versus XML. XML is good for text that will be marked up. For instance, consider a document that will be passed to an entity extraction engine to identify person names, locations, organizations, dates, and other information. In some cases, we just want to know that these things exist within a document, in which case we can store it in JSON. However, if we want to mark up the document inline, so that we can later look for entities near each other, XML handles this well. XML also allows for attributes, which describe elements.

<person>David Cassel</person> started working for <organization>MarkLogic</organization> in <date>2009</date>. Before that, he worked for <organization>Lockheed Martin</organization>.
Figure 6: Example XML data showing markup

Overall, XML is an expressive format for representing content (text in a hierarchical structure), while JSON is good for data: key/value pairs, arrays, and other data that consists of scalar data at various levels of the document hierarchy.

MarkLogic is schema-agnostic.

Relational databases require a schema to describe data. XML documents stored in MarkLogic may be required to adhere to an XML schema, but this is optional. In most cases, no formal schema is used and documents with multiple, informal schemas exist within a database. This flexibility is what is meant by schema-agnostic. MarkLogic contrasts that with “schemaless” databases, which do not provide the option to require a schema.

There is no widely accepted standard for JSON schemas at this time and MarkLogic does not support requiring a schema for JSON documents.

MarkLogic supports two-stage queries.

Although MarkLogic documents are typically denormalized, sometimes a query requires some data from one type of document to query a different type of document. Data modeling in MarkLogic seeks to minimize this, but when necessary, an application can do a two-stage query. This is effectively a join and is avoided where practical for the same reasons it is problematic for relational databases: two-stage queries are necessarily slower than a single-stage query.
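The shape of a two-stage query can be sketched with in-memory arrays standing in for the two document types. This is plain JavaScript, not the MarkLogic Node.js client API, and the sample publishers, books, and function names are assumptions for illustration.

```javascript
// Assumed sample data: two document types linked by publisher name.
const publisherDocs = [
  { name: "Northwind Press", region: "Europe" },
  { name: "Lakeside Books", region: "North America" }
];
const bookDocs = [
  { title: "First Title", publisher: "Northwind Press" },
  { title: "Second Title", publisher: "Lakeside Books" },
  { title: "Third Title", publisher: "Northwind Press" }
];

// Stage 1: query one document type for the values the second query needs.
function publishersInRegion(region) {
  return publisherDocs.filter((p) => p.region === region).map((p) => p.name);
}

// Stage 2: use those values to query the other document type.
function booksByPublishers(names) {
  const wanted = new Set(names);
  return bookDocs.filter((b) => wanted.has(b.publisher)).map((b) => b.title);
}

const europeanTitles = booksByPublishers(publishersInRegion("Europe"));
console.log(europeanTitles); // [ 'First Title', 'Third Title' ]
```

The second query cannot start until the first has returned, which is where the extra cost comes from.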

Memoization in XQuery

Dave Cassel — Mon, 28 Jul 2014 21:33:32 +0000

Memoization is tracking partial solutions so that they don’t have to be recalculated. A good example of where this is handy is the Fibonacci sequence. You may remember that the definition of this is:

F(n) = F(n - 1) + F(n - 2)
F(1) = 1
F(2) = 1

Clearly, this is a recursive function. Let’s take a look at the naive implementation:
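A minimal sketch of that recursion, written in JavaScript rather than as the XQuery local:fib function this post measures (the name fibNaive is mine):

```javascript
// Naive recursion: follows the definition directly.
function fibNaive(n) {
  if (n <= 2) {
    return 1; // base cases F(1) = F(2) = 1
  }
  return fibNaive(n - 1) + fibNaive(n - 2);
}

console.log(fibNaive(10)); // 55
```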

The good news is that we get the right answer.

local:fib(10) => 55

The bad news is that this is a seriously inefficient way to do it. Why? Let’s look at what happens when we calculate local:fib(10):

F(10) = F(9) + F(8)

As we calculate F(9), we’ll figure out F(8) and F(7). Our function will recurse through these until it has an answer, bringing us to

F(10) = 34 + F(8)

Now we recurse on F(8) to find that value. But wait, we already calculated F(8) as part of finding the value of F(9)! We need a way to keep track of the values we’ve already calculated so that we don’t have to recompute them.

Enter the map. Maps, or associative arrays, are a tool available in many programming languages, including MarkLogic’s XQuery. While not part of the XQuery 3.0 standard, this is a tool you want in your belt. You’ll find some great documentation on the mapping operators; my goal here is to show a technique where maps will help with your runtime performance.

Let’s try another version of that function.

The base case remains the same, but when we’re computing other values, we put them into the map as they are calculated.
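The same idea can be sketched in JavaScript, with a Map playing the role of the MarkLogic map. This is an illustration of the technique, not the XQuery function profiled below; the name fibMemo is an assumption.

```javascript
// Memoized version: a Map remembers every value already computed.
function fibMemo(n, memo = new Map()) {
  if (n <= 2) {
    return 1; // base case is unchanged
  }
  if (memo.has(n)) {
    return memo.get(n); // reuse a previously calculated value
  }
  const value = fibMemo(n - 1, memo) + fibMemo(n - 2, memo);
  memo.set(n, value); // store the result as soon as it is calculated
  return value;
}

console.log(fibMemo(20)); // 6765
```

Each value of n is now computed once, so the number of calls grows linearly with n instead of exponentially.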

Performance

Let’s see what kind of difference this makes. I set up each of these functions in Query Console and called each function with n = 20 while looking at the Profile tab.

Naive version: 85,353 expressions; about 0.046 seconds.

Improved version: 354 expressions; about 0.00034 seconds.

That’s two orders of magnitude fewer expressions that needed to be run, reducing the run time by a similar amount.

This approach is useful for problems where:
1. you have a recursive solution
2. some of the calculations would otherwise be repeated

Loading JSON into MarkLogic 7

Dave Cassel — Mon, 30 Jun 2014 22:04:38 +0000

This post shows how to ingest JSON into MarkLogic 7 using mlcp. Unlike many of my posts, this one is very specific to MarkLogic 7.

Since the release of MarkLogic 6, MarkLogic Content Pump (mlcp) has been the supported tool for importing, exporting, and copying content. One feature that’s missing from it is the ability to load JSON files without having them stored as text files. To expand on that, let me point out that MarkLogic 7 is part of a transition in how MarkLogic handles JSON files.

MarkLogic Version	handles JSON as
5	text
6, 7	quietly converted to XML
8	native type

In MarkLogic 5, JSON documents are stored as text. As with any text document, that lets you do word searches, but you’re not able to use the structure.

In MarkLogic 6 and 7, you can load JSON using the REST API and MarkLogic quietly converts it to an XML format. When you request the document back, MarkLogic quietly converts it back to JSON. The reason for this is that handling JSON was a goal for MarkLogic 6, but it’s done at the REST API level; internally, actual JSON would just be text, preventing us from building indexes and otherwise working with the structure. By converting it to XML internally, we can do much more with it.

In MarkLogic 8, JSON is planned to be a native type. I tested loading JSON with mlcp today on an ML8 development build, and mlcp loads JSON as native content, meaning that the structure is accessible without having to do anything special.

Ingest via REST API

We can ingest JSON documents and have them transparently converted by POSTing them to the REST API. Here’s how to load a directory of JSON documents.

for f in ~/data/json-data/*.json; do
  # POST each file; the REST API assigns a URI under /content/ with a .json extension
  curl --anyauth --user admin:admin -X POST -d @"$f" -i \
    -H "Content-type: application/json" \
    'http://localhost:8040/v1/documents?extension=json&directory=/content/';
done

That works great, but for larger amounts of data, you lose out on mlcp’s ability to parallelize the workload.

Ingest via MLCP

MarkLogic’s documentation describes how to use a transform with mlcp. Here’s a simple transform that applies MarkLogic’s json:transform-from-json() function:

And here’s the call to have mlcp use it:

MarkLogic, Angular, and node: Authentication

Dave Cassel — Sat, 31 May 2014 02:46:20 +0000

My first effort with the DemoCat application had a pretty simple architecture: node hosted static resources and passed REST requests straight through to MarkLogic. This approach is more in keeping with the three-tier architecture that MarkLogic’s customers typically use, but my middle tier was doing almost nothing. I’ve made some improvements, especially related to user management. I should point out that I’m new to node, so if this looks off-base, let me know, but this looks like it’s moving in the right direction.

Current Approach

In my first effort, node.js hosted the HTML/JS/CSS; the UI was built with AngularJS; and the node layer proxied requests to MarkLogic’s REST API straight through without looking at them at all.

What’s Wrong with Pass-through

The problem with this approach is that it effectively exposes the MarkLogic REST API to the end-user, which is a security vulnerability. (Application Builder directly exposes the REST API, too, with no middle tier.) MarkLogic does have a robust security model; it’s possible to lock down an application using that model, but there are a few reasons why I like doing so in the middle tier.

Role versus User Permissions

Consider the documents in which we store user profiles. In Demo Cat, I’m storing the user’s full name and a list of email addresses that the user has access to. In MarkLogic, we set document permissions with roles. Users who belong to a role that has “update” permission may update a document. That means that to let my user (dcassel) update my profile, but not update another profile document, I would set up a “dcassel” role with update permission on the profile document. Only the dcassel user would get the dcassel role. This would work, but it isn’t in the spirit of how roles are intended to be used; rather, the intended use is that a role represents a group of users.

Note: there are certainly applications out there where roles were created for individual users, and that was done for legitimate good reasons. It’s not that you can’t or shouldn’t, but if you find yourself creating roles for individual users, it’s worth taking a look to see whether another way makes sense.

There are other ways to limit who can update a profile document. For instance, I could create an extension specifically for doing so and have that extension check whether the current user matches the profile’s owner. However, doing that would still leave a backdoor open, in the sense that /v1/documents would need to be locked down.

Mindset

A big difference between limiting access to the MarkLogic REST API versus using internal security to limit what can be done with it is the mindset we use to approach it. If we directly expose the MarkLogic REST API to the end user, we’re in a black-list mentality: we need to figure out what to shut down. But if we deny direct access to the REST API, we can approach it as a white list, deciding which features to expose, how, and to whom. To make an application secure, that makes more sense to me.

Mindset Part 2

Another note on mindset — MarkLogic is capable of hosting applications completely, without using a middle tier at all, but some customers don’t want that. They prefer to have a business logic tier with a hard line on a diagram connecting it to the database, rather than the database actually hosting the middle tier. (For anybody who read that and thought, “who would do that?”, when you have such a capable language running at the database level, you really can express all your business logic there. But I get it, you want more distance between the layers.) When you design a system with distinct business and database tiers, you’re not going to turn around and expose the database tier directly to the end user.

MarkLogic Security Still Useful

You should definitely not think, by reading the above, that MarkLogic security stops being useful when you introduce a middle tier. Far from it! With MarkLogic security, you can decide which groups of users (roles) can see which documents, then these restrictions are accounted for when performing searches. You can also easily prevent unregistered users from creating or deleting content, even in cases where your middle tier doesn’t know enough to decide what’s allowed. You have multiple levels at which to apply security, allowing you to focus on content (documents, directories, etc) or access methods (API).

Middle Tier Access Control

Okay, locking down the REST API makes sense, but does that mean we need to write custom endpoints for every touch on the database? For some systems, the answer is probably yes. Given that my goal is rapid application development, but trying to be on a path toward production applications, I’ll take the middle road. I want the client code (written in AngularJS) to be as portable as possible from one application to another.

Node.js hosts my static resources and exposes endpoints for functionality. For the sake of portability, my AngularJS code will make requests to REST API URLs. However, those requests go to node, where I can decide what to do with them. Consider this function:

Session management is done at the node layer, so if the user hasn’t logged in, no PUT requests are allowed. If the user has logged in and is PUTing a user profile, the profile has to be the one that belongs to the user. (I haven’t yet written code that will update the in-memory user profile based on the updates that have been sent.) The proxy() function simply relays the request on to MarkLogic, complete with credentials identifying the user.
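A minimal sketch of that rule follows; it is not Demo Cat’s actual code. The function name decidePut, the shape of the session object, and the /users/<name>.json profile URI convention are all assumptions made for illustration.

```javascript
// Hypothetical decision function for PUT requests arriving at the node layer.
// `session` is whatever was stored at login (assumed shape: { user: { name } }).
function decidePut(session, docUri) {
  if (!session || !session.user) {
    return { allowed: false, status: 401, reason: "not logged in" };
  }
  // Assumed convention: profile documents live at /users/<username>.json
  const profileMatch = /^\/users\/([^/]+)\.json$/.exec(docUri);
  if (profileMatch && profileMatch[1] !== session.user.name) {
    return { allowed: false, status: 403, reason: "profile belongs to another user" };
  }
  // Anything else would be relayed to MarkLogic with the user's credentials.
  return { allowed: true, reason: "proxy to MarkLogic" };
}

const loggedIn = { user: { name: "dcassel" } };
console.log(decidePut(null, "/users/dcassel.json").allowed);     // false
console.log(decidePut(loggedIn, "/users/dcassel.json").allowed); // true
console.log(decidePut(loggedIn, "/users/other.json").allowed);   // false
```

Keeping the decision separate from the proxying makes the rule easy to test without a running MarkLogic instance.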

You can see the rest of the middle tier at Demo Cat’s GitHub. You’ll see that GET requests are proxied right through (at the moment, I’m intending to change that); PUT and POST require the user to have logged in; DELETE requests are not supported at all; and a few /user/ endpoints govern logging in and out.

Locking Down MarkLogic’s HTTP App Server

The last step in controlling access to the REST API is to prevent people from bypassing your application and going straight to the REST API’s app server. This can be accomplished simply by having a firewall not allow external access, or you can set the MarkLogic HTTP app server’s address field to “localhost”, thus only allowing requests from the same server that MarkLogic is running on.
