Splitting data with Info Studio

Author: Dave Cassel  |  Category: Software Development

Today we’ll play “Spot That Mistake”. This will probably be really easy for some but it threw me off for a bit.

My goal

I wanted to bring some data into MarkLogic Server just to give me something to work with while exploring some features. I grabbed an XML file that represented the geocaches that I’ve found. The file consists of a single <gpx:gpx> root node with 131 <gpx:wpt> child elements, each representing a cache.

I decided to use MarkLogic Server’s Information Studio feature to bring the data in, in part because I wanted to see how easy it would be to use that feature to split the data. I wanted each <gpx:wpt> element to become its own file. (For the record, since I wanted to split up this one file, the better approach would have been RecordLoader or even some quick XQuery in CQ; however, I specifically wanted to try out Information Studio.)

Information Studio

Setting up Information Studio to load data is a three-step process: you tell it how to collect the data, what transformations (if any) to apply, and where to execute the load (into which database).

The collection stage was pretty simple. I used a Directory Loader, pointed it to the directory where my .gpx file lives, and told it to load the .gpx files it found. Since .gpx is not an extension that MarkLogic Server knows about by default, I also specified that the file should be loaded as an XML file.

Likewise, the load stage was quite simple. I pointed Information Studio to the database I’d set up for this purpose, and I changed the URI where each document would be stored so that it would go into an “/upload/” directory.

The transformation was where I figured I would do the splitting. I added a custom XQuery step, added the namespaces I needed, and put in a FLWOR expression to call xdmp:document-insert() for each of the <gpx:wpt> nodes it found. Here is my original try:

xquery version "1.0-ml";
(: Copyright 2002-2011 MarkLogic Corporation.  All Rights Reserved. :)

(:
:: Custom action.  It must be a CPF action module.
:: Replace this text completely, or use it as a template and 
:: add imports, declarations,
:: and code between START and END comment tags.
:: Uses the external variables:
::    $cpf:document-uri: The document being processed
::    $cpf:transition: The transition being executed
:)

import module namespace cpf = "http://marklogic.com/cpf"   at "/MarkLogic/cpf/cpf.xqy";

(: START custom imports and declarations; imports must be in Modules :)
declare namespace gpx ="http://www.topografix.com/GPX/1/0";
(: END custom imports and declarations :)

declare option xdmp:mapping "false";

declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as node() external;

if ( cpf:check-transition($cpf:document-uri,$cpf:transition)) then
    try {
       (: START your custom XQuery here :)
        xdmp:log(fn:concat("transforming ", $cpf:document-uri,
            "; transition=", xdmp:quote($cpf:transition))),
        let $doc := fn:doc($cpf:document-uri)
        return
            for $wpt in $doc//gpx:wpt
            return
                xdmp:document-insert(
                    fn:concat("/content/", $wpt/gpx:name),
                    $wpt
                )
       (: END your custom XQuery here :)
       ,
       cpf:success( $cpf:document-uri, $cpf:transition, () )
    }
    catch ($e) {
       cpf:failure( $cpf:document-uri, $cpf:transition, $e, () )
    }
else ()

The problem

When I ran it, I got an unexpected result: nothing. No 131 documents, no errors, nothing. I used CQ to verify that my namespaces and XPath were correct, and they were. In fact, running the same code in CQ did exactly what I wanted.

Can you spot the error? I’ll pause while you think about it.

The solution

The answer to this one is simply that the code above was not executing against the database I’d specified in the Load step, but against the Fab database that Information Studio uses for its bookkeeping. My geocache documents had been inserted into that database. I solved this problem by changing the code above to use an eval statement against my target database. That did the trick.

        let $doc := fn:doc($cpf:document-uri)
        return xdmp:eval('
            declare namespace gpx ="http://www.topografix.com/GPX/1/0";
            declare variable $doc external;
            for $wpt in $doc//gpx:wpt
            return
                xdmp:document-insert(
                    fn:concat("/content/", $wpt/gpx:name),
                    $wpt
                )',
            (xs:QName("doc"), $doc),
            <options xmlns="xdmp:eval">
                <database>{xdmp:database("geocaching")}</database>
            </options>)
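
A quick way to confirm which database a piece of transform code is actually executing against is to log it. Here’s a minimal sketch using the standard xdmp:database and xdmp:database-name built-ins (the log message text is my own addition, not from the original transform):

```xquery
(: Log the name of the database this code is executing against.
   Dropped into the custom transform step, this would have shown
   Information Studio's Fab database rather than my target database. :)
xdmp:log(fn:concat("transform executing against: ",
    xdmp:database-name(xdmp:database())))
```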

As I mentioned earlier, this isn’t the approach I’d actually use for this particular problem. But I did learn that if I want to access my data during a transformation, I should use an eval to make sure I’m hitting the right database.


3 Responses to “Splitting data with Info Studio”

  1. Geert Says:

    Actually, derived documents should be inserted into the current database (which is Fab) to enable Information Studio to do its bookkeeping properly. To do that bookkeeping properly you should replicate the info: document properties, overriding the info:source-location one with a value pointing to the current document. You should also copy the collections. This allows the unload feature to include these derived documents as well.

    Another way is to split inside the collector. There are some good examples here: https://github.com/marklogic/infostudio-plugins

  2. Gnanaprakash Bodireddy Says:

    It works, but how about unloading documents?
    I tried your approach and it loaded documents into my target database. But when I tried to unload, it did not unload the chunked documents.

  3. Dave Cassel Says:

    Gnanaprakash,

    As Geert noted, InfoStudio knows which documents to unload by marking them with a collection. A simple approach is to mark any derived documents with the same collections as the source document. This will include a collection named for the ticket id of the load, causing derived documents to be deleted on unload, along with the source documents.
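
Building on the reply above, here is a hedged sketch of what copying the source document’s collections onto each derived document might look like. xdmp:document-get-collections, xdmp:default-permissions, and the four-argument form of xdmp:document-insert are standard MarkLogic built-ins; this assumes the code runs in the database where the source document and its collections live:

```xquery
(: A sketch, not from the original post: copy the source document's
   collections (which include Information Studio's ticket-id collection)
   onto each derived document, so that unload removes them as well. :)
let $collections := xdmp:document-get-collections($cpf:document-uri)
for $wpt in fn:doc($cpf:document-uri)//gpx:wpt
return
    xdmp:document-insert(
        fn:concat("/content/", $wpt/gpx:name),
        $wpt,
        xdmp:default-permissions(),
        $collections
    )
```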
