Converting CSV to XML

Author: Dave Cassel  |  Category: Software Development

I just published an XQuery library to GitHub to convert lines of Comma Separated Value data to XML. That can probably be done more simply if you don’t care what the XML looks like, but I wanted to build templates so that I could get good structure. I also built in an action mechanism, so that you can apply functions as the XML gets built. That makes it easy to format dates nicely, do geocoding, or whatever you need.

My GitHub repo has the library in the context of a Roxy framework so that I could write some unit tests. Those tests not only give me confidence that the code works, they also show you how to use the code. Here’s an example template I used in a unit test:

<curio:item>
  <curio:type src="1"/>
  <curio:name src="2"/>
  <curio:location>
    <curio:label src="3"/>
  </curio:location>
  <curio:acquired-date>
    <curio:start src="4" ns="http://davidcassel.net/csv" action="reformat-date"/>
    <curio:end src="5" ns="http://davidcassel.net/csv" action="reformat-date"/>
  </curio:acquired-date>
  <curio:description src="6"/>
  <curio:notes src="7"/>
  <curio:owner src="8"/>
</curio:item>

As you can see, a lot of the elements have a “src” attribute. The number corresponds to a field position in the CSV line. If the corresponding value is empty, the element gets skipped. The “ns” and “action” attributes specify a function that you want to run on an element after it has values put in. (At some point, I’ll try to add an “at” attribute so you can specify the library module; right now it’s assumed to be in csv-lib.xqy.) Although my examples have the action attribute on leaf nodes, it can be higher up the chain.

Here’s an example CSV line:

pin,Pilatus,"Pilatus mountain",7/1/1984,7/31/1984,,"a word, and another",dcassel

The library will handle commas embedded in quoted strings. And for completeness, here’s the generated XML:

<curio:item xmlns:curio="http://davidcassel.net/curio">
  <curio:type>pin</curio:type>
  <curio:name>Pilatus</curio:name>
  <curio:location>
    <curio:label>Pilatus mountain</curio:label>
  </curio:location>
  <curio:acquired-date>
    <curio:start>1984-07-01</curio:start>
    <curio:end>1984-07-31</curio:end>
  </curio:acquired-date>
  <curio:notes>a word, and another</curio:notes>
  <curio:owner>dcassel</curio:owner>
</curio:item>

After you’ve downloaded the repo, here are the steps for you to be able to run the unit tests:

  1. Check whether the ports I’m using conflict with any app servers you already have set up. You’ll find them in deploy/build.properties (app-port, test-port, xcc-port)
  2. ./ml local bootstrap  <– use ml.bat if you’re on Windows
  3. ./ml local restart  <– if necessary — output from step 1 will tell you if you need to restart
  4. ./ml local deploy modules

Assuming all went well, you should now be able to point your browser to http://localhost:8022/test/ and run the csv-lib unit tests.

Tags: , , ,

4 Responses to “Converting CSV to XML”

  1. Arvind Says:

    what about quoted string i.e
    if i have line
    pin,Pilatus,”Pilatus “”quoted string”” mountain”,7/1/1984,7/31/1984,,”a word, and another”,dcassel

  2. Dave Cassel Says:

    Hmm, I don’t think the library accounts for that yet. I’ll add a unit test and get that fixed. Thanks for point it out!

  3. Dixit Says:

    Hi David,

    Thanks providing this library. (csv to XML)
    I got one more scenario.

    While converting to XML, is it possiblt to make sure all the generated XML files follow some particular schema. (.xsd)

    Thanks,
    Dixit

  4. Dave Cassel Says:

    Dixit, yes, that’s possible. See the Loading Schemas chapter of the Application Developer’s Guide, which includes a section on Validating XML Against a Schema.

    Another note: I wrote this post in 2012. Since then, MarkLogic has release the MarkLogic Content Pump, which includes the ability to read CSV files and turn them into flat XML. To provide more interesting structure (ie, to make it hierarchical), you can transform on ingestion. I highly recommend using that approach over using the library in the post.

Leave a Reply