MarkLogic and OCR

Author: Dave Cassel  |  Category: Software Development

This is pretty rough, but I’m tired of it being a draft. Comment if something’s unclear and I’ll clean it up. 

This tutorial shows how to run OCR from MarkLogic. Specifically, we’ll use MLJAM to call Tess4J, a Java wrapper around the open-source tesseract-ocr engine. The tutorial assumes you’re running MarkLogic on Red Hat Enterprise Linux.

Background

MarkLogic has a lot of features, including xdmp:document-filter(), a function that extracts text and metadata from binary documents. However, some documents contain only scanned images, so there is no text to extract until OCR has been run on them.

MLJAM is a library that lets you run Java code from XQuery.

Configuration

There are several parts to this setup. I’m putting my files under ~/software/. Adjust accordingly if you use a different directory.

leptonica

leptonica is an image processing library that tesseract-ocr depends on, so we’ll build it first.

  • Download from leptonica.org
  • $ tar xf leptonica-1.69.tar.gz
  • $ cd leptonica-1.69
  • $ ./configure
  • $ make
  • $ sudo make install
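
If you want to confirm the install before moving on, here is a minimal sketch. has_shared_lib is a hypothetical helper of mine (not part of leptonica), and it assumes the default /usr/local prefix, which is where a plain ./configure installs.

```shell
# Hypothetical helper: succeed if at least one file named lib<name>*
# exists in <dir>. Assumes the library was installed with the default prefix.
has_shared_lib() {
  dir=$1
  name=$2
  for f in "$dir"/lib"$name"*; do
    # When the glob matches nothing it stays literal, so test for existence.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Usage (not run here):
#   has_shared_lib /usr/local/lib lept && echo "leptonica found"
```

The same check works later for tesseract-ocr by passing tesseract instead of lept.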

tesseract-ocr

This is the core piece that actually does the work; the other pieces just provide access to it.

  • Download tesseract-ocr-3.02.02.tar.gz
  • Download tesseract-ocr-3.02.eng.tar.gz (or another language pack; you’ll need at least one)
  • $ tar xf tesseract-ocr-3.02.02.tar.gz
  • $ tar xf tesseract-ocr-3.02.eng.tar.gz
  • $ cd tesseract-ocr
  • $ ./autogen.sh
  • $ ./configure
  • $ make
  • $ sudo make install
  • $ sudo cp tessdata/eng.traineddata /usr/local/share/tessdata/

This last command puts the English language-specific data in a place where tesseract-ocr will find it.
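
A missing language file is an easy mistake to make here, so a quick check can save some head-scratching later. This is a rough sketch: check_tessdata is a hypothetical helper, and it assumes tesseract-ocr looks under $TESSDATA_PREFIX/tessdata, falling back to the /usr/local/share prefix used in the cp command above.

```shell
# Hypothetical helper: report whether <lang>.traineddata is in the
# tessdata directory. The default prefix is an assumption that matches
# the install location used in this tutorial.
check_tessdata() {
  lang=${1:-eng}
  prefix=${TESSDATA_PREFIX:-/usr/local/share}
  file=$prefix/tessdata/$lang.traineddata
  if [ -f "$file" ]; then
    echo "found $file"
  else
    echo "missing $file" >&2
    return 1
  fi
}

# Usage (not run here):
#   check_tessdata eng
```

Once the check passes, running tesseract scan.png out -l eng (where scan.png is a placeholder for any image you have handy) should write the recognized text to out.txt.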

Tess4J

This is a Java JNA wrapper around tesseract-ocr.

  • Download Tess4J
  • Extract it under ~/software/ (later steps assume ~/software/Tess4J)
  • $ cd ~/software/Tess4J
  • $ export LD_LIBRARY_PATH=/usr/local/lib
  • $ export TESSDATA_PREFIX=/usr/local/share
  • $ ant test

At this stage, the tests had a couple of problems for me (one error and one failure). That’s not ideal, but it turned out to be good enough. If all the tests fail, then something is wrong.

Tomcat

  • Download tomcat
  • I typically run on a remote server where only ports 8001-8099 are unblocked, and I want to keep those ports available for MarkLogic, so I move Tomcat out of that range. In conf/server.xml, I change 8080 to 8180 and 8005 to 8105.
  • Tomcat’s default timeout is fairly short. That’s appropriate for normal web responses, but running OCR on larger documents takes longer, so I raise it to 10 minutes. In server.xml, the connectionTimeout attribute on the Connector element starts at 20000 milliseconds (20 seconds); I change the value to 600000.
  • Tomcat lets you create bin/setenv.sh to hold environment settings. We need two: LD_LIBRARY_PATH, so that leptonica and tesseract-ocr are visible (both install to /usr/local/lib, which isn’t on the library search path on most systems), and TESSDATA_PREFIX, which the tesseract-ocr library uses to find its language data.
export LD_LIBRARY_PATH=/usr/local/lib
export TESSDATA_PREFIX=/usr/local/share
  • Start up Tomcat
$ bin/startup.sh
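
If you’d rather script the server.xml changes than make them by hand, something like this should do it. Treat it as a sketch: retune_server_xml is a hypothetical helper, and it assumes your server.xml still has the stock values (8080, 8005, and 20000). It only touches the two ports and the timeout mentioned above, and it leaves a .bak copy of the original next to the file.

```shell
# Hypothetical helper: move Tomcat off ports 8080/8005 and raise the
# connector timeout to 10 minutes. Assumes the stock attribute values.
retune_server_xml() {
  conf=$1
  # Keep the original so the change is easy to undo.
  cp "$conf" "$conf.bak" || return 1
  sed -e 's/port="8080"/port="8180"/' \
      -e 's/port="8005"/port="8105"/' \
      -e 's/connectionTimeout="20000"/connectionTimeout="600000"/' \
      "$conf.bak" > "$conf"
}

# Usage (not run here; run it before starting Tomcat):
#   retune_server_xml "$CATALINA_HOME/conf/server.xml"
```

Reading from the backup and writing to the real file avoids depending on sed -i, which behaves differently across platforms.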

MLJAM

MLJAM has two parts: a Tomcat plugin and some XQuery code. The Tomcat plugin lets clients send Java code to Tomcat for evaluation. The XQuery code provides functions to interact with Tomcat.

  • Download from MISSING LINK and extract it in ~/software/mljam/.
  • Copy the MLJAM files to $CATALINA_HOME/webapps/ (my Tomcat lives in /opt/apache-tomcat-7.0.35):

$ cp -r ~/software/mljam/mljam /opt/apache-tomcat-7.0.35/webapps

  • The MLJAM zip has two .xqy files, jam.xqy and jam-utils.xqy. Add them to the MarkLogic application that needs to call OCR. (In my case, I added them to a Roxy project at /src/app/models/.) If your app server reads its code from a modules database, deploy the code to your MarkLogic server. If it reads code from the filesystem, copying the .xqy files into the app server’s directory is enough.
  • Copy ~/software/Tess4J/dist/* and ~/software/Tess4J/lib/* to $CATALINA_HOME/webapps/mljam/WEB-INF/lib
  • Restart Tomcat so it notices the libraries (I’m sure there’s a way to trigger Tomcat to do this without needing a restart, but I didn’t look into what it is).
$ cd $CATALINA_HOME
$ sudo bin/shutdown.sh
$ sudo bin/startup.sh
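
For the copy step above, here is one way to script it and see what ended up in the webapp. It’s a sketch under a couple of assumptions: install_tess4j_jars is a hypothetical helper, and it copies only the .jar files, on the assumption that those are all Tomcat needs from the dist and lib directories. If that turns out to be wrong for your Tess4J version, fall back to copying everything as described in the list.

```shell
# Hypothetical helper: copy the .jar files from Tess4J's dist/ and lib/
# directories into the MLJAM webapp, then list what landed there.
install_tess4j_jars() {
  tess4j_home=$1
  webapp_lib=$2
  mkdir -p "$webapp_lib" || return 1
  for dir in "$tess4j_home/dist" "$tess4j_home/lib"; do
    for jar in "$dir"/*.jar; do
      # Skip the literal pattern left behind when a glob matches nothing.
      [ -e "$jar" ] || continue
      cp "$jar" "$webapp_lib/" || return 1
    done
  done
  ls "$webapp_lib"
}

# Usage (not run here; do this before restarting Tomcat):
#   install_tess4j_jars ~/software/Tess4J "$CATALINA_HOME/webapps/mljam/WEB-INF/lib"
```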

Calling Tesseract

Here’s an example of calling MLJAM with Tesseract after everything is set up:

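xquery version "1.0-ml";
import module namespace jam = "http://xqdev.com/jam" at "/app/models/jam.xqy";

(: Runs OCR on a file that lives on the Tomcat host and returns the text. :)
declare function local:do-ocr() as xs:string
{
  (: The quoted block is Java; MLJAM sends it to Tomcat for evaluation. :)
  jam:eval('
    import java.io.File;
    import net.sourceforge.tess4j.*;
    // Alternative test image:
    //File imageFile = new File("/home/dcassel/Tess4J/eurotext.png");
    File imageFile = new File("/home/dcassel/PKT_VIEW8112cae1_561180.pdf");
    Tesseract instance = Tesseract.getInstance();
    // Pick up any ImageIO plugins on the classpath.
    javax.imageio.ImageIO.scanForPlugins();
    // new File() never returns null, so test that the file exists instead.
    if (!imageFile.exists())
      result = "File not found";
    else if (instance == null)
      result = "instance is null";
    else
      result = instance.doOCR(imageFile);
  '),
  (: Fetch the Java variable "result" back into XQuery. :)
  xs:string(jam:get("result"))
};

(: Open an MLJAM context on Tomcat (port 8180, as configured above), run OCR, then clean up. :)
jam:start("http://localhost:8180/mljam/mljam", "", ""),
local:do-ocr(),
jam:end()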

Ta-da!

2 Responses to “MarkLogic and OCR”

  1. sam Says:

    I did everything up to the last section (Calling Tesseract) and got stuck…

    Could you guide me on how to test this setup? Meaning, do I need to copy this code into some .xqy file and then call jam:start?

  2. Dave Cassel Says:

    sam, you can run that XQuery code in Query Console to try it out. Just paste everything in that section, from xquery version "1.0-ml" to jam:end(), into a QC buffer and run it.
