If you wandered into my office, you’d probably be shocked by the vast amount of just raw paper I have lying around. I scribble notes, I use photocopy machines, I print stuff all the time. And I have lots of books and magazine articles I’ve clipped over the years.
I’d like to do some optical character recognition and digitize these papers, but I am frankly too cheap to buy OCR software, and besides, most of it runs on that Microsoft Operating System that Shall Not Be Named, and I have sworn not to give Microsoft any more money. I’ve tried a couple of open source solutions like gocr, but they were really awful. On the Google Blog, I recently read that Google had used the tesseract engine and had created a suite of applications called ocropus that could be used for the purpose.
I did a quick test. I scanned a page from an old Sky and Telescope article describing a lensless Wright Camera. It contains inset boxes and graphics, and I had found it fairly challenging to extract almost anything from it. To run ocropus, I simply typed the command “ocropus ocr testocropus.png > testocropus.html” and got this result. It isn’t perfect, but it’s not that bad, and it probably represents a pretty significant reduction in labor (at least given the speed at which I can type) in getting a reasonably accurate representation of fairly difficult text.
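With a whole office full of paper, typing that command once per page gets old quickly. Here is a minimal sketch in Python of how the same command might be run over a folder of scans. It assumes the `ocropus` binary is on your PATH and accepts the `ocr` subcommand exactly as typed above. The function names (`build_command`, `html_path_for`, `ocr_directory`) are hypothetical, made up for this illustration, and are not part of ocropus.

```python
import subprocess
from pathlib import Path


def build_command(scan: Path) -> list[str]:
    """Return the argv for the command quoted above: ocropus ocr <scan>."""
    # Assumption: "ocropus" is on PATH and takes the "ocr" subcommand.
    return ["ocropus", "ocr", str(scan)]


def html_path_for(scan: Path) -> Path:
    """Map testocropus.png to testocropus.html, next to the scan."""
    return scan.with_suffix(".html")


def ocr_directory(directory: Path, run=subprocess.run) -> list[Path]:
    """Run the OCR command on every PNG in `directory`.

    Writes one HTML file per scan and returns the list of files written.
    `run` is injectable so the loop can be exercised without ocropus installed.
    """
    written = []
    for scan in sorted(directory.glob("*.png")):
        target = html_path_for(scan)
        # Equivalent of the shell redirect "> testocropus.html".
        with open(target, "wb") as out:
            run(build_command(scan), stdout=out, check=True)
        written.append(target)
    return written
```

Calling `ocr_directory(Path("scans"))` would leave a `.html` file beside each `.png` in a folder named `scans` (again, a made-up name). Because `check=True` is passed, a page that ocropus fails on stops the batch with an exception instead of silently leaving an empty file that looks like a result.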
If you have some OCR tasks, you might think about giving it a shot. I will be doing some more experiments, and will keep you all posted.
Addendum: I ran a second test on a scan of a less challenging paper, done at the recommended 300 dpi, and got the resulting output.