brainwagon

A quick test of ocropus

June 23, 2007 | General | By: Mark VandeWettering

If you wandered into my office, you’d probably be shocked by the vast amount of just raw paper I have lying around. I scribble notes, I use photocopy machines, I print stuff all the time. And I have lots of books and magazine articles I’ve clipped over the years.

I’d like to do some optical character recognition and digitize these papers, but I am frankly too cheap to buy OCR software, and besides, most of it runs on that Microsoft Operating System that Shall Not Be Named, and I have sworn not to give Microsoft any more money. I’ve tried a couple of open source solutions like gocr, but the results were frankly terrible. On the Google Blog, I recently read that Google had used the tesseract engine and had created a suite of applications called ocropus that could be used for the purpose.

ocropus – Google Code

I did a quick test. I scanned a page from an old Sky & Telescope article describing a lensless Wright Camera. The page contains inset boxes and graphics, and I had found it fairly challenging to extract much of anything from it. To run ocropus, I simply typed the command “ocropus ocr testocropus.png > testocropus.html” and got this result. It isn’t perfect, but it’s not that bad, and it probably represents a pretty significant reduction in labor (at least, given the speed at which I can type) in getting a reasonably accurate representation of fairly difficult text.
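With a whole office full of paper, typing that command once per page would get old fast. Here is a minimal sketch of how the same command could be run over a batch of scans from Python. The only thing taken from the test above is the "ocropus ocr <image>" invocation with its output redirected to a file; the function names and the rule of writing each result next to its scan with an .html extension are my own assumptions, not anything ocropus provides.

```python
import subprocess
from pathlib import Path


def build_ocropus_command(image_path):
    """Return the argument list for the command quoted in the post."""
    return ["ocropus", "ocr", str(image_path)]


def html_path_for(image_path):
    """Hypothetical naming rule: scan.png -> scan.html, next to the scan."""
    return Path(image_path).with_suffix(".html")


def ocr_scans(image_paths, run=subprocess.run):
    """Run the OCR command on each scan, writing its stdout to an .html file.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without ocropus installed.
    """
    written = []
    for image_path in image_paths:
        out_path = html_path_for(image_path)
        with open(out_path, "wb") as out_file:
            # Equivalent of the shell redirect "> testocropus.html".
            run(build_ocropus_command(image_path), stdout=out_file, check=True)
        written.append(out_path)
    return written
```

Calling ocr_scans(["testocropus.png"]) should then be equivalent to the one-liner above, assuming the ocropus binary is on the PATH. Because check=True is passed, a failing ocropus run raises an exception instead of quietly leaving an empty file behind.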

If you have some OCR tasks, you might think about giving it a shot. I will be doing some more experiments, and will keep you all posted.


Addendum: A second test, with a scan of a less challenging paper done at the recommended 300 dpi, and the resulting output.
