Scanning and Archival

I have a bunch of papers that are not valuable, but are rare and hard to find. Some I have only as faded Xeroxes, such as a copy of Anton Kutter’s treatise on Schiefspiegler telescopes, and Arnold Leonard’s work on Yolo telescopes. Others are in collections which are out of print, such as Tom Duff’s Polygon scan conversion by exact convolution. I’ve recently decided that I should scan papers that I think are interesting and preserve them in some digital form.

At the Hackers Conference, Brewster Kahle touted the DjVu format as an alternative to PDF. For fun, I decided to use the tools I had on hand to convert my bad Xerox of Tom’s paper (15 pages) into a compact, reasonable form. I slopped all 15 pages through my super budget Canon scanner and quickly converted them into 300 dpi bilevel TIFF files. Each file was 1,053,264 bytes.

The DjVu tools are open source; I got them by installing the djvulibre package on FreeBSD. The program cjb2 provides bilevel image compression of PBM files, so I made a little script that converted each page into a PBM file and then into a .djvu file. I told cjb2 that the compression could be lossy and that it should remove flecks, and then assembled the pages into one document using the djvm program.
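A rough Python sketch of that kind of script follows. It is a reconstruction, not the original script. It assumes netpbm's tifftopnm does the TIFF-to-PBM step (the post doesn't say which converter was used), that the page files sort into reading order by name, and that cjb2's -lossy and -clean options correspond to the "lossy" and "remove flecks" settings. The file names in the usage note are hypothetical.

```python
import subprocess
from pathlib import Path


def tiff_to_pbm_cmd(tiff):
    # Assumption: netpbm's tifftopnm, which writes the converted image to stdout.
    return ["tifftopnm", str(tiff)]


def encode_cmd(pbm, djvu, dpi=300):
    # cjb2 from DjVuLibre: -lossy allows lossy pattern matching,
    # -clean removes small flyspecks, -dpi records the scan resolution.
    return ["cjb2", "-dpi", str(dpi), "-lossy", "-clean", str(pbm), str(djvu)]


def bundle_cmd(pages, out):
    # djvm -c creates a multi-page document from single-page DjVu files.
    return ["djvm", "-c", str(out)] + [str(p) for p in pages]


def build(tiffs, out):
    """Convert each TIFF page to DjVu, then bundle the pages into `out`."""
    pages = []
    for tiff in sorted(Path(t) for t in tiffs):  # assumes names sort in page order
        pbm = tiff.with_suffix(".pbm")
        djvu = tiff.with_suffix(".djvu")
        with open(pbm, "wb") as fh:
            subprocess.run(tiff_to_pbm_cmd(tiff), stdout=fh, check=True)
        subprocess.run(encode_cmd(pbm, djvu), check=True)
        pages.append(djvu)
    subprocess.run(bundle_cmd(pages, out), check=True)
```

With pages saved as, say, page-01.tif through page-15.tif in a scans directory, calling build(Path("scans").glob("page-*.tif"), "paper.djvu") would run the whole pipeline, provided tifftopnm, cjb2 and djvm are on the PATH.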

The resulting file was 279,530 bytes for all 15 pages.
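To put that number in perspective, here is a quick back-of-the-envelope calculation using only the sizes quoted above. The variable names are just for illustration.

```python
TIFF_BYTES_PER_PAGE = 1_053_264  # size of each 300 dpi bilevel TIFF
PAGES = 15
DJVU_BYTES = 279_530             # the bundled DjVu file, before OCR

tiff_total = TIFF_BYTES_PER_PAGE * PAGES  # total size of the scanned TIFFs
ratio = tiff_total / DJVU_BYTES           # overall compression ratio
per_page = DJVU_BYTES / PAGES             # average DjVu bytes per page

print(f"TIFF total: {tiff_total:,} bytes")
print(f"Compression ratio: about {ratio:.1f} to 1")
print(f"Average DjVu page: about {per_page:,.0f} bytes")
```

The TIFFs add up to 15,798,960 bytes, so the DjVu file is roughly 56 times smaller, at a bit under 19 KB per page.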

That wasn’t quite good enough, though, so I decided to go ahead and use the online any2djvu server to perform OCR on the DjVu file and stash the text back inside. The resulting file with OCR is 313,192 bytes, and can be searched.

I then tried to make a PDF file out of it. I converted the DjVu file into PostScript (using the djvups program) and then used Ghostscript to convert it to a PDF file. The result wasn’t pretty: the PostScript file was 4,268,709 bytes and the resulting PDF file was 3,184,680 bytes, roughly a tenfold increase over the DjVu file. I have no doubt that Adobe Distiller could do a better job, but then, I don’t have Adobe Distiller.
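For reference, the PDF route can be sketched in Python like this. The djvups step is the one named above. Using Ghostscript's ps2pdf wrapper for the second step is an assumption, since the post only says Ghostscript was used, and the file names are hypothetical. The size comparison at the end uses the byte counts quoted in this post.

```python
import subprocess


def djvu_to_ps_cmd(djvu, ps):
    # djvups from DjVuLibre: converts a DjVu document to PostScript.
    return ["djvups", djvu, ps]


def ps_to_pdf_cmd(ps, pdf):
    # Assumption: Ghostscript's ps2pdf wrapper with default settings.
    return ["ps2pdf", ps, pdf]


def djvu_to_pdf(djvu, ps, pdf):
    """Run both conversion steps; raises if either tool fails."""
    subprocess.run(djvu_to_ps_cmd(djvu, ps), check=True)
    subprocess.run(ps_to_pdf_cmd(ps, pdf), check=True)


# Sizes reported in the post, in bytes.
DJVU_OCR_BYTES = 313_192
PS_BYTES = 4_268_709
PDF_BYTES = 3_184_680

ps_ratio = PS_BYTES / DJVU_OCR_BYTES    # PostScript vs. DjVu with OCR
pdf_ratio = PDF_BYTES / DJVU_OCR_BYTES  # PDF vs. DjVu with OCR

print(f"PostScript is about {ps_ratio:.1f}x the DjVu size")
print(f"PDF is about {pdf_ratio:.1f}x the DjVu size")
```

The PDF comes out at a little over 10 times the size of the OCR'd DjVu file, and the intermediate PostScript at nearly 14 times.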

Anyway, I thought it was a fun experiment. You can have a peek at the resulting DjVu file if you like. You can either install the viewer or, if you have a Windows machine, install the plugin.