The Technology Details

This page provides the technology details behind my family history project.

I contacted Donna Budzier and sweet-talked her into sending me a small sample of her collection - just my particular branch (Martin Robert Gliedt). This gave me an excuse to play with my scanner and learn a lot about the technologies I would need for the project. A few months later I'd managed to scan and catalog the images. And the rest, as they say, is history. Now, over a year later, I have finished scanning and cataloging some 400 pictures comprising over 40MB.

By this time I had done a great deal of Web work and learned about databases. This family history project consists of thousands of names - all interconnected by a complex set of relationships. It sounded like the perfect excuse for a real database project.

In the spring of 1996, I contacted Donna and asked for the source for her book. "Oh, I think we've lost that," she replied. I was crestfallen. Boy, this was going to be harder than I wanted.

So I went off to the local computer store and picked up OmniPage - an OCR program. I had the insane idea of scanning in the 600 pages of the book and creating text from the images. Well, I learned a lot about how good and bad OCR software is - or at least how far $125 will get you. And let me tell you, that isn't very far.

In many ways OmniPage was terrible - typical Windows software (in my experience). The program died with ease, requiring me to reboot the system to recover. But I kept at it, because OmniPage kept surprising me. After scanning a sample page, OmniPage would eventually present its work and I began my lessons:

Don't attempt a serious OCR project without the fastest processor you can afford. This OCR stuff doesn't take much memory, but it is very CPU intensive. Pentium-level processors or above, please.
OCR is all about character recognition -- and regardless of how good the printed matter is, it is filled with little blobs that make identifying individual characters difficult. So I went through the OmniPage spell checker, added a bunch of "new" valid words (like Gliedt), and suddenly OmniPage could do a better job guessing what the characters were. Cool.
Lots of the names in this document are German, with lots of a-, o- and u-umlauts, and the OCR software was nicely ignoring all those umlauts as "noise". In poking through the menus, I found support for "languages". So I selected "German" (in addition to English) and to my astonishment, the umlauts were found. Really cool, since entering umlauts by hand can be troublesome.
Even with spell check, the software was having a lot of trouble with much of the text. So I put OmniPage into a mode where I could tell the software what character(s) a particular image really should be -- I could "train" the image-to-character mapping. Once again the quality greatly improved.
OmniPage was great in that I could save the resulting text in a variety of convenient formats (Word, WordPerfect, and ASCII text). Very very nice.
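
The spell-checker lesson above can be illustrated with a small sketch. This is not how OmniPage works internally (the page does not say); it is a hypothetical Python example, using only the standard library, of why adding a family name to the word list helps: a misread word can only be snapped to the right spelling if that spelling is in the list. The function name correct_word, the word lists, and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
import difflib


def correct_word(ocr_word, lexicon, cutoff=0.8):
    """Return the closest lexicon word, or the OCR output unchanged.

    Hypothetical helper: a word already in the lexicon is kept as-is;
    otherwise the most similar lexicon entry is used, provided its
    similarity ratio reaches the (assumed) cutoff.
    """
    if ocr_word in lexicon:
        return ocr_word
    matches = difflib.get_close_matches(ocr_word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word


# A stock word list that knows nothing about the family name.
base_lexicon = ["family", "history", "Martin", "Robert"]

# The same list after adding the "new" valid word, as with the spell checker.
custom_lexicon = base_lexicon + ["Gliedt"]

if __name__ == "__main__":
    misread = "Glicdt"  # an 'e' misread as 'c'
    print(correct_word(misread, base_lexicon))
    print(correct_word(misread, custom_lexicon))
```

With the stock list the misread word has nothing close enough to match and passes through unchanged; once "Gliedt" is in the list, the same misreading is corrected.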

Despite this progress, OmniPage still had a high error rate (5% or so). The document was riddled with lots of superscripts which caused OmniPage to do some very strange things with the font sizes. While I was making progress, this was beginning to look like a VERY long project. And then Donna sent me the source to her document.

The original book was written in WordPerfect. I cranked up Word and it read the WP files without a hitch. Word allowed me to save the documents in lots of formats - including HTML (using Internet Assistant for Word). Well, it only took one attempt to realize that this was a joke. The HTML it generated was mostly useless and I lost all sense of the format of the document. So I tried saving the documents in a variety of ASCII formats.

I found that saving the files in "Text Only" format preserved the paragraphs and produced a very consistent looking document. Being a programmer, I knew I could write a program to format this into HTML. So I got out my Perl manual and wrote a program of only 300 lines of code. This program converted the Text Only files into the HTML you are reading. It dynamically identifies the title, body and end-note sections. It finds all the end-note numbers (superscripts) and the "generation numbers" (also superscripts). Within a week, I was able to convert the text files into reasonable HTML. I added support for the figures and suddenly I had graphic pictures too. Major cool - and way past OCR technology.
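
To give a feel for this kind of conversion, here is a minimal sketch in Python. It is not the original Perl program, whose rules are not described here. Every heuristic in it is an assumption made for illustration: that titles are short, single-line, all-uppercase paragraphs; that the end-notes section begins at a title reading ENDNOTES; and that a flattened superscript shows up as one to three digits glued to the end of a word or sentence. Generation numbers and figures are not handled.

```python
import html
import re

# Assumption: a short digit run glued to a letter, period, comma or closing
# parenthesis (e.g. "Ohio.1") is a flattened superscript end-note number.
# This will also catch things like "2.5", so real text would need more care.
NOTE_REF = re.compile(r"(?<=[A-Za-z.,)])(\d{1,3})(?=\s|$)")

# Assumption: each end-note paragraph starts with "<number>. ".
NOTE_DEF = re.compile(r"(\d+)\.\s+(.*)", re.S)


def is_title(paragraph):
    """Assumed heuristic: short, single-line, all-uppercase text is a title."""
    return "\n" not in paragraph and len(paragraph) < 60 and paragraph.isupper()


def text_to_html(text):
    """Convert blank-line-separated plain text into simple HTML."""
    out = []
    in_notes = False
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Escape &, < and > only, so quotes stay as plain characters.
        escaped = html.escape(paragraph, quote=False)
        if is_title(paragraph):
            # Everything after the ENDNOTES title is treated as a note.
            in_notes = paragraph == "ENDNOTES"
            out.append(f"<h2>{escaped}</h2>")
        elif in_notes:
            match = NOTE_DEF.match(escaped)
            if match:
                number, body = match.groups()
                out.append(f'<p id="note{number}">{number}. {body}</p>')
            else:
                out.append(f"<p>{escaped}</p>")
        else:
            # Turn each suspected end-note number into a linked superscript.
            linked = NOTE_REF.sub(r'<sup><a href="#note\1">\1</a></sup>', escaped)
            out.append(f"<p>{linked}</p>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        "CHAPTER ONE\n\n"
        "Martin was born in Ohio.1 He moved west.\n\n"
        "ENDNOTES\n\n"
        "1. Family Bible."
    )
    print(text_to_html(sample))
```

The sketch works one paragraph at a time because the Text Only format preserves paragraph breaks, which makes the blank line the most reliable structural signal left in the file.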

Continue with the family history project.