The Technology Details

This provides the details for my family history project.

I contacted Donna Budzier and sweet-talked her into sending me a small sample of her collection - just my particular branch (Martin Robert Gliedt). This gave me an excuse to play with my scanner and learn a lot about the technologies I would need for the project. A few months later I'd managed to scan and catalog the images. And the rest, as they say, is history. Now, over a year later, I have completed scanning and cataloging some 400 pictures comprising over 40MB.

By this time I had done a great deal of Web work and learned about databases. This family history project consists of thousands of names - all interconnected by a complex set of relationships. It sounded like the perfect excuse for a real database project.

In the spring of 1996, I contacted Donna and asked for the source for her book. "Oh, I think we've lost that", she replied. I was crestfallen. Boy, this is going to be harder than I wanted.

So I was off to the local computer store, where I picked up OmniPage - an OCR scanning program. I had the insane idea of scanning in 600 pages of the book and creating text from the images. Well, I learned a lot about how good and bad OCR software is - at least how far $125 will get you. And let me tell you, that isn't very far.

In many ways OmniPage was terrible - typical Windows software (in my experience). The program died with ease, requiring me to reboot the system to recover. But I kept at it, because OmniPage kept surprising me. After scanning a sample page, OmniPage would eventually present its work and I began my lessons:

  • Don't attempt a serious OCR project without the fastest processor you can afford. This OCR stuff doesn't take so much memory, but it is very CPU intensive. Pentium-level processors or above, please.
  • OCR is all about character recognition -- and regardless of how good the printed matter is, it is filled with little blobs that make identifying individual characters difficult. So I went through the OmniPage spell checker, added a bunch of "new" valid words (like Gliedt), and suddenly OmniPage could do a better job of guessing what the characters were. Cool.
  • Lots of the names in this document are German - lots of a-, o- and u-umlauts - and the OCR software was nicely ignoring all those umlauts as "noise". In poking through the menus, I found support for "languages". So I selected "German" (in addition to English) and to my astonishment, the umlauts were found. Really cool. Entering umlauts can be troublesome.
  • Even with spell check, the software was having a lot of trouble with much of the text. So I put OmniPage into a mode where I could tell the software what character(s) a particular image really should be -- I could "train" the image-to-character mapping. Once again the quality greatly improved.
  • OmniPage was great in that I could save the resulting text in a variety of convenient formats (Word, WordPerfect, and ASCII text). Very very nice.
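Why would adding "Gliedt" to the spell checker help the recognizer? I don't know how OmniPage works inside, but the general idea can be sketched in a few lines of Python: when a recognized word isn't in the word list, snap it to the closest word that is. The function name, the word lists and the 0.6 similarity cutoff below are all made up for illustration; this is not OmniPage's algorithm.

```python
import difflib


def correct_word(word, lexicon, cutoff=0.6):
    """Return the lexicon entry closest to an OCR'd word, or the word unchanged.

    Hypothetical helper: 'lexicon' plays the role of the spell-check word list.
    """
    if word in lexicon:
        return word
    # get_close_matches ranks candidates by difflib's similarity ratio
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Assumed word lists, for illustration only
base_lexicon = ["family", "history", "project"]
extended_lexicon = base_lexicon + ["Gliedt"]

# "Gl1edt" stands in for a misread where the "i" came out as a "1"
print(correct_word("Gl1edt", base_lexicon))
print(correct_word("Gl1edt", extended_lexicon))
```

With only the base word list the misread word is left alone, because nothing in the list is close enough; once the surname is in the list, the nearest match is the right one.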

Despite this progress, OmniPage still had a high error rate (5% or so). The document was riddled with lots of superscripts which caused OmniPage to do some very strange things with the font sizes. While I was making progress, this was beginning to look like a VERY long project. And then Donna sent me the source to her document.

The original book was done in WordPerfect. I cranked up Word and it read the WP files without a hitch. Word allowed me to save the documents in lots of formats - including HTML (using Internet Assistant for Word). Well, it only took one attempt to realize that this was a joke. The HTML it generated was mostly useless and I lost all sense of the format of the document. So I tried saving the documents in a variety of ASCII formats.

I found that saving the files in "Text Only" format preserved the paragraphs and produced a very consistent looking document. Being a programmer, I knew I could write a program to format this into HTML. So I got out my Perl manual and wrote a program of only 300 lines of code. This program converts the Text Only files into the HTML you are reading. It dynamically identifies the title, body and end-notes sections. It finds all the end-note numbers (superscripts) and the "generation numbers" (also superscripts). Within a week, I was able to convert the text files into reasonable HTML. I added support for the figures and suddenly I had graphic pictures too. Major cool - and way past OCR technology.
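To give a flavor of that kind of conversion, here is a much smaller sketch in Python (not the original Perl program). It leans on assumptions of my own for illustration: paragraphs are separated by blank lines, a title is a short single line in all capitals, a title reading ENDNOTES starts the end-notes section, and a note or generation number is a run of digits glued to the end of a word or to a period or comma. The sample text is invented, and the real files may well follow different conventions.

```python
import html
import re

# Assumption: a former superscript shows up as digits glued to a word or to "." / ","
NOTE_REF = re.compile(r"(?<=[A-Za-z.,])(\d+)\b")


def is_title(paragraph):
    # Assumption: titles are short, single-line and all uppercase
    return "\n" not in paragraph and len(paragraph) < 60 and paragraph.isupper()


def text_to_html(text):
    """Convert blank-line-separated plain text into simple HTML."""
    out = []
    in_notes = False
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        # quote=False keeps numeric entities (and their digits) out of the text
        escaped = html.escape(block, quote=False)
        if is_title(block):
            in_notes = block == "ENDNOTES"
            out.append(f"<h2>{escaped}</h2>")
        elif in_notes:
            # End-note paragraphs are passed through without superscript marking
            out.append(f'<p class="endnote">{escaped}</p>')
        else:
            marked = NOTE_REF.sub(r"<sup>\1</sup>", escaped)
            out.append(f"<p>{marked}</p>")
    return "\n".join(out)


# Invented sample text, for illustration only
sample = """SAMPLE CHAPTER

John Example3 settled on a farm.12

ENDNOTES

12. A made-up source."""

print(text_to_html(sample))
```

The digit rule is only a heuristic: it would also turn the "5" in "1.5" into a superscript, so a real converter needs stricter rules or a manual review pass.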

Continue with the family history project.