The Technology Details
This page provides the technical details behind my family history project.
I contacted Donna Budzier and sweet-talked her into sending me a small sample of her collection - just my particular branch (Martin Robert Gliedt). This gave me an excuse to play with my scanner and learn a lot about the technologies I would need for the project. A few months later I'd managed to scan and catalog the images. And the rest, as they say, is history. Now, over a year later, I have completed scanning and cataloging some 400 pictures comprising over 40MB.
By this time I had done a great deal of Web work and learned about databases. This family history project consists of thousands of names - all interconnected by a complex set of relationships. It sounded like the perfect excuse for a real database project.
In the spring of 1996, I contacted Donna and asked for the source for her book. "Oh, I think we've lost that," she replied. I was crestfallen. Boy, this was going to be harder than I wanted.
So I was off to the local computer store and picked up OmniPage - an OCR scanning program. I had the insane idea of scanning in 600 pages of the book and creating text from the images. Well, I learned a lot about how good and bad OCR software is - at least how far $125 will get you. And let me tell you, that isn't very far.
In many ways OmniPage was terrible - typical Windows software (in my experience). The program died with ease, requiring me to reboot the system to recover. But I kept at it, because OmniPage kept surprising me. After scanning a sample page, OmniPage would eventually present its work and I began my lessons:
Despite this progress, OmniPage still had a high error rate (5% or so). The document was riddled with superscripts, which caused OmniPage to do some very strange things with the font sizes. While I was making progress, this was beginning to look like a VERY long project. And then Donna sent me the source to her document.
The original book was done in WordPerfect. I cranked up Word and it read the WP files without a hitch. Word allowed me to save the documents in lots of formats - including HTML (using Internet Assistant for Word). Well, it only took one attempt to realize that this was a joke. The HTML it generated was mostly useless and I lost all sense of the format of the document. So I tried saving the documents in a variety of ASCII formats.
I found that saving the files in "Text Only" format preserved the paragraphs and produced a very consistent-looking document. Being a programmer, I knew I could write a program to format this into HTML. So I got out my Perl manual and wrote a program of only 300 lines of code. This program converted the Text Only files into the HTML you read. It dynamically identifies the title, body and end-note sections. It finds all the end-note numbers (superscripts) and the "generation numbers" (also superscripts). Within a week, I was able to convert the text files into reasonable HTML. I added support for the figures and suddenly I had graphic pictures too. Major cool - and way past OCR-technology.
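For the curious, here is a rough idea of what that kind of conversion looks like. This is not my Perl program. It is a much smaller sketch in Python, and the file conventions in it are assumptions made up for the example: paragraphs separated by blank lines, the first paragraph treated as the title, a paragraph reading "ENDNOTES" marking the start of the end-notes, and a former superscript showing up as digits glued to the end of a word (as in "Gliedt3"). The names text_to_html and link_note_refs are likewise hypothetical, and the generation numbers are not handled at all.

```python
import html
import re

# Assumed conventions (hypothetical, for illustration only):
#   - paragraphs are separated by blank lines
#   - the first paragraph is the document title
#   - a paragraph reading "ENDNOTES" starts the end-notes section
#   - a former superscript appears as digits glued to a word ("Gliedt3")
NOTE_REF = re.compile(r"(?<=[A-Za-z.,;])(\d+)\b")
NOTE_DEF = re.compile(r"^(\d+)\.?\s+(.*)$", re.S)


def link_note_refs(text):
    """Escape body text and turn glued-on digits into superscript links."""
    # quote=False keeps numeric entities such as &#x27; out of the text,
    # so the digit pattern cannot match inside an entity.
    escaped = html.escape(text, quote=False)
    return NOTE_REF.sub(r'<sup><a href="#note\1">\1</a></sup>', escaped)


def text_to_html(source):
    """Convert a plain-text document into a simple HTML fragment."""
    chunks = re.split(r"\n\s*\n", source.strip())
    # Collapse hard line breaks inside each paragraph.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

    out = []
    in_notes = False
    for index, para in enumerate(paragraphs):
        if index == 0:
            out.append(f"<h1>{html.escape(para)}</h1>")
        elif para.upper() == "ENDNOTES":
            in_notes = True
            out.append("<h2>End Notes</h2>")
        elif in_notes:
            match = NOTE_DEF.match(para)
            if match:
                # Give each note an anchor that the body links point to.
                num, body = match.groups()
                out.append(f'<p id="note{num}">{num}. {html.escape(body)}</p>')
            else:
                out.append(f"<p>{html.escape(para)}</p>")
        else:
            out.append(f"<p>{link_note_refs(para)}</p>")
    return "\n".join(out)
```

Calling text_to_html on the contents of a text file returns the HTML as a single string. The digit pattern is deliberately crude: any word that legitimately ends in a number would be turned into a note link too, so real data would need tighter rules than this.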