Text Editing of Articles in the MoA Digital Library Collection.
The guide given below is a sequence of instructions for editing and
proofing generic MoA OCR text pages into a nicely formatted and
readable HTML webpage.
You first need to get permission from the MoA Digital Library
to post the text online.
- Print out the scanned images of the pages (selecting View as 100%).
Save any pages with illustrations as .gif images.
- Download and save as text the OCR of every page (selecting View as text).
- Use global editing to eliminate multiple blank spaces (text lines are
displayed centered and are generally saved as such).
- Join long lines broken at the end.
Delete any lines on the first and last pages that contain only text
from the previous or subsequent articles.
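For readers who would rather script the blank-space cleanup than use an
editor's global replace, here is a minimal Python sketch. It is only an
illustration, not part of the procedure above; the function name
collapse_spaces is made up, and it assumes the page OCR has been saved
as plain text.

```python
import re


def collapse_spaces(text: str) -> str:
    """Remove centering blanks and squeeze runs of spaces/tabs in each line.

    Hypothetical helper for illustration; the name is not from the guide.
    """
    cleaned = []
    for line in text.splitlines():
        # strip() drops the leading/trailing blanks left by the centered
        # display; the regex squeezes any internal run to a single space.
        cleaned.append(re.sub(r"[ \t]+", " ", line.strip()))
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "      The   quick  brown fox   \n   jumps    over"
    print(collapse_spaces(sample))
```

Note that this also removes the spacing between side-by-side columns, so
on multi-column pages the columns still have to be aligned afterwards as
described in the steps that follow.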
Rectify any multi-column page recognition errors.
This most time-consuming operation is not needed for single-column pages
or for pages which have OCRed properly, since the OCR program will then
already have recognized the multiple columns correctly.
A faster option, if possible, is to re-OCR the page.
- Check the left-hand edge of the first column and edit lines to match
the printed page.
Words ending in "-" at the end of a line in the last column have sometimes
been merged with the beginning of the next line in the first column.
- Add blank spaces in lines to align
the left edge of the second and any subsequent columns.
- Cut the columns vertically and attach each one to the bottom of the
previous column.
Good text editors allow column selection
(e.g., in XEmacs use the Alt key and the cursor to select the
vertical text portion).
If any expert knows how to do this using the MS text editors,
please let me know the explicit instructions so that I can add them
to this document.
- Delete any leftover lines on the first and last pages with text from
the previous or subsequent articles which could not be deleted before
separating the columns.
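If no editor with column selection is at hand, the vertical cut can also
be sketched in a few lines of Python. This is only a rough sketch under
stated assumptions: the columns have already been aligned so that the
second column starts at the same character offset on every line, and the
names cut_columns and split_at are hypothetical. It handles two columns;
for more, the cut would be repeated on the right-hand part.

```python
def cut_columns(lines: list[str], split_at: int) -> list[str]:
    """Cut every line at offset split_at and stack the right column
    underneath the left one.

    Assumes the second column begins at the same offset on each line.
    """
    left = [line[:split_at].rstrip() for line in lines]
    # Lines shorter than split_at simply give an empty right-hand piece.
    right = [line[split_at:].rstrip() for line in lines]
    return left + right


if __name__ == "__main__":
    page = [
        "First column".ljust(16) + "Second column",
        "continues here".ljust(16) + "and ends here",
    ]
    for line in cut_columns(page, 16):
        print(line)
```

A fixed offset keeps the sketch simple; it will cut words in half on any
line where the columns were not aligned first, so the alignment step
above still has to be done by hand.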
- Delete extra lines with title text, page numbers, etc., between pages.
Also edit out the figure captions if the text is illustrated.
- Join lines where a word broken with a "-" at the end of a line was not
rejoined by the OCR software.
- Check against the printout and add HTML paragraph breaks <p>.
- Spell-check to correct most OCR errors.
- Re-wrap or fill lines to get formatted
ASCII text.
- Print out and proofread to correct any remaining errors.
The on-screen image is far clearer than the printed images of the pages.
- Crop any illustrations out of the saved .gif files of the pages and
save them as compressed (say 50%) .jpg images.
- Insert the text in an HTML template file and edit the
HTML to add any images and figure captions to display on the webpage.
Any computer-literate kid (or adult) should be able to figure out
basic HTML in a few minutes by using View - Page Source on a
webpage written in simple HTML.
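The later steps (joining words broken with a "-", re-wrapping, and adding
<p> paragraph breaks) can be roughed out in Python as well. This is a
sketch under assumptions, not a finished tool: paragraphs are assumed to
be separated by blank lines in the edited text, the function names are
made up for illustration, and every line-ending hyphen is treated as a
break inside a word, so a genuine compound such as "well-known" split
across lines would be joined wrongly and has to be caught in proofreading.

```python
import html
import re
import textwrap


def join_hyphenated(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line.

    Assumption: every line-ending hyphen is a word break, which is not
    always true for real compounds.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def to_html_paragraphs(text: str, width: int = 72) -> str:
    """Re-wrap each blank-line-separated paragraph and wrap it in <p> tags."""
    out = []
    for para in re.split(r"\n\s*\n", text):
        if not para.strip():
            continue
        # Collapse the old line breaks, then fill to the requested width.
        filled = textwrap.fill(" ".join(para.split()), width=width)
        # Escape &, < and > so stray OCR characters cannot break the HTML.
        out.append("<p>\n" + html.escape(filled, quote=False) + "\n</p>")
    return "\n".join(out)


if __name__ == "__main__":
    ocr = "An exam-\nple of OCR\ntext.\n\nSecond para-\ngraph."
    print(to_html_paragraphs(join_hyphenated(ocr)))
```

The output can then be pasted into the body of the HTML template; images
and figure captions still have to be added by hand.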
I hope the reader is not brainwashed into thinking that this is a
complex task that requires learning the syntax of the menu-driven
commercial webpage editors of Micro(brained)Software.
I have, however, not convinced myself that it is not faster to type the
text directly than to edit the OCR output, particularly in cases where
the OCR has messed up multiple columns. I hope the instructions above will
help avoid some mistakes which can make the process take longer.
I did one article to be able to write and illustrate the instructions
above, and it took far too long, even with a good editor like Emacs.
A good typist can probably retype the article much faster.