Cleaning up original Burton "Kama Sutra" page scans -- need advice/help

2005-05-25 15:24:54

I'm now working to "clean up" the 182 page images from a recent scan of a very rare and noteworthy public domain book. The cleaned-up scans will be released to the public (such as given to the Internet Archive) for free access. [For those interested, the book is the 1885 second printing of the second edition of Sir Richard F. Burton's "Kama Sutra of Vatsyayana".]

The scans were done at 600 dpi (optical) in 8-bit (256-level) greyscale (there's no color in the book), to capture sufficient fine detail to aid in the cleanup process. Of course, the book was chopped (the binding was falling apart anyway) and each page scanned on a flat-bed, so there's none of the page distortion caused by trying to scan a bound book. There are no illustrations -- it's all black and white text.

I've already deskewed, cropped, centered and size-normalized all 182 pages. (For those interested, links to two sample partially-cleaned pages are given below.)

In the cleanup process, I'd like to convert what I now have into 600-dpi *bitonal* (black and white) with uniform and nicely readable character density, removal of "pepper", cleanup of larger blotches, etc. I recognize there will be some handwork required, particularly to remove larger "pepper" and blotches, and repair a few characters, etc., but of course want to minimize handwork.

[Note that the purpose of the cleanup is for direct human use of the scans, and not solely for OCR purposes, which don't require the planned level of cleanup. For example, I plan to produce a DjVu version for direct reading. For those who will probably ask, the raw page scans have already been uploaded to Distributed Proofreaders for conversion to structured digital text.]

Unfortunately, what complicates the clean-up process is that the original book is in poor and variable condition. The paper is quite yellowed and darkened, and many pages are quite faded. Were the original in mint condition with good, uniform ink-to-paper contrast, I wouldn't be posting this request for advice. But the overall poor quality and page-to-page variation is taxing my graphics abilities to produce a clean finished product with reasonably readable and uniform character density (at 600-dpi bitonal.)
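One common way to cope with page-to-page variation like this is to pick the black/white cutoff separately for each page from its own histogram, for example with Otsu's method. The sketch below is only an illustration in plain Python, not something from the original post: the function names `otsu_threshold` and `to_bitonal` are made up for the example, and it assumes each page is already loaded as 8-bit grey values (0 = black, 255 = white).

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates ink from paper.

    `pixels` is a flat iterable of 8-bit greyscale values.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0      # pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue            # nothing on the dark side yet
        n_light = total - n_dark
        if n_light == 0:
            break               # everything is on the dark side
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor).
        between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def to_bitonal(page, threshold):
    """Map a 2-D list of grey values to 0 (ink) / 1 (paper)."""
    return [[0 if p <= threshold else 1 for p in row] for row in page]


# A toy "faded page": weak ink at grey level 90 on darkened paper at 160.
faded = [[160, 160, 90, 160], [160, 90, 90, 160]]
cutoff = otsu_threshold(p for row in faded for p in row)
bitonal = to_bitonal(faded, cutoff)
```

Because the cutoff is recomputed per page, a faded page and a well-inked page each get their own threshold. Pages with uneven yellowing across a single sheet may still need a local (per-region) threshold, which this sketch does not attempt.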

Here are two sample pages, each about 4.5 megs in size (2550x3900 greyscale):

http://www.openreader.org/kamasutra/page031.png (good condition)
http://www.openreader.org/kamasutra/page106.png (poor condition)

I would assume that others have had similar needs and have come up with various processing tricks and even built special tools to aid in the clean-up process (e.g., how to auto-remove small "pepper", the black spots one to a few pixels wide on the white background?). I look forward to your advice and even help if you are interested (I will upload all the partially-cleaned images somewhere if you want to help with the actual clean-up process -- the whole set of images totals 680 megs.)
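For the "pepper" question specifically, one standard approach on a bitonal image is to find each connected blob of black pixels and erase the blobs that are too small to be part of a character. The following is a minimal sketch in plain Python under stated assumptions: the page is a 2-D list with 0 for ink and 1 for paper, and the function name `remove_pepper` and the `max_speck` parameter are hypothetical, chosen for this example only.

```python
from collections import deque


def remove_pepper(page, max_speck=4):
    """Return a copy of a bitonal page (0 = ink, 1 = paper) in which every
    8-connected ink blob of at most `max_speck` pixels is turned to paper."""
    height, width = len(page), len(page[0])
    out = [row[:] for row in page]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if page[y][x] != 0 or seen[y][x]:
                continue
            # Flood-fill to collect one connected blob of ink.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if page[ny][nx] == 0 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Small blobs are treated as pepper and erased.
            if len(blob) <= max_speck:
                for by, bx in blob:
                    out[by][bx] = 1
    return out


# Toy page: a single-pixel speck at (1, 1) and a 3x3 "character" blob.
toy = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
cleaned = remove_pepper(toy, max_speck=4)
```

The size limit has to stay well below the smallest real mark on the page (periods, dots on the letter i, and so on), so it is worth measuring a few of those first. Larger blotches would still need handwork, since a size test alone cannot tell them apart from punctuation.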

[As a final note, I use Paint Shop Pro 9, but do not have Photoshop. But since PSP9 is fairly powerful, I assume that many, if not all, recommended Photoshop processes will map over to PSP9.]

Thanks!

Jon Noring

@Lorem Ipsum

2005-05-25 16:18:14

"Jon" wrote in message

I'm now working to "clean up" the 182 page images from a recent scan of a very rare and noteworthy public domain book. The cleaned-up scans will be released to the public (such as given to the Internet Archive) for free access.

I only looked at the 'poor' example, page 106, which is all text, so I'll address that one: when we have exactly those cases, we use OCR unless there is some historical significance to the typeface (aka: font). That way you get perfect type. For illustrations, well, you would have to show us one. Your server is pretty slow, so downloading 9 MB was discouraging enough that I'm moving on.

Rendering scanned text clearly is not a job for an image-processing program.

@David Littlewood

2005-05-27 14:06:08

Lorem Ipsum writes:

"Jon" wrote in message

I'm now working to "clean up" the 182 page images from a recent scan of a very rare and noteworthy public domain book. The cleaned-up scans will be released to the public (such as given to the Internet Archive) for free access.

I only looked at the 'poor' example, page 106, which is all text, so I'll address that one: when we have exactly those cases, we use OCR unless there is some historical significance to the typeface (aka: font). That way you get perfect type. For illustrations, well, you would have to show us one. Your server is pretty slow, so downloading 9 MB was discouraging enough that I'm moving on.

Rendering scanned text clearly is not a job for an image-processing program.

True; however, I did download both, and found that a very simple increase in contrast in Photoshop (+75% for the "poor" image and +50% for the "good") gave perfectly readable images. It didn't remove the small blemishes, but I did not find them obtrusive. Saved as best-quality JPEGs, they took up only 220-350 KB each. I would imagine that for distribution a PDF file would be the most suitable. I'm certainly no Photoshop expert, but it took me about 1 minute each.
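For anyone without Photoshop, a rough equivalent of a contrast boost is a linear "levels"-style stretch of the grey values. The sketch below is an assumption-laden illustration in plain Python, not a reproduction of Photoshop's contrast slider: the function name `stretch_contrast` and the black/white point values are hypothetical and would need to be chosen per page.

```python
def stretch_contrast(page, black_point, white_point):
    """Linearly remap grey values so that black_point maps to 0 and
    white_point maps to 255, clipping anything outside that range."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    span = white_point - black_point
    out = []
    for row in page:
        new_row = []
        for p in row:
            v = round((p - black_point) * 255 / span)
            new_row.append(max(0, min(255, v)))  # clip to 0..255
        out.append(new_row)
    return out


# Toy row from a faded page: ink near 90, paper near 160.
row_in = [[80, 90, 120, 160, 200]]
row_out = stretch_contrast(row_in, black_point=90, white_point=160)
```

Values at or below the black point become pure black and values at or above the white point become pure white, which is why faint ink on darkened paper reads more clearly afterwards. As noted above, this does nothing about small blemishes.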

The trouble with OCR is that you will have to spend many days proof reading the output - and even then (if my experience is anything to go by) you won't catch all the silly errors.

David
--
David Littlewood
