How to digitise your paper documents
Scan your documents into electronic format
By Jon L. Jacobi | PC World | Published: 16:06, 02 March 2011
The space required to store paper documents can be a problem. Digitising your documents renders them exquisitely portable: you can store an entire library on your ebook reader with ease. And because paper documents can be turned into editable computer documents, they become searchable.
Compare typing "Roosevelt" in a search field with spending all day scanning microfiche and old newspapers by eye to research the Square Deal or the New Deal. The digital document is a boon to researchers the world over.
You can store documents digitally in one of two ways: as images or as text files. Images require far more space, but retain the character and flavour of the original document. Converting a scanned image to a text or word processing file involves what's called optical character recognition, or OCR. It's a bit of a misnomer, since you're actually processing digital information, but the term has stuck.
If the original document was written by hand or is art, storing it as an image is generally more desirable: the style of the handwriting can be as meaningful as the words themselves. The other reason for storing handwritten documents as images is that there are no commercially available handwriting recognition packages that can interpret handwritten characters from scans. So far, it's a technology stuck in the PDA and tablet world.
Anne-Sophie Bellaud of Vision Objects (a purveyor of handwriting recognition software) explains that with tablets you know the order in which hand-printed or -scripted characters were entered. This provides huge clues for the software. Without an entry timeline, handwriting is not nearly as easy to recognise.
No matter which way you'll be storing your documents, as images or as text files, you'll need a scanner to digitise them. If you have relatively few documents to process, a multifunction printer or a dedicated flatbed scanner such as those discussed in "Digitise Your Pictures" will suffice. They're relatively slow, however, and only the more expensive models have automatic document feeders to handle multipage documents.
Though pricey, sheet-fed scanners are just the ticket if you need to process a lot of documents. Units such as Fujitsu's ScanSnap S1500 and HP's ScanJet Professional 3000 scan both sides of a document at once and average 20 pages per minute or better.
I'll give the HP props for slightly more reliable paper feeding with mixed document types, but the Fujitsu has the superior, better-integrated software.
Most scanners ship with OCR software that you can install on your PC, but if yours lacks it, you can buy the software separately. ABBYY's FineReader 9 Express, Nuance's OmniPage 17 Standard and Adobe's Acrobat X Standard are all good choices. Nuance's PaperPort 12 Standard also scans, does OCR and adds document management features that make it easier to keep track of your documents. Less expensive versions exist for most of these programs, so slow your heart rate.
In my hands-on tests with clean 300-dpi scans, Acrobat did the best job of converting documents, followed closely by FineReader, and not so closely by OmniPage and PaperPort. But the latter three products did better with the three low-quality, 150-dpi scans that I included among my test documents.
For documents stored as images, 150 to 200 dpi is usually fine, but OCR software works much better with 300-dpi scans. Much depends on your needs. If you just want to retain legibility, you may be able to drop the dpi and reduce your storage requirements.
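To get a feel for how resolution drives storage, here is a rough Python sketch that works out the pixel dimensions and uncompressed size of a scanned page. It assumes a US letter page (8.5 by 11 inches) and 24-bit colour at 3 bytes per pixel; the function names are made up for illustration, and real files will be smaller once compressed.

```python
# Assumption: a US letter page, 8.5 x 11 inches.
LETTER_INCHES = (8.5, 11.0)


def page_pixels(dpi, size_inches=LETTER_INCHES):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    width_in, height_in = size_inches
    return round(width_in * dpi), round(height_in * dpi)


def raw_bytes(dpi, bytes_per_pixel=3, size_inches=LETTER_INCHES):
    """Uncompressed size in bytes, assuming 24-bit colour (3 bytes per pixel)."""
    width_px, height_px = page_pixels(dpi, size_inches)
    return width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    for dpi in (150, 200, 300):
        width_px, height_px = page_pixels(dpi)
        megabytes = raw_bytes(dpi) / 1e6
        print(f"{dpi} dpi: {width_px} x {height_px} px, "
              f"about {megabytes:.1f} MB uncompressed")
```

Because pixel count grows with the square of the resolution, halving the dpi from 300 to 150 cuts the raw data to a quarter.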
Several online services, such as www.free-ocr.com, www.newocr.com and www.ocronline.com, are good for small-scale projects or one-offs. First you scan the original to your PC, then upload the document to the website.
The services have limitations: My tests yielded results that weren't very accurate. Also, only text is recognised, not lines and other page elements.
The first service mentioned above, Free OCR, is free, but files can be no larger than 2MB and no wider or higher than 5000 pixels (about 150 dpi for a letter-sized page), and you can do no more than 10 uploads per hour.
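If you are preparing a batch of scans, it can save time to check each one against those limits before uploading. The Python sketch below is only an illustration: the function name is hypothetical, "2MB" is assumed to mean 2 x 1024 x 1024 bytes, and you supply the file size and pixel dimensions yourself.

```python
# Limits as quoted in the article for Free OCR.
# Assumption: "2MB" is read as 2 * 1024 * 1024 bytes.
FREE_OCR_MAX_BYTES = 2 * 1024 * 1024
FREE_OCR_MAX_SIDE_PX = 5000


def upload_problems(size_bytes, width_px, height_px,
                    max_bytes=FREE_OCR_MAX_BYTES,
                    max_side_px=FREE_OCR_MAX_SIDE_PX):
    """Return a list of reasons the scan would be rejected (empty if none)."""
    problems = []
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes, limit is {max_bytes}")
    if max(width_px, height_px) > max_side_px:
        problems.append(
            f"longest side is {max(width_px, height_px)} px, "
            f"limit is {max_side_px}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical scan: a 1.5 MB file measuring 1275 x 1650 pixels.
    issues = upload_problems(1_500_000, 1275, 1650)
    print("OK to upload" if not issues else "; ".join(issues))
```

The same function covers the other services if you pass in their limits through `max_bytes` and `max_side_px`.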
Another service, www.newocr.com, is also free, but the interface is primitive. It does a much better job, though, of pulling text than free-ocr.com, and it allows documents up to 5MB in size.
Finally, www.ocronline.com requires creating a free account, but allows 4MB images (about 200 dpi per page) and up to 15 uploads per hour. You get 10 free credits, but after that you must pay for them. The site sells credits in varying quantities, from 50 credits for $3.95 (8 cents per page) up to 5000 credits for $49.95 (1 cent per page). I got good results with this service, which handles graphic elements as well as text, though it wasn't up to the standards of Acrobat X or FineReader 10.
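The per-page figures are simple division, and a few lines of Python make it easy to compare the two bundles quoted above. This sketch assumes one credit buys one page, which is what the per-page prices imply; the function name is invented for the example.

```python
def cents_per_page(price_dollars, pages):
    """Cost of one page in cents, assuming one credit buys one page."""
    return price_dollars / pages * 100


if __name__ == "__main__":
    # The two bundles quoted in the article: (pages, price in dollars).
    for pages, price in ((50, 3.95), (5000, 49.95)):
        print(f"{pages} pages for ${price:.2f}: "
              f"{cents_per_page(price, pages):.1f} cents per page")
```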
There's nothing like the feel, smell and visual stability of a real book, but more and more people are happily reading virtual books using Kindles, Nooks, iPads and other devices. You simply can't beat their portability, and the texts are searchable.
It's even possible to have a decent reading experience on smartphones and iPods. I use the latter and no, the frequent page-turning does not bother me, though I'll undoubtedly go for something larger eventually. You can purchase most books from an online store, but you may have some books in your own collection that aren't available in digital format.
Converting a physical book into an ebook requires first scanning it page by page and then, for lack of a better term, OCR'ing it. This is tedious at best, so use a fast scanner. If you are willing to destroy the book, or know how to rebind, use a sheet-fed scanner. Most of the aforementioned OCR programs have features that help organise the pages.
Once you have the text file (in PDF, Word or other format) in place, grab Calibre, a very capable and free ebook reader, organiser, editor and publisher. Convert the file to the format appropriate for your device: EPUB or PDF, say. Once you've created a viewable file, use a reader app such as Stanza to load the ebook onto your device. Your device or app must support side-loading, that is, loading from a PC.
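If you have many files to convert, Calibre also installs a command-line tool, ebook-convert, which takes an input file and an output file and picks the formats from their extensions. The Python sketch below wraps that call. It assumes Calibre is installed and that ebook-convert is on your PATH; the helper names and the file name book.docx are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_ext="epub"):
    """Build the ebook-convert command line for one file.

    The output file sits next to the source, with the new extension.
    """
    source = Path(source)
    target = source.with_suffix("." + target_ext.lstrip("."))
    return ["ebook-convert", str(source), str(target)]


def convert(source, target_ext="epub"):
    """Run the conversion, failing early if Calibre's tool is missing."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError(
            "ebook-convert not found; is Calibre installed and on your PATH?"
        )
    subprocess.run(build_convert_command(source, target_ext), check=True)


if __name__ == "__main__":
    # Hypothetical file name; only prints the command that would be run.
    print(" ".join(build_convert_command("book.docx")))
```

Calling convert("book.docx") would then write book.epub alongside the original, ready to side-load.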