Welcome to the MacNN Forums.

Advice needed for OCR applications on OS X
homgran
Junior Member
Join Date: Oct 2003
Location: UK
Status: Offline
May 15, 2008, 01:52 PM
 
Hey guys,

It's been many years since I've actually used an Optical Character Recognition (OCR) package - we're talking 1998 and an early version of OmniPage. Have things moved on since then?

From about an hour of Googling, I can see three dedicated professional offerings: OmniPage, ReadIris and ABBYY FineReader. However, the OS X versions of OmniPage and ReadIris have received pretty poor reviews (the OS X port of OmniPage, for example, doesn't even use its latest engine and is prone to frequent crashes), whilst FineReader is unavailable for OS X.

It's also (from what I can tell) pretty slim pickings on the freeware side of things. The most promising project is Google's Tesseract, which doesn't appear to be ready for prime time. The fact that it's command-line only doesn't bother me in the slightest. In fact, I often prefer command-line applications as they generally offer greater flexibility - especially when trying to execute large batch jobs.

Finally, Adobe Acrobat Professional offers its own OCR feature. I like the sound of the "Searchable Image" pdf format, which maintains the original [scanned] image but contains a hidden layer of text, thus making the image "searchable". I believe OmniPage also offers this feature.

Can anyone share their recent experiences with OCR applications on OS X? Ideally, I would like to be able to convert a very large collection of scanned documents to the searchable image pdf format. Sort of like my own personal "Google Books". Though this may be wishful thinking at the moment...

Many thanks,

-Matt
     
rehoot
Dedicated MacNNer
Join Date: Nov 2005
Status: Offline
May 18, 2008, 09:58 AM
 
I don't know about the others, but I have been using ReadIris, and I don't like it very much. I was scanning PDFs that I made with LaTeX (I have since switched to a new procedure that bypasses ReadIris) -- so the document has none of the imperfections of a paper scan, but it makes plenty of recognition errors regardless of the font that I use. Errors in italic text and bibliographies are the worst (and every one of my documents has these), but if it is plain text with no italics or funny mixtures of numbers/characters/words, it is not so terrible. If it is really stumped, it will prompt you to enter the letters that correspond to what it sees, but it doesn't always match subsequent letters to what you enter--and if you enter the wrong letters, the whole run is whacked.

ReadIris is quicker than retyping, but if the output has to be correct, it will still take lots of time to proofread. There are different options for formatting, but the one that preserves the original page layout puts each page in a frame so that you can't really edit the resulting RTF file/Word document (other than to tweak each individual page a little bit).
Mac Pro Quad: 2.66GHz; 4 GB Ram; 4x500GB drives; Radeon X1900, 23" Cinema Screen, APC UPS
PowerBook G4: 1.33GHz; 768MB Ram; 60GB drive
     
red rocket
Mac Elite
Join Date: Mar 2002
Status: Offline
May 19, 2008, 03:13 AM
 
Readiris also has this annoying 50-page limit, which may be irrelevant for paper scans but throws a spanner in the works if you want to OCR longer pdf books.
     
rehoot
Dedicated MacNNer
Join Date: Nov 2005
Status: Offline
May 19, 2008, 02:54 PM
 
Yes, I use pdftk to split big files into 50-page chunks and run them separately.
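
For anyone who wants to script the split, here is a minimal Python sketch rather than a finished tool. It assumes pdftk is on the PATH and that you already know the page count (pdftk's dump_data output includes a NumberOfPages line). The function names and the _partNNN output naming are made up for illustration; the only pdftk syntax relied on is "pdftk IN cat A-B output OUT".

```python
import subprocess
from pathlib import Path


def chunk_ranges(total_pages, chunk_size=50):
    """Return inclusive (first, last) page ranges covering 1..total_pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def pdftk_split_commands(pdf_path, total_pages, chunk_size=50):
    """Build one 'pdftk IN cat A-B output OUT' command per chunk."""
    src = Path(pdf_path)
    commands = []
    ranges = chunk_ranges(total_pages, chunk_size)
    for index, (first, last) in enumerate(ranges, start=1):
        # Hypothetical naming scheme: book.pdf -> book_part001.pdf, ...
        out = src.with_name(f"{src.stem}_part{index:03d}.pdf")
        commands.append(
            ["pdftk", str(src), "cat", f"{first}-{last}", "output", str(out)]
        )
    return commands


def split_pdf(pdf_path, total_pages, chunk_size=50):
    """Run the split; this step needs pdftk installed."""
    for command in pdftk_split_commands(pdf_path, total_pages, chunk_size):
        subprocess.run(command, check=True)
```

Calling pdftk_split_commands("book.pdf", 120) gives three commands covering pages 1-50, 51-100 and 101-120, which you can inspect before handing them to split_pdf.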
     
homgran  (op)
Junior Member
Join Date: Oct 2003
Location: UK
Status: Offline
May 23, 2008, 08:12 AM
 
Hey guys, thanks for the feedback.

Sounds like ReadIris might not be the best - which is a pity since, according to Wikipedia, Acrobat Professional 8 employs the ReadIris OCR engine. I do have a number of mathematical documents which were created using LaTeX, and one author in particular has a habit of using italics to emphasise any key words. However, in general the text is very clean.

The page-limit could have been a bit of a problem in the event I wanted to scan larger documents (or magazines), so thanks for the work-around, rehoot.

I'll keep ReadIris in mind - I really appreciate the responses.

Does anyone have any experience with the most recent edition of OmniPage?

Cheers,

-Matt
     
rehoot
Dedicated MacNNer
Join Date: Nov 2005
Status: Offline
May 23, 2008, 08:39 AM
 
Originally Posted by homgran View Post
... I do have a number of mathematical documents which were created using LaTeX, and one author in particular has a habit of using italics to emphasise any key words. However, in general the text is very clean.
For math, you might scan the equations as graphics as opposed to characters, especially if there are lots of symbols and things. If you have PDFs, you could email one and I'll OCR it and show you what the output is.
     
homgran  (op)
Junior Member
Join Date: Oct 2003
Location: UK
Status: Offline
May 23, 2008, 11:54 AM
 
Originally Posted by rehoot View Post
If you have PDFs, you could email one and I'll OCR it and show you what the output is.
That would be fantastic. I have uploaded a 9-page snippet (containing a fair bit of mathematical annotation, and one example of italics text) here and, just for fun, a two-page scan of a fifteen-year-old magazine article here.

I'd be very interested to see how the OCR performs on each of these tasks. Intuitively, I would imagine the nine-page document to perform quite well, but the magazine scan to cause quite a few problems (if it works at all - since there are some extreme cases of "text over coloured-gradients").

Thanks for your help!

-Matt
     
rehoot
Dedicated MacNNer
Join Date: Nov 2005
Status: Offline
May 23, 2008, 01:48 PM
 
Originally Posted by homgran View Post
That would be fantastic. I have uploaded a 9-page snippet (containing a fair bit of mathematical annotation, and one example of italics text) here and, just for fun, a two-page scan of a fifteen-year-old magazine article here.

There are two answers:
1) OCR is terrible for this kind of input. There are many math symbols and Greek characters on the same line as English characters.
2) If you have the LaTeX source and a working LaTeX install, download tex4ht and process the document to produce HTML (it makes a .html file and a .css file that contains the styles). Then view it with your web browser, select the body of the HTML document, copy it, and paste it into Word. tex4ht will not draw graphics, but it will properly capture the math equations by making them images. You might have to redefine some graphics-related macros to be blank to avoid errors from tex4ht. My hyperref package didn't work well when I compiled with tex4ht, even after adding the option: \usepackage[tex4ht]{hyperref}, so I redefined \url{} to be:
\def\url#1{\verb|#1|}


When you compile your LaTeX document with tex4ht... compile with latex instead of pdflatex.
You might also need to remove any hard-coded options for graphicx: (don't do: \usepackage[pdftex]{graphicx} ... maybe try blank or use "tex4ht")
The normal latex command should compile it, but after running latex you could also run the command:

htlatex YourFileNameNoExtension "xhtml,oofice" -cmozhtf -coo

then copy the .css and .html files together into their own directory if you need to move it.
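
If you do this often, the copy step can be scripted. The following is only a rough Python sketch: the function name and its arguments are invented here, and it assumes the equation images are PNG files that sit next to the HTML and start with the same base name, which may not match every tex4ht setup.

```python
import shutil
from pathlib import Path


def collect_web_output(build_dir, basename, dest_dir, image_glob=None):
    """Copy basename.html, basename.css and the equation images to dest_dir.

    Returns the names of the files that were actually copied.
    """
    build = Path(build_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    wanted = [build / f"{basename}.html", build / f"{basename}.css"]
    # Assumption: images are PNGs named after the document (basename*.png).
    pattern = image_glob if image_glob is not None else f"{basename}*.png"
    wanted.extend(sorted(build.glob(pattern)))

    copied = []
    for path in wanted:
        if path.is_file():  # silently skip anything htlatex did not produce
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied
```

Pass a different image_glob if your images use another extension, and check the returned list against what the browser actually loads.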

Now here is the first test result from Readiris Pro 11 (it looks bad):
ocr_test_math_ann02ManualGraphicDetection.rtf - OCR, ReadIris, 11, example, latex, document

I converted one page after manually drawing graphics zones:

ocr_test_math_ann02ManualGraphicDetection.doc - OCR, ReadIris, 11, example, latex, document, manually, drawn, graphics, zones

In the second example I manually identified "text zones" and "graphics zones" so that the OCR knows when to interpret text and when to just snap a picture. The layout looks good in MS Word, but is messed up in OpenOffice 3 Beta for Mac. It was a pain making the second test because when there are symbols mixed with text I had to draw a text zone for the paragraph above the symbol, draw a text zone for the paragraph below the symbol, draw a text zone for the paragraph to the left of the symbol, draw a text zone for the text to the right of the symbol, and draw a "graphics zone" for the symbol (or symbols) itself.
( Last edited by rehoot; May 23, 2008 at 02:51 PM. )
     
rehoot
Dedicated MacNNer
Join Date: Nov 2005
Status: Offline
May 23, 2008, 02:42 PM
 
I posted test results for the LaTeX doc above, and the test on the magazine is still running on my old G4 PowerBook--for about 40 minutes so far. I expect the output to be terrible because OCR gets confused when there are mixed colors in the background. There are OCR programs that can read this. I looked through these pages once:

OCR Technologies | Homepage

and I think they can process all kinds of OCR inputs, but it costs about $100,000.
     
homgran  (op)
Junior Member
Join Date: Oct 2003
Location: UK
Status: Offline
May 23, 2008, 03:34 PM
 
Originally Posted by rehoot View Post
There are two answers:
1) OCR is terrible for this kind of input. There are many math symbols and Greek characters on the same line as English characters.
To be honest, I'm very impressed with the output. I didn't really expect the OCR to pick up any of the symbols, but it did a fantastic job of recognising the text. My main objective would be to make such pdf documents "searchable" (whilst maintaining the original image), and this ReadIris looks as though it would do a fine job of that!

Originally Posted by rehoot View Post
2) If you have the LaTeX source and a working LaTeX install, download tex4ht and process the document to produce html (it makes a .html file and a .css file that contains the styles) and then view it with your web browser, select the body of the html document, copy it, and paste the resulting web page into Word. tex4ht will not draw graphics, but it will properly capture the math equations by making them images. You might have to redefine some graphics-related macros to be blank to avoid errors from tex4ht. My hyperref package didn't work well when I compiled tex4ht, even after adding the option: \usepackage[tex4ht]{hyperref}, so I redefined \url{} to be:
\def\url#1{\verb|#1|}


When you compile your LaTeX document with tex4ht... compile with latex instead of pdflatex, then run tex4ht.
You might also need to remove any hard-coded options for graphicx: (don't do: \usepackage[pdftex]{graphicx} ... maybe try blank or use "tex4ht")
then TRY THE COMMAND:

htlatex YourFileNameNoExtension "xhtml,oofice" -cmozhtf -coo

then copy the .css and .html files together into their own directory if you need to move it.
That's pretty neat - I had wondered how you might go about using LaTeX to generate a web page, but I'd never actually looked into it. I'll definitely make a note of this tip.

I retain the sources for all of my LaTeX papers, and my pdfs are searchable upon creation. However, there are always some pdfs that aren't searchable (because each page is an image), and this is where I would intend to use OCR.

Originally Posted by rehoot View Post
This one looks to be perfectly suitable for my purposes. At first glance, the actual text has been correctly recognised. The mathematical annotation is essentially unusable, but that wouldn't really be an issue for me since my objective is to search the text. For example, searching for "angular momentum" in the main text should get me close enough to the appropriate equation.

Originally Posted by rehoot View Post
I converted one page after manually drawing graphics zones:

ocr_test_math_ann02ManualGraphicDetection.doc - OCR, ReadIris, 11, example, latex, document, manually, drawn, graphics, zones

In the second example I manually identified "text zones" and "graphics zones" so that the OCR knows when to interpret text and when to just snap a picture. The layout looks good in MS Word, but is messed up in OpenOffice 3 Beta for Mac. It was a pain making the second test because when there are symbols mixed with text I had to draw a text zone for the paragraph above the symbol, draw a text zone for the paragraph below the symbol, draw a text zone for the paragraph to the left of the symbol, draw a text zone for the text to the right of the symbol, and draw a "graphics zone" for the symbol (or symbols) itself.
This one sounds like a lot of manual intervention, but I don't think the benefits are necessarily worth it. To go through an entire document, selecting all of the text along the way, would be very tedious - especially for long documents. Furthermore, each page would need to be laid out manually.

Thanks for going to so much effort, you've been a real help!

-Matt
     
homgran  (op)
Junior Member
Join Date: Oct 2003
Location: UK
Status: Offline
May 23, 2008, 03:46 PM
 
Originally Posted by rehoot View Post
I posted test results for the LaTeX doc above, and the test on the magazine is still running on my old G4 PowerBook--for about 40 minutes so far. I expect the output to be terrible because OCR gets confused when there are mixed colors in the background.
Yeah, I thought that might be the case. I deliberately chose those two pages, as they contain some areas that are somewhat difficult to read anyway (dark purple text on a purple background - yuck). It's certainly taking a long time to churn through - how fast is your G4, out of interest? I'm running a 1GHz TiBook here (soon to purchase a mini, whenever they get updated!).

I'll be interested to find out how it turns out (though if it's taking too long then please don't let it hold up your workflow or anything!).

Originally Posted by rehoot View Post
There are OCR programs that can read this. I looked through these pages once:

OCR Technologies | Homepage

and I think they can process all kinds of OCR inputs, but it costs about $100,000.
Sheesh, that's crazy! The magazine part of this would only be a hobby. For $100,000, they could probably hire a small group of people to transcribe the magazine - no need for computer automation!!

-Matt
     
rehoot
Dedicated MacNNer
Join Date: Nov 2005
Status: Offline
May 23, 2008, 05:32 PM
 
Since I posted other info on tex4ht, here is the last tip to get it to produce good-quality images of the equations. tex4ht uses a program called convert, which is part of ImageMagick. You should download the new version (ImageMagick: Install from Binary Distribution) and then tweak one of the default settings. After tex4ht is installed, look in your LaTeX folder to find the settings file that it uses:
/texmf-dist/tex4ht/base/unix/tex4ht.env

I edited the file and changed "110x110" to "300x300". The original lines began with "Gconvert" or "Ggs", where I think the G is trimmed before the command runs. This changes the default resolution to 300 pixels per inch instead of 110.
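
If you would rather not hand-edit the file, the same change can be sketched in a few lines of Python. This is an illustration under assumptions, not the official way to configure tex4ht: the function name is made up, and it only touches lines that start with Gconvert or Ggs, as described above. Back up tex4ht.env before overwriting it.

```python
def raise_image_resolution(env_text, old="110x110", new="300x300"):
    """Replace old with new, but only on lines starting with Gconvert or Ggs.

    Returns the edited text and the number of lines that were changed.
    """
    updated = []
    changed = 0
    for line in env_text.splitlines(keepends=True):
        if line.startswith(("Gconvert", "Ggs")) and old in line:
            line = line.replace(old, new)
            changed += 1
        updated.append(line)
    return "".join(updated), changed


# Typical use (the path is whatever your own install uses):
#   from pathlib import Path
#   env = Path("tex4ht.env")
#   text, n = raise_image_resolution(env.read_text())
#   env.write_text(text)
```

If the returned count is 0, your tex4ht.env probably uses a different resolution string or a different image converter, and the file is best left alone.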
     
rehoot
Dedicated MacNNer
Join Date: Nov 2005
Status: Offline
May 23, 2008, 05:35 PM
 
The test on the magazine page never finished. I tried it on my new computer and it also locked up. I think that answers a question about how well it works on big images that have busy backgrounds (it might have worked if I did small chunks at a time??). The $100,000 OCR systems are for banks and companies that use them to process thousands of documents per hour. I worked for a company that used one for processing billing information that was written by hand.
     
roderick99
Fresh-Faced Recruit
Join Date: Aug 2008
Status: Offline
Aug 22, 2008, 06:07 PM
 
Just to make sure everyone knows: it is possible to batch OCR from Adobe Acrobat Professional, which ships with the Fujitsu ScanSnap S510M.
     
   
All contents of these forums © 1995-2017 MacNN. All rights reserved.