05
Nov

How we tuned Tesseract to perform as well as a commercial OCR package

Optical Character Recognition (OCR) is a widely used technology for extracting text from scanned documents or camera images.

 

One of our clients gave us a challenging task: to see whether we could improve Tesseract's output. They had been using Tesseract, but were not satisfied with its accuracy. After the steps outlined below, we were able to improve the accuracy by 52%.

 

 

OCR is widely used to convert scanned images of handwritten, typewritten or printed text into machine-encoded text. Many open source and commercial OCR packages are available. One of the best open source packages is Tesseract OCR, which is comparable to commercial OCR software. That is why we consider Tesseract the best option for OCR tasks when relying on open source.

 

For a neatly scanned document, character recognition is easy as pie. But when the input is a receipt captured with a camera, problems appear: overexposure, underexposure, lighting that varies across the image, and worse. On such images even commercial OCR products do not do well. After comparing different OCR packages under these conditions, we found that none of them gave good results on underexposed or overexposed images, and uneven lighting across the image made things worse. Anyone who has photographed a document with a mobile camera will have run into this.

 

The solution we came up with is to preprocess the image with various levels of filtering to make it more OCR friendly. We found a script named Textcleaner, which makes use of ImageMagick. You can experiment with the different options in this script to find the best output.
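As a rough illustration of that preprocessing step, here is a minimal Python sketch that produces several Textcleaner variants of one image and runs Tesseract on each. It is a sketch under assumptions, not the code we used: the Textcleaner flags are copied from the example command quoted in this article, the filter values 10 to 100 in steps of 10 are taken from our reply in the comments, and the helper names (textcleaner_command, make_variants, ocr) are hypothetical. It assumes the textcleaner script sits in the current directory and that the tesseract command is on the PATH.

```python
import subprocess
from pathlib import Path

# Assumed sweep: filter sizes 10, 20, ..., 100 (ten variants per image).
FILTER_SIZES = list(range(10, 101, 10))


def textcleaner_command(src, dst, filter_size, script="./textcleaner"):
    """Build the Textcleaner command line, varying only the -f value."""
    return [script, "-g", "-e", "none", "-f", str(filter_size),
            "-o", "5", str(src), str(dst)]


def make_variants(src, out_dir, filter_sizes=FILTER_SIZES):
    """Run Textcleaner once per filter size and return the output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    variants = []
    for size in filter_sizes:
        dst = out_dir / f"{Path(src).stem}_f{size}.jpg"
        subprocess.run(textcleaner_command(src, dst, size), check=True)
        variants.append(dst)
    return variants


def ocr(image):
    """Run `tesseract <image> <outputbase>` and read back <outputbase>.txt."""
    base = Path(image).with_suffix("")
    subprocess.run(["tesseract", str(image), str(base)], check=True)
    return base.with_suffix(".txt").read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    texts = [ocr(variant) for variant in make_variants("Sample.jpg", "variants")]
    print(len(texts), "OCR outputs produced")
```

Keeping the command builder separate from the code that runs it makes it easy to try other Textcleaner options without touching the rest of the loop.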

What we did:

  • Compared the output of Tesseract with that of a commercial OCR package.
  • Used Textcleaner with different options to enhance the image and make it more OCR friendly. (./textcleaner -g -e none -f 10 -o 5 Sample.jpg out.jpg)
  • Ran OCR after processing with Textcleaner and compared the output across the different Textcleaner versions of the image.
  • Made 10 different versions of the receipt image using Textcleaner with different filter settings, extracted the text from the OCR output of all 10 variants, compared the outputs and derived the desired output using an algorithm and regex.
  • You can rely on different algorithms to choose among the extracted texts, such as:
    1. The one with the most characters
    2. The one with the most numbers
    3. The one with the most characters and numbers
    4. Or run a regex on the OCR outputs of all 10 Textcleaner versions and select the best match.
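The four selection rules above can be sketched in a few lines of Python. This is an illustrative sketch only: it reads "characters" as letters and "numbers" as digits, which is our interpretation for the example, and the PRICE pattern and function names are hypothetical. A real pattern would depend on what you expect to find on the receipt.

```python
import re

# Hypothetical pattern: an amount with two decimals, such as 12.50.
PRICE = r"\d+\.\d{2}"


def count_letters(text):
    """Rule 1: number of alphabetic characters."""
    return sum(ch.isalpha() for ch in text)


def count_digits(text):
    """Rule 2: number of digits."""
    return sum(ch.isdigit() for ch in text)


def count_alnum(text):
    """Rule 3: number of letters and digits combined."""
    return sum(ch.isalnum() for ch in text)


def regex_score(text, pattern):
    """Rule 4: number of non-overlapping matches of the pattern."""
    return len(re.findall(pattern, text))


def pick_best(outputs, score):
    """Return the OCR output with the highest score (first one wins a tie)."""
    return max(outputs, key=score)


if __name__ == "__main__":
    outputs = ["T0TAL 12.5O", "TOTAL 12.50", "TOT 1"]
    print(pick_best(outputs, count_alnum))
    print(pick_best(outputs, lambda text: regex_score(text, PRICE)))
```

Passing the scoring rule in as a function makes it easy to compare the rules on a small hand-checked sample before applying one to a large batch.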

Result:

We were able to get better OCR output using the open source Tesseract. Even though we had to perform extra processing, the end result is comparable to commercial software.

 

A sample of output obtained is compared in the table given below.


5 Comments for this entry

Dan G
May 26th, 2013 on 2:25 pm

Very nice article!
I’m looking to do large-scale OCR on thousands of JPEG scans of varying print quality, and this is exactly the kind of info I was looking for.
What I’m interested in knowing is if you found any one algorithm consistently more successful than the others? And which were the parameters of textcleaner that made more difference?
Because I have to automate tens of thousands of scans, I can’t do any manual string extraction to feed into REGEX matching. Of course I could use a dictionary for REGEX matches, leading to more complex, fuzzy solutions….
Any more advice (or code) appreciated!

    admin
    May 27th, 2013 on 10:34 am

    In our case, the answer was not a single option, but we had to use different options and use a test case to pick up the right option. So, it was like trial and error, for each input!

    The main thing we did was set the filter option in textcleaner. For every image, we ran the filter value from 10 to 100 in increments of 10. No single parameter helped in our case, as we were picking different values for different images. But using these options, we generated enough samples to pick the best ones.

Dan G
May 29th, 2013 on 3:53 pm

Thanks for the response.
I will try the -f 10,20,30…100 as you suggest – and then I will do a round of tests using algorithm 1,2 and 3 from your article – and pick the algorithm which best represents what a human would choose most often. I will then apply that algorithm choice (1,2,3) to the full set of tens of thousands of docs.
Probably it will not be the ideal algorithm in every single case – but it will probably be a lot better than the default Tesseract output in many cases.

(One other thing I was thinking about was applying some fuzzy regex matching to correct mis-spellings or at least give options …. but that’s another problem ;)

sumeet
December 20th, 2013 on 2:30 pm

Can you please provide me sample code(in context of android or java) of above pre-processing.

Nicholas Henry
January 21st, 2014 on 8:22 pm

Very helpful information! Did you vary the filter parameter, or were there other parameters such as sharpamt or bluramt that you found helpful? Thank you for sharing.