Multiple issues: Discussion #915

Kishlay-notabot · 2024-04-12T14:44:05Z

Hi, I have been working with Tesseract OCR in my project for months, and along the way I've noticed several odd behaviors that have perplexed me. I'd like to understand why they happen.
To provide context for the errors, I need to link 2 resources:

In the above repository, the code crops out the individual word images detected in the input image file, using the canvas API.
These are the errors I want to focus on:

bbox location inaccurate.

Even when text in the input image gets recognized, the bbox data does not contain it. The box is out of bounds, so there is no relation between the detected text and the cropped image.

Confidence number [inaccurate?]

Sometimes, when the recognized text matches the actual handwritten text exactly, the confidence number is weirdly reported as 0.
You can check the number of "0.00"s in this json file, but that won't give an exact measure of how often this error happens, since a 0 can also come from an actual mismatch.

These 2 issues have made me scratch my brain from the inside. What can be the possible reasons for this, and can it be fixed by any tweaks?

Balearica · 2024-04-15T07:20:39Z

Maintainer

Regarding the first issue, about bounding boxes, it is unclear to me whether you mean that (1) the text is identified correctly but the bounding box misses the entire word, or (2) incorrect text is being identified in a region where no word exists.

Regarding the second question, while I'm not sure why certain words are assigned 0, I can confirm that the confidence metrics reported by Tesseract are essentially useless at the level of individual words. This is unfortunately not fixable, and is not even unique to Tesseract. I benchmarked Abbyy confidence metrics at one point and found that the vast majority of low-confidence words were correct, and the vast majority of incorrect words were high-confidence.

Metrics from OCR engines can be useful at a less granular level. For example, a page with an average confidence of 0.95 will be significantly higher-quality than a page with an average confidence of 0.80. However, I don't think accurate metrics are possible at the word level. None of these programs has a robust way to evaluate itself, so the confidence metrics are built from internal metrics of the recognition process.
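The page-level averaging described above can be sketched as follows. This is a minimal illustration, not part of the library: `meanConfidence` is a hypothetical helper, and it assumes each word object carries a numeric `confidence` field on whatever scale the engine reports.

```javascript
// Hypothetical helper: average the per-word confidence values of one page.
// Assumes `words` is an array of objects with a numeric `confidence` field.
function meanConfidence(words) {
  // An empty page has no meaningful average, so signal that with null.
  if (words.length === 0) return null;
  const total = words.reduce((sum, w) => sum + w.confidence, 0);
  return total / words.length;
}
```

The result is only meant for comparing pages against each other, for example to sort pages by likely quality. It does not say which individual words are wrong.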

4 replies

Kishlay-notabot Apr 15, 2024
Author

Ok, so issue (1) is actually a misplaced location for the detected word. Say I have an image with the word "dog" in it: the engine does recognise it, but when I crop the word out of the parent image using the associated bbox data, the crop turns out to be horribly out of place. I haven't done any dedicated testing on it, but I'll try recreating it with some custom samples.
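One thing that may be worth ruling out when a crop lands in the wrong place is a coordinate mismatch between the image given to the OCR engine and the image being cropped (for example, if one of them was resized). The sketch below is not a diagnosis of this case. It assumes the bbox has the shape `{ x0, y0, x1, y1 }` in pixels, and `bboxToCropRect` and its `scale` parameter are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: convert a word bbox into a source rectangle for
// ctx.drawImage(img, sx, sy, sw, sh, 0, 0, sw, sh).
// Assumption: bbox = { x0, y0, x1, y1 } in pixels of the image that was OCR'd.
// scale = (size of the image being cropped) / (size of the image given to OCR).
function bboxToCropRect(bbox, imgWidth, imgHeight, scale = 1) {
  // Rescale into the cropped image's coordinates, then clamp to its bounds.
  const x0 = Math.max(0, Math.floor(bbox.x0 * scale));
  const y0 = Math.max(0, Math.floor(bbox.y0 * scale));
  const x1 = Math.min(imgWidth, Math.ceil(bbox.x1 * scale));
  const y1 = Math.min(imgHeight, Math.ceil(bbox.y1 * scale));
  const sw = x1 - x0;
  const sh = y1 - y0;
  // A non-positive size means the box lies entirely outside the image.
  if (sw <= 0 || sh <= 0) return null;
  return { sx: x0, sy: y0, sw, sh };
}
```

A `null` result flags boxes that fall outside the image entirely, which makes out-of-bounds boxes easy to count separately from boxes that are merely misplaced.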

Kishlay-notabot Apr 21, 2024
Author

Also, when I run word-level detection on handwritten samples, about 90 percent of the results are inaccurate and 10 percent are spot on.
Would the accuracy change if the scope were zoomed out, as in detecting a whole block of text at once?
You can see live examples for yourself at dcda.io. Most of the images you will see there do not match the guessed word.
Sorry, I forgot to mention that the site uses a non-English dataset.

Balearica Apr 21, 2024
Maintainer

There are a number of mechanisms that can cause recognition results to change when recognizing a block of text versus a single word. However, the changes will not necessarily be large in magnitude, and may not all be improvements. I have encountered cases where a word is not recognized correctly when running on the entire page, yet the same word is recognized correctly when run independently. Therefore, the only way to see whether this would be productive in a particular case is to try it.

Kishlay-notabot Apr 22, 2024
Author

Oh okay, so I think it is time to do some more tests, again. 🔢 Thanks.
While developing the scripts I made to crop out individual words, I encountered nearly 20K images that were just unusable, out of around 110K detected words. To filter out bad outputs, I added some thresholds, like a minimum image resolution [20x20] and a minimum aspect ratio of word length to width. The recognition was being done on images of handwritten text on ruled sheets, so the horizontal lines were a big source of false detections.
I am still not sure why issue 1 is so prominent in the OCR results.
In the image below, the bbox data sometimes even corresponds to a bunch of words. I know that you alone cannot state the reason for all these anomalies, but it is definitely worth leaving in the public space so that we can ponder it.
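The thresholds mentioned above could look roughly like the sketch below. Only the 20x20 minimum comes from the comment itself. The reading of "aspect ratio of word length to width" as pixels of width per character, the value `MIN_PX_PER_CHAR`, the function name `isUsableWord`, and the word shape `{ text, bbox: { x0, y0, x1, y1 } }` are all assumptions made for illustration.

```javascript
// From the comment above: discard crops smaller than 20x20 pixels.
const MIN_SIDE_PX = 20;
// Assumed value and assumed interpretation: at least this many pixels of
// width per recognized character.
const MIN_PX_PER_CHAR = 5;

// Hypothetical filter: returns true if a detected word looks usable.
// Assumes word = { text, bbox: { x0, y0, x1, y1 } } with pixel coordinates.
function isUsableWord(word) {
  const width = word.bbox.x1 - word.bbox.x0;
  const height = word.bbox.y1 - word.bbox.y0;
  if (width < MIN_SIDE_PX || height < MIN_SIDE_PX) return false;
  const chars = word.text.trim().length;
  // Whitespace-only text carries no word to keep.
  if (chars === 0) return false;
  return width / chars >= MIN_PX_PER_CHAR;
}
```

It would be applied as `words.filter(isUsableWord)` before cropping. Both constants would need tuning against real samples.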

Here's the image, which contains a bunch of words even though the program was actually run with word-level detection.
